• 1. Teradata数据库简介Teradata数据仓库事业部 华南区
  • 2. Agenda关于TERADATA Teradata数据库原理 Teradata数据库架构 Teradata数据库工作原理 Teradata特性 Teradata数据仓库构建 基本概念 常用工具介绍 管理的一些约定
  • 3. 关于TERADATATeradata最初产生于1976年,由加州理工学院和花旗银行的高科技项目-创建一个能够分析10的12次方 bytes 数据的系统。 1 Kilobyte = 103 = 1000 bytes 1 Megabyte = 106 = 1,000,000 bytes 1 Gigabyte = 109 = 1,000,000,000 bytes 1 Terabyte = 1012 = 1,000,000,000,000 bytes 1 Petabyte = 1015 = 1,000,000,000,000,000 bytes 1 Exabyte = 1018 = 1,000,000,000,000,000,000 bytes 1 Zetabyte = 1021 = 1,000,000,000,000,000,000,000 bytes 1 Yottabyte = 1024 = 1,000,000,000,000,000,000,000,000 bytes
  • 4. 关于TERADATATeradata 是全球最大的专注于数据仓库、咨询服务及企业分析方案的提供商,凭借业界领先的数据库、数据仓库解决方案、性能卓越的可扩展平台以及全球2000多个大型数据仓库项目的客户成功经验,成就了公司在数据仓库领域的创新领导地位。
  • 5. ability to executecompleteness of visionniche playersvisionariesDATAllegroGreenplumNetezzaMySQLKognitioSand TechnologySybaseMicrosoftIBMTeradataOracleleaderschallengersas of September 200720072008ability to executecompleteness of visionSybaseOracleIBMTeradataMicrosoft-DATAllegroNetezzaGreenplumHPSun Microsystems-MySQLKognitioSand TechnologyVerticaIngresIlluminate Solutions1010dataniche playersvisionariesleaderschallengersas of December 2008Gartner 评选Teradata为数据仓库领导厂商 5
  • 6. Teradata数据库原理Teradata数据库架构 Teradata数据库工作原理 Teradata特性
  • 7. 通 信 界 面 存 取 模 块 单 元 AMP 分 析 引 擎 PEPE1PE2AMP1AMP2AMP4AMP3 网 关 通 道 单 结 点 SMPUNIXPDE (并 行 数 据 库 扩 展 环 境)TCP/IP 封 闭 主 机 系 统 信 息 传 递 层 磁盘阵列Teradata数据库架构 BYNET信息传递网络
  • 8. TERADATA的MPP架构CPUCPUCPUCPUMemoryCPUCPUCPUCPUMemoryMPP系统与Teradata 多结点同时工作 数据库由各结点共同拥有MPP (Massive Parallel Processing) 海量并行处理服务器:由多个SMP服务器通过一定的结点互联网络进行连接,协同工作,完成相同的任务。从用户的角度来看是一个系统!
  • 9. TERADATA的MPP架构高速并行处理的BYNET技术 更适合复杂/即席查询 通过硬件技术实现Teradata的多个子任务均衡并行SMP 处理器节点光纤I/O连接MPP 数据高速交换层BYNET2DA Controllers(w/Cache)BYNET高性能磁盘阵列DA Controllers(w/Cache)DA Controllers(w/Cache)DA Controllers(w/Cache) Memory CPU CPU CPU CPU Memory Memory CPU CPU CPU CPU Memory Memory CPU CPU CPU CPU Memory Memory CPU CPU CPU CPU MemoryCLIQUE
  • 10. PARSING ENGINE (PE) SQL Parser & Optimizer Query Step Dispatcher Network Distribution Access Module Processors (AMP) Disk PartitionsV-PE BY-NetV-AMPV-AMPV-AMPV-AMPV-PETeradata并行处理架构
  • 11. AMP4的数据AMP3的数据AMP2的数据每个并行单元只管理自己的数据Building IndexesBackup & RecoveryRow LockingTransaction JournalizingSortingReading WritingLoadingAggregatingAMPAMPAMPAMP1的数据AMP1BYNETTeradata并行的机制
  • 12. 其他关系数据库 “有条件的并行”Teradata “无条件的并行”时间初始查询查询优化查询并行扫描链接聚合排序收敛最终结果集并行处理性能
  • 13. 线性扩展能力 最大化的利用每个节点的资源 可灵活配置VPROCs AmpsVPROCs AmpsVPROCs AmpsVPROCs AmpsVPROCs AmpsVPROCs AmpsVPROCs AmpsVPROCs AmpsVPROCs AmpsVPROCs AmpsVPROCs AmpsVPROCs AmpsVPROCs AmpsVPROCs AmpsVPROCs AmpsVPROCs AmpsShared Nothing Software
  • 14. MPP小结Teradata MPP 架构 使用当前最快的CPU 最好的扩展性 使用 shared-nothing MPP架构以达到线性扩展
  • 15. Teradata数据仓库构建基本概念 常用工具介绍 管理的一些约定
  • 16. 数据处理的演变
  • 17. 数据仓库(Data Warehouse,可简写为DW )数据仓库是决策支持系统(DSS)和联机分析(OLAP)应用数据源的结构化数据环境。数据仓库研究和解决从数据库中获取信息的问题。数据仓库的特征在于面向主题、集成性、稳定性和时变性。 Operational DataData WarehouseAccess ToolsEnd UsersATMPeople SoftPOSCognosAccessBiz ObjectsTeradata RDBMS
  • 18. ETLETL是Extraction-Transformation-Loading的缩写,负责将分布的、异构数据源中的数据如关系数据、平面数据文件等抽取到临时中间层后进行清洗、转换、集成,最后加载到数据仓库或数据集市中,成为联机分析处理、数据挖掘的基础。
  • 19. ETL ETL是构建数据仓库的重要一环,用户从数据源抽取出所需的数据,经过数据清洗,最终按照预先定义好的数据仓库模型,将数据加载到数据仓库中去。
  • 20. 主索引是表中的一个或多个字段,用于确定数据的物理分布 每个表的数据根据 PI (主索引)平均分布在不同的AMP 通过Hash算法实现数据自动分布 无需数据重组、重新分区、数据分布管理 可以是唯一或非唯一 一个表不会有两个主索引 主索引的选择,关系到能否很好的发挥Teradata数据库的优势-并行处理。 VAMP1 VAMP2 VAMP3 VAMP4 ………………………………………………………VAMPnTable 1 Table 2 Table 3Primary IndexTeradata Parallel Hash Function主索引(Primary Index)PI:cust_idPI:cust_idPI:acc_id
  • 21. 主键和主索引Indexes are conceptually different from keys. A PK is a relational modeling convention which allows each row to be uniquely identified. A PI is a Teradata convention which determines how the row will be stored and accessed. A significant percentage of tables may use the same columns for both the PK and the PI. A well-designed database will use a PI that is different from the PK for some tables. Primary Key Primary Index Logical concept of data modeling Physical mechanism for access and storage Teradata doesn’t need to recognize Each table must have exactly one primary index No limit on number of columns 16 column limit (V2R4.1); 64 column limit (V2R5 …) Documented in data model Defined in CREATE TABLE statement (Optional in CREATE TABLE) Must be unique May be unique or non-unique Identifies each row Identifies 1 (unique) or multiple rows (non-unique) Values should not change Values may be changed (Delete + Insert) May not be NULL – requires a value May be NULL Does not imply an access path Defines most efficient access path Chosen for logical correctness Chosen for physical performance
  • 22. Accessing via a Unique Primary IndexAMP 1AMP 2AMP 3AMP 4CustNamePhoneUPIBase TableCustNamePhoneUPIBase TableAdams111-222231Smith555-666674Brown333-999998Young777-444412CustNamePhoneUPIBase TableAdams444-666645Smith111-666649Smith555-777756Marsh888-222251Rice666-555584Black444-555562Jones777-666677Peters555-777795CustNamePhoneUPIBase TableSmith222-333340Adams666-777772Jones222-888827White555-444437CREATE TABLE Customer (Cust INT ,Name CHAR(10) ,Phone CHAR(8) ) UNIQUE PRIMARY INDEX (Cust);SELECT * FROM customer WHERE cust = 45;PEUPI = 45Hashing AlgorithmBYNETCustNamePhoneCUSTOMER table 37 White 555-4444 98 Brown 333-9999 74 Smith 555-6666 95 Peters 555-7777 27 Jones 222-8888 56 Smith 555-7777 45 Adams 444-6666 84 Rice 666-5555 49 Smith 111-6666 51 Marsh 888-2222 31 Adams 111-2222 62 Black 444-5555 12 Young 777-4444 77 Jones 777-6666 72 Adams 666-7777 40 Smith 222-3333UPIPK
  • 23. Row Distribution Using a UPIAMP 1AMP 2AMP 3AMP 4720224/09C740234/16C732524/13C722524/15C718814/13C738414/12C732434/13C710314/10COrder741514/13CThe PK column(s) will often be used as a UPI. PI values for Order_Number are known to be unique (it’s a PK). Teradata will distribute different index values evenly across AMPs. Resulting row distribution among AMPs is uniform.
  • 24. Row Distribution Using a NUPIOrder722524/15C732524/130741514/13C738414/12C732434/130740234/16C710314/10CAMP 1AMP 2AMP 4720224/09C718814/13CAMP 3Customer_Number may be the referred access column for ORDER table, thus a good index candidate. Values for Customer_Number are non-unique and therefore a NUPI. Rows with the same PI value distribute to the same AMP causing row distribution to be less uniform or skewed.
  • 25. Row Distribution Using a Highly Non-Unique Index OrderAMP 1AMP 2AMP 4732524/130732434/130710314/10O732524/130722524/15C741514/13C738414/12C740234/16C720224/09C718814/13CAMP 3Values for Order_Status are highly non-unique. Only two values exist, so only two MPs will be used in this table. This table will not perform well in parallel operations. Highly non-unique columns are poor PI choices. The degree of uniqueness is critical to efficiency.
  • 26. Partitioned Primary Index 4 AMPs with Orders Table Defined with NPPI4 AMPs with Orders Table Defined with PPI on O_Date
  • 27. Secondary IndexesA secondary index is an alternate path to the rows of a table. A table can have from 0 to 32 secondary indexes. Secondary indexes: Do not affect table distribution. Add overhead, both in terms of disk space and maintenance. May be added or dropped dynamically as needed. Are chosen to improve table performance.
  • 28. Full Table ScansSELECT * FROM customer WHERE Cust_Phone LIKE ‘524-’; SELECT * FROM customer WHERE Cust_Name <> ‘Davis’; SELECT * FROM customer WHERE Cust_ID > 1000;Examples of Full-Table Scans:Every row of the table must be read. All AMPs scan their portion of the table in parallel. Primary Index choice affects FTS performance. Full-table scans typically occur when either: The index columns are not used in the query An index is used in a non-equality test A range of values is specified for the primary indexCUSTOMERCust_IDCust_NameCust_PhoneUSINUPINUSI
  • 29. Query Submitting ToolsBTEQ Basic Teradata Query utility Report writing and formatting features Interactive and batch queries Import/Export across all platforms
  • 30. FastLoadFast batch mode utility for loading new tables onto the Teradata database Can reload previously emptied tables Full Restart capability Error Limits and Error Tables, accessible using SQL Restartable INMOD routine capability Ability to load data in several stages Teradata Database HostFastLoad
  • 31. FastLoad CharacteristicsPurposeLoad large amounts of data into an empty table at high speed. Execute from Teradata servers, channel, or network-attached hosts.ConceptsLoads into an empty table with no secondary indexes. Has two phases - creates an error table for each phase. Status of run is displayed. Checkpoints can be taken for restarts.RestrictionsOnly load 1 empty table with 1 FastLoad job. The Teradata Database will accommodate up to 15 FL/ML/FE applications at one time. Tables defined with Referential integrity, secondary indexes, Join Indexes, Hash Indexes, or Triggers cannot be loaded with FastLoad. Tables with Soft Referential Integrity (V2R5) can be loaded with FastLoad. Duplicate rows cannot be loaded into a multiset table with FastLoad. If an AMP goes down, FastLoad cannot be restarted until it is back online.
  • 32. A Sample FastLoad ScriptLOGON tdpid/username,password; DROP TABLE Acct; DROP TABLE AcctErr1; DROP TABLE AcctErr2; CREATE TABLE Acct, FALLBACK ( AcctNum INTEGER ,Number INTEGER ,Street CHAR(25) ,City CHAR(25) ,State CHAR(2) ,Zip_Code INTEGER) UNIQUE PRIMARY INDEX (AcctNum); LOGOFF; LOGON tdpid/username,password; BEGIN LOADING Acct ERRORFILES AcctErr1, AcctErr2 CHECKPOINT 100000; DEFINE in_AcctNum (INTEGER) ,in_Zip (INTEGER) ,in_Nbr (INTEGER) ,in_Street (CHAR(25)) ,in_State (CHAR(2)) ,in_City (CHAR(25)) FILE=data_infile1; INSERT INTO Acct VALUES ( :in_AcctNum ,:in_Nbr ,:in_Street ,:in_City ,:in_State ,:in_Zip); END LOADING; LOGOFF; SETUP Create the table, if it doesn’t already exist. Start the utility. Error files must be defined.Checkpoint is optional.DEFINE the input; must agree with host data format.INSERT must agree with table definition. Phase 1 begins. Unsorted blocks are written to disk.Phase 2 begins with END LOADING. Sorting and writing blocks to disk.
  • 33. MultiLoadBatch mode utility that runs on the host system. FastLoad-like technology – TPump-like functionality. Supports up to five populated tables. Multiple operations with one pass of input files. Conditional logic for applying changes. Supports INSERTs, UPDATEs, DELETEs and UPSERTs; typically with batch inputs from a host file. Affected data blocks only written once. Host and LAN support. Full Restart capability. Error reporting via error tables. Support for INMODs.MULTILOADHOSTTeradata DBUPDATEsINSERTsDELETEsTABLE ATABLE BTABLE CTABLE DTABLE E
  • 34. MultiLoad LimitationsNo data retrieval capability. Concatenation of input data files is not allowed. Host will not process arithmetic functions. Host will not process exponentiation or aggregates. Cannot process tables defined with USI’s, Referential Integrity, Join Indexes, Hash Indexes, or Triggers. Import tasks require use of Primary Index.
  • 35. Basic MultiLoad Statements.LOGTABLE [ restartlog_tablename ] ; .LOGON [ tdpid/userid, password ] ; .BEGIN MLOAD TABLES [ tablename1, ... ] ; .LAYOUT [ layout_name ] ; .FIELD ….. ; .FILLER ….. ; .DML LABEL [ label ] ; .IMPORT INFILE [ filename ] [ FROM m ] [ FOR n ] [ THRU k ] [ FORMAT FASTLOAD | BINARY | TEXT | UNFORMAT | VARTEXT 'c' ] LAYOUT [ layout_name ] APPLY [ label ] [ WHERE condition ] ; .END MLOAD ; .LOGOFF ;.FIELD fieldname { startpos datadesc } || fieldexp [ NULLIF nullexpr ] [ DROP {LEADING / TRAILING } { BLANKS / NULLS } [ [ AND ] {TRAILING / LEADING } { NULLS / BLANKS } ] ] ; .FILLER [ fieldname ] startpos datadesc ;
  • 36. FastExportExports large volumes of formatted data from Teradata to a host file or user-written application. Takes advantage of multiple sessions. Export from multiple tables. Uses Support Environment. Fully automated restart. Uses one of the “Loader” slots. Teradata Database HostFastExport
  • 37. A FastExport Script.LOGTABLE RestartLog1_fxp; .RUN FILE logon ; .SET CityName TO 'Los Angeles'; .SET ZipCode TO 90666; .BEGIN EXPORT SESSIONS 4 ; .EXPORT OUTFILE custacct_data; SELECT A.Account_Number , C.Last_Name , C.First_Name , A.Balance_Current FROM Accounts A INNER JOIN Accounts_Customer AC INNER JOIN Customer C ON C.Customer_Number = AC.Customer_Number ON A.Account_Number = AC.Account_Number WHERE A.City = '&CityName' AND A.Zip_Code = &ZipCode ORDER BY 1 ; .END EXPORT ; .LOGOFF ; Define Restart LogSpecify number of sessionsDestination fileSend request.Terminate sessionsVia a SELECT, specify the columns and rows to export.
  • 38. 其他常用工具Teradata Administrator Teradata SQL Assistant Teradata Manager Teradata Ole Load
  • 39. 管理中的一些约束数据库和用户的管理 数据库对象的管理 数据库空间的管理
  • 40. 数据库和用户的管理 User DBCUser SYSDBAUser AUser C User D用户和数据库分别归属到两个用户下面。 针对不同类型的对象,分别创建库,便于管理。 最好创建一个专门用于存放临时表的库。User BDatabase ADatabase B
  • 41. 数据库对象的管理 对象名称的约束 Prefix_subject_body_suffix_YYYYMMDD 说明:Prefix是表明这个对象的类型,如TB表示对象为表,VW表示对象为视图等等,subject一般是数据模型中所定义的。 例如:TB_cust_cust_his,这是一个客户域里面的客户历史表。 对于用户创建的临时表,最好使用用户名(实名用户)作为前缀,并加上日期后缀,以便在清理空间时确认。 表名最多不要超过32个字符(64位系统将解决这个问题) 对于FastLoad,MultiLoad中定义的表名不要超过25个字符。
  • 42. 数据库空间的管理 制定数据仓库数据存储策略 比如历史表保存多久的数据,明细表和汇总表保存多久的数据 及时清理临时表,长期不使用的表。 定期检查库表的数据分布情况,对分布不均匀的表进行重新分 布。
  • 43. 谢 谢 !