李炜 - Building Next-Generation Data Integration Platform


George Xiong
eBay Data Platform Architect
April 21, 2013

Building Next-Generation Data Integration Platform

eBay Analytics (year 2012)
•  >50 TB/day new data
•  >100 PB/day processed
•  >100 trillion pairs of information
•  Millions of queries/day
•  >7500 business users & analysts
•  >60k chains of logic
•  24x7x365, always online
•  99.98+% availability
•  100+ subject areas
•  >1000 data sources
•  5000+ Target Data

Data Platforms
•  EDW (Data Warehouse): Structured / SQL; Enterprise-class System; Production Data Warehousing; Large Concurrent User-base; Analyze & Report; 500+ concurrent users
•  Singularity (Data Warehouse + Behavioral): Semi-structured / SQL++; Low End Enterprise-class System; Contextual-Complex Analytics; Deep, Seasonal, Consumable Data Sets; Discover & Explore; 150+ concurrent users
•  Hadoop: Unstructured / JAVA&C; Commodity Hardware System; Structure the Unstructured; Detect Patterns; 5-10 concurrent users

Data Integration Layer Retrospective
•  Big Data = Big Systems <> Accurate Data
•  Job Complexity
•  System Outage / Availability
•  High Maintenance Costs
•  Quick Delivery Pressure
ETL is always the first priority of DI

Once upon a time
Diagram labels: Oracle 1, Oracle 2, Oracle 3, Data Files, BU DM 1, TERADATA 1, TERADATA 2, HADOOP
•  Single source extract
•  Basic Reformat
•  Static load
•  Inefficient … Inconsistent … In parallel
•  User unfriendly

Next-Gen ETL Requirement
•  Compression
•  Conditional Components
•  Multi-source/Multi-Target
•  Abstraction
•  Platform Cost Efficiency
•  Rapid Development
•  Built-in HA/DR
•  Hyper Reusability
•  High Scalability
•  Single Version

Building The Foundation
•  Reusable, metadata-driven processes
•  Picking the right tool
•  Think big, implement small, increment later
•  Focus on efficiency where it matters
•  Single Version Utilities

Abstraction: Metadata Drives Everything

Key Component: DML

    record
      decimal(13) id;                            /* DECIMAL(12) NOT NULL */
      string(2)   code;                          /* CHAR(2) NOT NULL */
      string(2)   iso_country;                   /* CHAR(2) NOT NULL */
      string(1)   summertime_ends_first = NULL;  /* CHAR(1) */
      decimal(10) summertime_ends_month = NULL;  /* DECIMAL(9) */
      decimal(10) default_currency_id = NULL;    /* DECIMAL(9) */
      decimal(10) name_res_id = NULL;            /* DECIMAL(9) */
    end

Environment Setup
•  Common setup script
•  ETL process-specific configuration
•  Everything evaluated at run time

The Extract Process
•  Single common extract handler
•  ETL ID specific state files
•  Run time metadata
•  Single module extract utility
[Figure: Ab Initio Extract Graph]

The Load Process
•  Single common load handler
•  ETL ID specific state files
•  Run time metadata
•  Single module load utility
•  Multi-Data Target
[Figure: Ab Initio Load Graph]

The Transformation Process
•  Typically run post load
•  Dynamic environment
•  Independent SQL or MapReduce
•  Run time Query Band
•  Natively integrated

The ETL Metadata
•  System Capacity/Workload
•  Data Lineage
•  ETL Job State
•  Resource tracking and metrics

Other ETL Framework Modules
•  Data Move utilities
•  Unit of Work
•  Data Pipeline
•  ETL host workload balance
•  Job Auto Switch
•  Auto ETL code smart gen tools
•  ELT -> ETL
•  …

Put It All Together
Efficient … Consistent … Configurable … Extensible … Parallel
Reusability … Restartability
Data Mover Utility
Diagram labels: Oracle, Teradata, HDFS, Data File, XML, Web Logs; Oracle, Teradata, Hadoop; Data File(s)
EXTRACT
•  Single common extract handler
•  ETL ID specific state files
•  Run time metadata
•  Single module extract utility
LOAD
•  Single common load handler
•  ETL ID specific state files
•  Run time metadata
•  Single module load utility
•  Multi-Data Target
Transform
•  Typically run post load
•  Dynamic environment
•  Independent SQL
•  Run time Query Band
•  Natively integrated
Metadata
•  System Capacity/Workload
•  Data Lineage
•  ETL Job State
•  Resource tracking and metrics
•  …

DI technologies: Not Only ETL
[Figure: Next Generation DI Options Plotted for Growth and Commitment, from TDWI. Surviving label: Software-as-a-service (SaaS)]

>85% of eBay analytical workload is NEW & Unknown
•  The metrics you know are cheap
•  The metrics you don't know are expensive, but high in potential ROI
•  Exploration & Testing are core pillars of an analytics-driven organization

What is a VDM?
A Virtual Data Mart (VDM) is a prototyping or subject-area-specific environment in Teradata (formerly called PET). It allows end users to create a working, non-production environment for:
•  One-time analytics
•  Specific, unique data analysis
•  Loading and correlating data from sources not currently available in the EDW
•  Business-unit-specific reporting and analysis

Metadata Collecting Automation
Diagram labels: DBQL, Table Usage Info, ETL JOB Log, Analysis Engine, ETL Metadata, ETL Metadata Repository
DBQL, Table Usage Info, and ETL JOB LOG are Teradata dictionary tables:
•  DBQL: contains the details of each query, such as runtime, CPU cost, and query band.
•  Table Usage Info: which table(s) are used by the query.
•  ETL JOB TRACKER
The Analysis Engine analyzes the raw data from DBQL and Table Usage Info to derive dependency metadata about tables:
•  At the batch script (job) level, which table(s) are output tables of the script (job)
•  Which table(s) are input tables of the script (job)
ETL Metadata contains the results of the Analysis Engine, including:
•  DFD dependency metadata for each table; with this metadata, a DFD can be drawn for any table using the tool Graphviz.
•  Each script (job) is a node of the diagram.
•  The dependencies between scripts (jobs) define the mapping between nodes.

Data Lineage
Annotations on the lineage diagram:
•  Step 2: the step number is ordered by the job start time
•  Job start/end time (HH:MM:SS)
•  The name of the script (job) that populates the table in the step
•  The output table of step 1 is also the input table of step 2
•  Rounded-corner rectangle: upstream tables from other subject areas
•  Blue line: the process critical path
•  Gray background: highlights the target table of the diagram

More Data Integration Programs
•  Data Quality
•  Data Rationalization
•  Standardized ETL Building Tools
•  The Datahub
•  …

Questions?
For more information: jxiong@ebay.com
@InfoQ infoqchina
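
The Data Lineage and Metadata Collecting Automation slides describe deriving job-level dependencies from each script's input and output tables and drawing the result with Graphviz. The following is a minimal sketch of that idea, not the eBay implementation: the job names, table names, and the `JOBS` dictionary layout are hypothetical, and the output is plain DOT text that could be passed to the Graphviz `dot` command. Because nodes here are scripts (jobs), the gray highlight is applied to the job that populates the target table, which is an assumption about how the slide's "gray target" convention would carry over.

```python
from collections import defaultdict

# Hypothetical job-level metadata, shaped like the Analysis Engine output
# described in the slides: each script (job) with its input and output tables.
JOBS = {
    "extract_orders.ksh": {"inputs": [], "outputs": ["stg.orders"]},
    "extract_users.ksh": {"inputs": [], "outputs": ["stg.users"]},
    "build_order_fact.ksh": {
        "inputs": ["stg.orders", "stg.users"],
        "outputs": ["dw.order_fact"],
    },
    "build_daily_report.ksh": {
        "inputs": ["dw.order_fact"],
        "outputs": ["rpt.daily_sales"],
    },
    "unrelated_job.ksh": {"inputs": [], "outputs": ["stg.other"]},
}


def lineage_edges(jobs, target_table):
    """Return sorted (upstream_job, table, downstream_job) edges feeding target_table."""
    # Map every table to the jobs that write it.
    producers = defaultdict(list)
    for job, io in jobs.items():
        for table in io["outputs"]:
            producers[table].append(job)

    edges, seen = set(), set()
    pending = list(producers[target_table])  # start from jobs that write the target
    while pending:
        job = pending.pop()
        if job in seen:
            continue
        seen.add(job)
        # Walk upstream: every producer of an input table is a dependency.
        for table in jobs[job]["inputs"]:
            for upstream in producers[table]:
                edges.add((upstream, table, job))
                pending.append(upstream)
    return sorted(edges)


def to_dot(jobs, target_table):
    """Render the lineage of target_table as Graphviz DOT text (one node per job)."""
    lines = ["digraph lineage {", "  rankdir=LR;", "  node [shape=box];"]
    for job, io in sorted(jobs.items()):
        if target_table in io["outputs"]:
            # Gray fill marks the job that populates the target table.
            lines.append(f'  "{job}" [style=filled, fillcolor=gray];')
    for upstream, table, downstream in lineage_edges(jobs, target_table):
        lines.append(f'  "{upstream}" -> "{downstream}" [label="{table}"];')
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_dot(JOBS, "rpt.daily_sales"))
```

Labeling each edge with the shared table keeps the "output of step 1 is the input of step 2" relationship visible, while jobs that do not feed the target (here `unrelated_job.ksh`) are left out of the diagram.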
PDF contributed by 醉鱼当道 on 2013-05-13
