Li Wei (李炜) - Big Data Platform


EBAY BIG DATA PLATFORM
DATA-AS-A-SERVICE OVERVIEW
Mar 15, 2013
Steven Li

Outline
•  Data Platform History
•  Big Data Platform Overview
•  User Scenarios for Data Service

Analytics: The Beginning (YEAR 1998)
PREHISTORIC TIME
•  WHAT: A handful of reports
•  HOW: C++ & SQL running directly on the eBay site
•  USERS: 10
•  REPORT TURNAROUND: 15 days
•  TIMELINESS: Real-time

Analytics: Today (YEAR 2012)
•  >50 TB/day new data
•  >100 PB/day processed
•  >100 trillion pairs of information
•  Millions of queries/day
•  >7500 business users & analysts
•  >50k chains of logic
•  >100k data elements
•  24x7x365, 99.98+% availability
•  Turning over a TB every second
•  Active/Active
•  Near-real-time
•  Always online

Evolution Timeline
[Timeline figure, axis ‘99, ‘01, ‘03, ‘05, ‘07, ‘09, ‘11. The position of each item along the axis is not recoverable from the extracted text.]
•  Access DB & MS Excel Reports (3 MB)
•  Informatica
•  Oracle DW & Business Objects (500 GB)
•  Campaign Mgmt DMs: FADE, RFM, Power Sellers, VCRU… (4 TB)
•  Teradata 12N 5300 (8 TB)
•  52N Co-existent 5300/5350/5380 (30 TB / 50 TB Max)
•  Dual System: Primary 64N 5400, Secondary 60N 5380
•  Auto Gen ETL
•  SOA
•  PET (aka VDM 1.0)
•  Ab Initio
•  Tactical Workload
•  Member Insight
•  ABC (RAM)
•  Singularity V1
•  VDM
•  Tableau
•  Analytics as a Service
•  Community Analytics (DataHub)
•  Singularity V2
•  Hadoop
•  MicroStrategy
•  SAS
•  MAX Portal

Hadoop Clusters
•  Athena @ Sacramento - 550 nodes. Launched May 2010
•  Ares @ Las Vegas - 1900 nodes, 40 PB storage. Launched April 2011
•  Titan1 @ Las Vegas - 225 nodes, 5.4 PB storage. Launched April 2011
•  Apollo @ Phoenix - 1900 nodes, 40 PB storage. Launched November 2011
•  Titan1 @ Phoenix - 225 nodes, 5.4 PB storage. Launched November 2011

“Big Data” Challenge - Platforms
Data Integration: Ab Initio, UC4, SOA, Golden Gate, BES, Cascading
Data Platform:
•  EDW (enterprise-class system): ~6 PB, Teradata 5555, relational data, dual system
•  “SINGULARITY” (low-end enterprise-class system): 40+ PB, semi-structured & relational data, deep storage
•  HADOOP CLUSTERS (commodity hardware system): 40+ PB, unstructured data, pattern detection, deep storage
Usage: Analyze & Report, Discover & Explore
Data Access: MicroStrategy, SAS, SOA/DAL, R, SQL, Tableau, Java-M/R

“Big Data” Challenge – Data Sharing
•  HADOOP: unstructured data, pattern detection
•  SINGULARITY: semi-structured & relational data
•  EDW: structured & relational data
•  SITE: live & transaction data
Diagram labels: Volume, Metadata, ?

“Big Data” Challenge – Productivity
Languages and tools shown: ANSI-SQL, MapReduce, Java, XML, Unix Shell, ETL Data Flow, Ab Initio DML, Sojourner Log Meta, TD-SQL, Pig Latin, Processe

Big Data Platform Overview
Layers: Bridge, Analytic Systems, Source Systems, Data Integration, Data Engineer, Application & Service
•  Systems: Oracle, Singularity, CAL, Teradata, Hadoop Cluster, NoSQL DB
•  Metadata Management: Metadata Repository, Metadata Converter
•  Platform services: Data Processing Framework, Data Movement, Data Access Extension, Job Scheduling & Monitoring
•  Applications: IDE, Web Portal, 3rd Party Apps, Data Mining Applications
•  Data movement paths: Near Line, TPT, Parallel Exports, Listener, Files, Data Hub
•  Users: Data Engineer, Business User, Site Analyst, Data Scientist

Bridge Data Shift Service
Self-Service & Meta-Driven
Feature
•  Zero ETL coding
•  Heterogeneous source/target dataset support
•  Hadoop metadata auto-generation
•  Job scheduling auto-generation
•  Open APIs for any portal
Persona: Data Scientist / PM / BU / Analyst
•  No technical barrier
•  No engineering support
Scenario
•  Cross-platform data integration
•  Prototyping / POC

Unified Metadata
•  Sources: Hive, Pig, Java, LegaDom, Teradata, Oracle, MySQL, Cascading, Files, M/R
•  Formats: TD-DDL, ORA-DDL, DDL, Hive DDL, Pig-Latin, Jar lib
•  Diagram labels: Unified Metadata, Meta Repo, HADOOP

Unified Metadata
•  Various sources
•  Meta Mapper
•  Meta Loader (legaLoader.rb)
•  Meta Repo
•  Meta Generator (legaDomHadoopWriGen.rb)
•  Meta Libs
•  Various targets

Data Shift – Fast Bridge with Big Data systems
•  Fast bridge with Big Data systems
•  Implements an optimized Teradata-Hadoop data pipeline
•  10 TB/min
•  No utilities needed
•  Executed solely through SQL
Diagram labels: Bridge, Analytic Systems, Singularity, Teradata, Hadoop Cluster, Near Line, TPT

Building Data Application - Productivity Problem
1 app, 75 jobs
[Job-graph figure. Legend: Green = map + reduce, Purple = map, Blue = join/merge, Yellow = map split]
Hardware and software are scalable, but peopleware is NOT!

Hadoop Data Processing Framework - Plumbum
•  A data processing framework on top of Cascading
•  Quickly build data processing applications
•  Test-driven development
•  Reusable processing components
•  Data flow visualization
•  Solves common issues, promotes standards and best practice
•  Grows with the community
Related tools shown: MapReduce, Hive, Pig, Cascading, Cascalog, PyCascading, Cascading.JRuby, Scalding, Crunch, Scrunch, Scoobi

Building Data Application with Plumbum
Diagram labels: KW, Impression, User, Category, MapReduce Jobs, Data Products
Best Practice
•  Data flow design visualization
•  Code generation
•  Test / debug / performance optimization
•  Deploy on cluster
•  Data products
•  Reusable processing components
Scenario
•  Build complex data processing
•  Productionized data applications
Persona: Data Engineer / Data Scientist
•  Focus on logic
•  Efficient to develop
•  Implement various data applications

Machine Learning – Pattern Implementation
Scenario
•  User purchase history as the training data set
•  Train a risk classifier using Random Forest
•  Export the model from R to PMML
•  Compile a Cascading app to execute the PMML model
•  Deploy the app at scale to calculate scores
Persona: Data Scientist / Data Engineer
•  Easy Pattern implementation
•  Deploy the app at scale
Diagram labels: Training Datasets, Data Scientist, PMML, Data Processing Framework - Plumbum, Scores

Machine Learning – Pattern Implementation
Diagram labels: Partner Data, EDW, Customer DB, In Memory Data Grid, ETL, Offline, Online, Customer transactions, Score new orders, Hadoop, Risk classifier dimension: pre-order, Risk classifier dimension: customer 360, PMML model, Analyst’s laptop, Model training, Data prep., Test models at scale, Score all customers, Model Implementation, Cascading Apps

Best Practices
[Radar chart, scale 0 to 10. Axes: Scalability, Governance, Re-use, Availability, Flexibility, Access Methods, Ease of Use, Self-service, Visual Exploration. The plotted values are not recoverable from the extracted text.]

Thank you!
Steven Li, email: weli@ebay.com
@InfoQ infoqchina
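
The Unified Metadata slides describe one metadata repository feeding generators that emit platform-specific definitions such as TD-DDL, ORA-DDL and Hive DDL. The deck does not show how those generators work, so the following is only a minimal Python sketch of the general idea. The neutral type names, the type mappings, and every identifier in it (`Column`, `TYPE_MAP`, `generate_ddl`) are assumptions made for illustration and do not come from eBay's tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """One column in a platform-neutral table description (hypothetical model)."""
    name: str
    logical_type: str  # one of the neutral type names used as keys below


# Hypothetical mapping from neutral types to platform column types.
# A real repository would need a much richer mapping (precision, nullability, ...).
TYPE_MAP = {
    "hive": {"string": "STRING", "long": "BIGINT", "timestamp": "TIMESTAMP"},
    "teradata": {"string": "VARCHAR(255)", "long": "BIGINT", "timestamp": "TIMESTAMP(0)"},
}


def generate_ddl(table: str, columns: list, target: str) -> str:
    """Render a CREATE TABLE statement for one target from the shared description."""
    types = TYPE_MAP[target]  # raises KeyError for an unsupported target
    body = ",\n".join(f"  {c.name} {types[c.logical_type]}" for c in columns)
    return f"CREATE TABLE {table} (\n{body}\n);"


if __name__ == "__main__":
    purchases = [Column("user_id", "long"), Column("item_title", "string")]
    for target in ("hive", "teradata"):
        print(generate_ddl("purchases", purchases, target))
```

The point of the sketch is the design choice the slides imply: the table is described once, and each target only contributes a type mapping, so supporting another platform means adding a mapping rather than hand-writing another set of definitions.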