Hadoop Usage at Yahoo!


Milind Bhandarkar (milindb@yahoo-inc.com)

About Me
• Parallel programming since 1989
• High-performance scientific computing 1989-2005, data-intensive computing 2005-...
• Hadoop Solutions Architect @ Yahoo!
• Contributor to Hadoop since 2006
• Training, consulting, capacity planning

History
• 2004-2005: Hadoop prototyped in Apache Lucene
• January 2006: Hadoop becomes a subproject of Lucene
• January 2008: Hadoop becomes a top-level Apache project
• Latest release: 0.21
• Stable release: 0.20.x (+ strong authentication)

Hadoop Ecosystem
• HBase, Hive, Pig, Howl, Oozie, ZooKeeper, Chukwa, Mahout, Cascading, Scribe, Cassandra, Hypertable, Voldemort, Azkaban, Sqoop, Flume, Avro ...

Hadoop at Yahoo!
• Behind every click!
• 38,000+ servers
• Largest cluster is 4,000+ servers
• 1+ million jobs per month
• 170+ PB of storage
• 10+ TB of compressed data added per day
• 1,000+ users

Hadoop Growth
[Figure: growth of Hadoop servers and Hadoop storage (TB) at Yahoo!, 2006-2010, annotated with 38K servers, 170 PB of storage, and 1M+ monthly jobs]

Hadoop Clusters
• Hadoop dev, QA, benchmarking (10%)
• Sandbox, release validation (10%)
• Science, ad-hoc usage (50%)
• Production (30%)

Sample Applications
• Science + Big Data + Insight = Personal Relevance = Value
• Log processing: analytics, reporting, buzz
• User modeling
• Content optimization, spam filters
• Computational advertising

Major Data Sources
[Figure: the user at the center of three data sources]
• Content (web pages, blogs, news articles, media)
• Search queries
• Advertisements (display, search)

Web Graph Analysis
• 100+ billion web pages
• 1+ PB of content
• 2 trillion links
• 300+ TB of compressed output
• Before Hadoop: 1 month
• With Hadoop: 1 week

Search Assist
• Related concepts occur closer together
• 3 years of query logs, sessionized per user
• 10+ TB of natural-language text corpus
• Build dictionaries with Hadoop & push to serving (a minimal job sketch follows this slide)
• Before Hadoop: 4 weeks
• With Hadoop: 30 minutes
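The deck does not show the dictionary-building job itself, so the following is a minimal MapReduce sketch of its core step: counting how often two queries co-occur within the same user session. It assumes sessionized input where each line carries one session's queries separated by tabs; the class names and input layout are illustrative, not Yahoo!'s actual code.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class QueryCooccurrence {

  // Input (hypothetical): one session per line, queries separated by tabs.
  public static class PairMapper
      extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    private final Text pair = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String[] queries = value.toString().split("\t");
      // Emit each unordered pair of queries co-occurring in this session.
      for (int i = 0; i < queries.length; i++) {
        for (int j = i + 1; j < queries.length; j++) {
          String a = queries[i], b = queries[j];
          pair.set(a.compareTo(b) <= 0 ? a + "\t" + b : b + "\t" + a);
          ctx.write(pair, ONE);
        }
      }
    }
  }

  public static class SumReducer
      extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> counts, Context ctx)
        throws IOException, InterruptedException {
      long sum = 0;
      for (LongWritable c : counts) sum += c.get();
      ctx.write(key, new LongWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "query-cooccurrence");
    job.setJarByClass(QueryCooccurrence.class);
    job.setMapperClass(PairMapper.class);
    job.setCombinerClass(SumReducer.class);  // pre-aggregate on the map side
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Downstream, the pair counts would be thresholded and packaged into the dictionaries that get pushed to serving.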
Mail Spam Filtering
• Challenge: scale
  • 450+ million mailboxes
  • 5+ billion deliveries per day
  • 25+ billion connections
• Challenge: user feedback is often late, noisy, and inconsistent

Data Factory
• 40+ billion events per day
• Parse & transform event streams
• Join clicks & views, filter out robots (a join sketch appears after the deck)
• Aggregate, sort, partition
• Data quality checks

User Modeling
• Objective: determine user interests by mining user activities
• Large dimensionality of possible user activities
• Typical user has a sparse activity vector
• Event attributes change over time

User Activities
Attribute | Possible Values | Typical Values per User
Pages     | 1+ million      | 10-100
Queries   | 100+ million    | 10s
Ads       | 100+ thousand   | 10s

User-Modeling Pipeline
• Sessionization (sketched after the deck)
• Feature and target generation
• Model training
• Offline scoring & evaluation
• Batch scoring & upload to serving

Data Acquisition
User | Time | Event                                          | Source
U0   | T0   | Visited Yahoo! Autos                           | Web server logs
U0   | T1   | Searched for "Car Insurance"                   | Search logs
U0   | T2   | Browsed stock quotes                           | Web server logs
U0   | T3   | Saw ad for "discount brokerage", did not click | Ad logs
U0   | T4   | Checked Yahoo! Mail                            | Web server logs
U0   | T5   | Clicked ad for "Auto Insurance"                | Ad logs, click logs

Normalization
User | Time | Event      | Tag
U0   | T0   | View       | Category: Autos, Tag: Mercedes Benz
U0   | T1   | Query      | Category: Insurance, Tag: Auto
U0   | T2   | View       | Category: Finance, Tag: YHOO
U0   | T3   | View-Click | Category: Finance, Tag: Brokerage
U0   | T4   | Browse     | Irrelevant event, dropped
U0   | T5   | View+Click | Category: Insurance, Tag: Auto

Features & Targets
[Figure: a user's event timeline of queries, views, and clicks; events before a reference time T form the feature window, events after T form the target window]

Targets
• User actions of interest:
  • Clicks on ads & content
  • Site & page visits
  • Conversion events: purchases, quote requests, sign-ups for membership, etc.

Features
• Summary of user activities over a time window
• Aggregates, moving averages, rates over various time windows
• Incrementally updated

Joining Targets & Features
• Target rates are very low: 0.01% ~ 1%
• First, construct targets
• Filter out user activity without targets
• Join feature vectors with targets (sketched after the deck)

Model Training
• Regressions
• Boosted decision trees
• Naive Bayes
• Support vector machines
• Maximum-entropy modeling
• Conditional random fields

Model Training (continued)
• Some algorithms are difficult or inefficient to implement in MapReduce
  • They require fine-grained iterations
• Train different models in parallel
• Train a model for each target response in parallel

Offline Scoring & Evaluation
• Apply model weights to features (sketched after the deck)
• Pleasantly parallel
• Janino embedded compiler
• Sort by scores and compute metrics
• Evaluate metrics

Batch Scoring
• Apply models to features from all user activity
• Upload scores to serving systems

User Modeling Pipeline
Component                   | Data Volume                | Time
Data acquisition            | 1 TB                       | 2-3 hours
Feature & target generation | 1 TB × feature window size | 4-6 hours
Model training              | 50-100 GB                  | 1-2 hours for 100s of models
Scoring                     | 500 GB                     | 1 hour

Acknowledgements
• Vijay K Narayanan
• Vishwanath Ramarao
• Nitin Motgi
• And numerous Hadoop application developers at Yahoo!

Questions?
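As a rough illustration of the Data Factory's "join clicks & views" and "filter out robots" steps, here is a minimal reduce-side join sketch. The record layout (eventId, type, userAgent, payload) and the substring-based robot check are assumptions made for the example; real robot filtering is far more involved.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ClickViewJoin {

  // Input (hypothetical): eventId \t type(view|click) \t userAgent \t payload
  public static class TagMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String[] f = value.toString().split("\t", 4);
      if (f.length < 4) return;                        // drop malformed records
      if (f[2].toLowerCase().contains("bot")) return;  // crude robot filter, illustrative only
      ctx.write(new Text(f[0]), new Text(f[1] + "\t" + f[3]));
    }
  }

  // All records for one eventId meet here; views are joined with their clicks.
  public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text eventId, Iterable<Text> records, Context ctx)
        throws IOException, InterruptedException {
      List<String> views = new ArrayList<String>();
      int clicks = 0;
      for (Text r : records) {
        String[] f = r.toString().split("\t", 2);
        if ("click".equals(f[0])) clicks++; else views.add(f[1]);
      }
      for (String v : views) {  // clicks == 0 marks an unclicked view
        ctx.write(eventId, new Text(v + "\t" + clicks));
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "click-view-join");
    job.setJarByClass(ClickViewJoin.class);
    job.setMapperClass(TagMapper.class);
    job.setReducerClass(JoinReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```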
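The user-modeling pipeline's first step, sessionization, can be sketched as follows: route all of a user's events to one reducer, sort them by timestamp, and start a new session after a long gap. The input layout (userId, timestamp in milliseconds, event) and the 30-minute gap are assumptions; a production job would typically use secondary sort rather than buffering and sorting in memory.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Sessionizer {

  // Input (hypothetical): userId \t timestampMillis \t event
  public static class UserMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String[] f = value.toString().split("\t", 2);
      if (f.length == 2) ctx.write(new Text(f[0]), new Text(f[1]));
    }
  }

  public static class SessionReducer extends Reducer<Text, Text, Text, Text> {
    private static final long GAP_MS = 30L * 60 * 1000;  // 30-minute inactivity gap

    @Override
    protected void reduce(Text user, Iterable<Text> events, Context ctx)
        throws IOException, InterruptedException {
      List<String> buf = new ArrayList<String>();
      for (Text e : events) buf.add(e.toString());
      Collections.sort(buf, new Comparator<String>() {  // order by leading timestamp
        public int compare(String a, String b) {
          long ta = Long.parseLong(a.split("\t", 2)[0]);
          long tb = Long.parseLong(b.split("\t", 2)[0]);
          return ta < tb ? -1 : (ta > tb ? 1 : 0);
        }
      });
      int session = 0;
      long prev = Long.MIN_VALUE;
      for (String e : buf) {
        long ts = Long.parseLong(e.split("\t", 2)[0]);
        if (prev != Long.MIN_VALUE && ts - prev > GAP_MS) session++;  // new session
        prev = ts;
        ctx.write(user, new Text(session + "\t" + e));
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "sessionize");
    job.setJarByClass(Sessionizer.class);
    job.setMapperClass(UserMapper.class);
    job.setReducerClass(SessionReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```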
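Because target rates are only 0.01% ~ 1%, the targets-features join can first drop every user who has no target at all, as the slide describes. Below is a minimal sketch of that filter-then-join, assuming an upstream step has tagged each row with T (target) or F (feature vector); the tags and record layout are hypothetical.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TargetFeatureJoin {

  // Input (hypothetical): userId \t T|F \t payload, where T rows are targets
  // and F rows are feature vectors from the feature-generation step.
  public static class TagMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String[] f = value.toString().split("\t", 2);
      if (f.length == 2) ctx.write(new Text(f[0]), new Text(f[1]));
    }
  }

  public static class FilterJoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text user, Iterable<Text> rows, Context ctx)
        throws IOException, InterruptedException {
      List<String> targets = new ArrayList<String>();
      List<String> features = new ArrayList<String>();
      for (Text r : rows) {
        String[] f = r.toString().split("\t", 2);
        if (f.length < 2) continue;  // skip malformed rows
        ("T".equals(f[0]) ? targets : features).add(f[1]);
      }
      if (targets.isEmpty()) return;  // low target rate: most users drop out here
      for (String t : targets) {
        for (String fv : features) {
          ctx.write(user, new Text(t + "\t" + fv));  // labeled training example
        }
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "target-feature-join");
    job.setJarByClass(TargetFeatureJoin.class);
    job.setMapperClass(TagMapper.class);
    job.setReducerClass(FilterJoinReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```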
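Offline scoring is pleasantly parallel: each mapper applies the model weights to one feature vector independently, so the job needs no reduce phase. Below is a minimal map-only sketch of a linear model with a logistic link, assuming a sparse feature:value input layout and a model file of feature/weight pairs on HDFS; the deck's actual pipeline instead compiles generated scoring code with the Janino embedded compiler.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class OfflineScorer {

  public static class ScoreMapper
      extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    private final Map<String, Double> weights = new HashMap<String, Double>();

    @Override
    protected void setup(Context ctx) throws IOException {
      // Load the model (feature \t weight per line) from a side file on HDFS.
      Path model = new Path(ctx.getConfiguration().get("model.path"));
      FileSystem fs = model.getFileSystem(ctx.getConfiguration());
      BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(model)));
      String line;
      while ((line = in.readLine()) != null) {
        String[] f = line.split("\t");
        if (f.length == 2) weights.put(f[0], Double.parseDouble(f[1]));
      }
      in.close();
    }

    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      // Input (hypothetical): userId \t feat1:val1 feat2:val2 ...
      String[] parts = value.toString().split("\t", 2);
      if (parts.length < 2) return;
      double dot = 0.0;
      for (String fv : parts[1].split(" ")) {
        String[] p = fv.split(":");
        if (p.length != 2) continue;
        Double w = weights.get(p[0]);
        if (w != null) dot += w * Double.parseDouble(p[1]);
      }
      double score = 1.0 / (1.0 + Math.exp(-dot));  // logistic score
      ctx.write(new Text(parts[0]), new DoubleWritable(score));
    }
  }

  public static void main(String[] args) throws Exception {
    // Usage: OfflineScorer <input> <output> <modelFile>
    Configuration conf = new Configuration();
    conf.set("model.path", args[2]);
    Job job = new Job(conf, "offline-scoring");
    job.setJarByClass(OfflineScorer.class);
    job.setMapperClass(ScoreMapper.class);
    job.setNumReduceTasks(0);  // map-only: scoring is pleasantly parallel
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(DoubleWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The scored output would then be sorted for metric computation, as the Offline Scoring & Evaluation slide describes.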