The Practice of Data Science


The Practice of Data Science
Ying Li, EV Analysis Corporation

Outline
•  Overview of the practice of data science
•  Recommend principles for the practice of data science
•  Explain the principles through examples
•  Anticipate the outlook for the practice and what to prepare for

http://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/ar/pr

Some Definitions of Data Science/Scientists
Source: KDnuggets http://www.kdnuggets.com/2013/12/what-is-wrong-with-definition-data-science.html
http://www.forbes.com/sites/gilpress/2012/09/27/data-scientists-the-definition-of-sexy/
Source: The Data Science Venn Diagram by Drew Conway http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

What Does a Data Scientist Do
Source: CRISP-DM process
http://cacm.acm.org/blogs/blog-cacm/169199-data-science-workflow-overview-and-challenges/fulltext
•  Define the question
•  Define the ideal data set
•  Determine what data you can access
•  Obtain the data
•  Clean the data
•  Exploratory data analysis
•  Statistical prediction/modeling
•  Interpret results
•  Challenge results
•  Synthesize/write up results
•  Create reproducible code
•  Distribute results to other people
Source: Coursera Data Science Signature Track

Principles: QUEST
•  Question
   –  be very clear on what questions you are trying to answer
      •  questions can change midway, but everything you do works toward answering questions
      •  part of the work may be figuring out what questions to ask of the data; then the question is “what can I learn from the data that’s useful to my business”
   –  if you are not answering a question, you may be at a different stage, such as data engineering
•  Unknowns - know what you do not know
   –  be very clear on what’s unknown and on the assumptions made, explicitly or implicitly
      •  assume there are unknowns, check for them
      •  know the assumptions that come with the formula
•  Explore
   –  look at the data from different perspectives, drill down and up, zoom in on different parts of the data
   –  look for external data to enrich and validate your data
•  Scrupulous vs. Speed, Science vs. Scrappiness
   –  painstakingly check your scripts/methods while knowing that speed to insight and production is critical
   –  hold the highest scientific standard, but be willing to go scrappy when needed, and be intentional about it
•  Truth
   –  truth about the world that my business operates in
   –  truth about the data itself will help but cannot directly generalize to truth about the world

Questions
•  many-core system performance degrades as the number of cores increases
•  what are the factors causing this degradation?
   –  looked at the state transitions of each process over time
   –  we know lots of time is spent on garbage collection
•  what can we do to improve the performance?
   –  modify the garbage collection algorithm
   –  modify the memory location usage so that data is close to the operation
Source: Concurix http://pdos.csail.mit.edu/papers/linux:osdi10.pdf

Unknowns
•  Targeting sell-thru rate calculation
   –  delivered impressions divided by targetable impressions, or
   –  targeting inventory sold as targeting divided by total inventory
http://www.smithsonianmag.com/science-nature/why-google-flu-trends-cant-track-flu-yet-180950076/?no-ist

[Diagram; labels: Targeting Revenue, Accuracy, Intent, Data Quality, Breadth Per User, Depth Per User, CTR, Model efficacy, Pricing, Positioning, Inventory, Sell Through, # of Advertisers, # of Segments, Impression per user, Segment Reach, BT Reach, Targetable Impressions, Targeting Quality, Sales Process, Rev per advertiser, eCPA, Vertical, Latency]

Explore

Scrupulous vs. Speed
•  “my data science team is too academic”
•  Sometimes the customer just needs a first-pass intuition
   –  “I just need some rough data to get my thinking started”
   –  “I just need some directional data to know what my options are; I don’t need to make decisions yet”
   –  “I just need to learn some basics about the product space”
•  Other times the customer needs the analysis to be completely flawless, as business decisions are made based on the results
   –  many businesses made big deals based on some careless “analysis” and the deals turned into big losses
•  Most of the time, it ultimately takes a sound, balanced judgment call to decide when the analysis quality meets the bar for release
   –  be the first to go to market; be willing to release an MVP

Truth
•  Analysis outcome: truth about the world that you can certify with data
   –  population quantity, real-world subjects - the unit of your study
      •  user, machine, app, trend
      •  your company's business
      •  a disease of public health concern
•  The focus is not merely truth about the data, e.g. data properties

Descriptive Analysis – Form the Questions
•  Business KPI (registration) by the hour of the day
•  Looks about the same every day
•  Natural question: why does registration drop every day at this particular hour?
   –  looked at one week, one month, one year… no clue
   –  looked at aggregations of different grain… no clue
•  Usual human behavior doesn't swing up and down each hour so regularly
   –  next question: is this a correct representation of registration by the hour?
      •  filter by geo-location – chart starts to look smoother
      •  production pattern - registration transactions are committed at hour end
•  New question: are the server time zones synchronized?
•  Descriptive results should not be generalized without additional analysis

Exploratory Analysis – Know the Data
•  Predict the human listener emotion elicited by the paralinguistic aspects of speech

Exploratory Analysis – Look for Patterns
•  Lots of plots
•  Lots of stats
•  Show comparison
•  Multivariate data
•  Association rule
•  Clustering
•  We all know correlations do not imply causation, but we still often act as if they do
   –  correlation: “Visits Lexus website” and “Listens to music online”
   –  action: put music on the Lexus web site
   –  “does music make people buy a Lexus?”
   –  “does driving a Lexus make people like music?”
Source: jobaline.com white paper

Predictive Modeling – Math to Money
Geographic • Country • State • City • DMA • Zip • Zip+4 • Real-time Location • Current Place
Demographic • Age • Gender • Language • Income • Education • Employment • Marital Status • Ethnicity • Socioeconomic • Life-Stage • Teen, Collegiate, Single, Newlywed, Family, Retiree • Life-Events • Employed, Renting, Marriage, Moving
Psychographic • Interests • Lifestyle • Brands • Products • Opinions • Attitudes • Political • Religious beliefs
Social • Connections • Friends • Groups • Likes/Dislikes • Communications • Mail • Messaging • Posts • Status
Behavioral • Contextual Behaviors • Searches • Browsing • Ad Interactions • Product Usage • News, Movies, Music content • IMs, Mail, Storage • OS, Productivity s/w, Gaming • Subscriptions, Memberships
Tasks • Purchase Products/Services • In short term • In long term • Researching Products/Services • Planning Events and Occasions • Buying air tickets • Booking Hotels • Leisure and Entertainment • Dining Out • Listening to Music • Watching Movies

[Pipeline diagram; labels in source order: Training Data Collection, Feature Extraction, Feature Aggregation, Feature Selection, Experimentation, Runtime Scoring, Model Training]

Group                 TV Exposure Opportunity   Online Exposure Opportunity
Control (n=185)       No OTS                    No OTS
TV Only (n=518)                                 No OTS
TV + Online (n=190)

Causal Analysis – Very Hard

Mechanistic Analysis – Change Lead Changes
•  Detecting bottlenecks based on trace data
•  Simulate traffic
•  Identify changes to the behavior of each process
•  Processes with the highest number of reductions per incoming message

Trend
•  Data
   –  the many V’s will continue to grow
   –  public availability will continue to grow
   –  data power explosion through combining different data sources
•  Science
   –  reproducibility will be demanded
•  Data scientist
   –  portfolio
   –  clarification/specialization
   –  being measured
   –  accountability/responsibility as motivator
      •  does my analysis result matter – calls to action from my analysis
      •  confirming management’s desire vs. making real discovery
      •  high-stakes analyses
         –  policy/society
         –  health/life science

Data - Combinatorial Power

Ability to Work with Data
•  Different formats of data
   –  database tables
   –  event logs from humans or machines
   –  text files – html, pdf, plain text
   –  sound data – human speech, music
   –  image/video data
•  Different sources of data
   –  value of publicly available data
   –  ability to work with APIs
•  Different data structures
   –  arrays
   –  time series
   –  graph
   –  text strings
   –  wave/mp3/image
•  Usually 80% of all time is spent on data cleaning and preparation
   –  although listed as the first step in the process, it is not meant to be done only once; often the data cleaning/transformation needs to be modified iteratively
•  Data issues that need to be dealt with
   –  illegal values, e.g. gender, address, 999, non-legible characters in text
   –  inconsistency, e.g. time zone, time synchronization
   –  misspelling
   –  duplicates
   –  contradiction
   –  missing values
   –  incomplete
•  “Dirtiness” reflects the use of data
   –  “dirtiness” depends on the analysis task
      •  misspellings are often bought by paid search advertisers
•  Data intuition

To Be Successful – or how to measure
•  Width, Depth, Time
   –  Width in data
   –  Depth in science
      •  algorithms
      •  models
   –  Time to insight
   –  Time-testedness
•  Problem solver
   –  Willing to rely on others and willing to find answers on your own
   –  Knowledgeable about where to find answers on your own
   –  Unintimidated by new data types or packages
   –  Unafraid to say you don't know the answer
   –  Relentless

What Should the Leaders Do
•  Make data analytics an important part of the organization’s DNA
•  Ask the right questions during your reviews, e.g.
   –  What do you know about your unknowns?
   –  What assumptions have you made about the known variables?
   –  How many iterations have you gone through on the models?
   –  How do you measure the performance of your models?

•  Data
   –  feature utilization
   –  enrichment
•  Models
   –  time to model
   –  total live, sold, sold out
   –  response rate, and its lift
   –  accuracy
•  Data products
   –  unit price trend per target
   –  sold, sold out, sell-thru
   –  base populations and identity
      •  scored, scored and active, targeted
   –  latency, availability
      •  batch vs. real time, DB response
      •  uptime
   –  tools
      •  new target creation time
      •  new data source onboarding time
•  Inventories
   –  targetable
   –  targetable and targeted
•  Operation
   –  time to deploy, restore
   –  time to discover, respond
   –  time to resolution
Source: Who’s afraid of Data-Driven Management, HBR Blog, May 16, 2014

The Wisdom Pyramid
•  Wisdom: collective application of knowledge in action
   (the ability to understand and handle things)
•  Knowledge: a message meant to change receiver’s perception
   (the states of things and the laws governing their change)
•  Information: a message meant to change receiver’s perception
   (the states of things and changes in those states)
•  Data: discrete, objective facts about events
   (scattered facts that describe things and events)

Adding Value:
•  Contextualized
•  Categorized
•  Calculated
•  Corrected
•  Condensed

Adding Value:
•  Comparison
•  Consequence
•  Connections
•  Conversations

Adding Value:
•  Action-oriented
•  Measurable efficiency
•  Wiser decisions

Source: Adapted from Liebowitz (2003)

Conclusion
•  What is the most important in the practice of data science?
   –  Science
   –  Data
   –  Question
•  Actually – A Confession

yingli@evanalysiscorp.com
WeChat: bigdata7801
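
The time-zone question raised in the descriptive-analysis example (and the "inconsistency, e.g. time zone" data issue) can be illustrated with a short Python sketch. It assumes a hypothetical log format in which each registration event carries a naive server-local timestamp plus that server's UTC offset in hours. The function name `hourly_counts_utc` and the sample events are illustrative only and do not come from the talk. The idea is to convert every timestamp to UTC before bucketing by hour, so that servers in different zones do not create artificial hourly swings.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def hourly_counts_utc(events):
    """Count events per UTC hour.

    `events` is an iterable of (naive_local_timestamp, utc_offset_hours)
    pairs. This log format is an assumption made for illustration.
    """
    counts = Counter()
    for local_ts, offset_hours in events:
        # Attach the server's fixed offset, then convert to UTC.
        server_tz = timezone(timedelta(hours=offset_hours))
        utc_ts = local_ts.replace(tzinfo=server_tz).astimezone(timezone.utc)
        # Truncate to the start of the hour to form the bucket key.
        bucket = utc_ts.replace(minute=0, second=0, microsecond=0)
        counts[bucket] += 1
    return counts


# Hypothetical sample: two servers in different zones log the same UTC hour.
events = [
    (datetime(2014, 5, 16, 9, 59), -8),   # server at UTC-8
    (datetime(2014, 5, 16, 12, 59), -5),  # server at UTC-5
    (datetime(2014, 5, 16, 18, 5), 0),    # server at UTC
]
counts = hourly_counts_utc(events)
for bucket, n in sorted(counts.items()):
    print(bucket.isoformat(), n)
```

Grouped by raw local hour, these three events would fall into three different buckets (9, 12 and 18). After normalization the first two share one UTC hour, which is the kind of discrepancy the "are the server time zones synchronized?" question is meant to surface.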