Big Data Resources for Media
O'Reilly Media, Inc.

Big Data Resources for Media
by O'Reilly Media, Inc.

Copyright © 2013 O'Reilly Media, Inc. All rights reserved. Printed in the United States of America. Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://my.safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

September 2013: First Edition
Revision History for the First Edition:
2013-09-11: First release

Nutshell Handbook, the Nutshell Handbook logo, and the O'Reilly logo are registered trademarks of O'Reilly Media, Inc. Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O'Reilly Media, Inc., was aware of a trademark claim, the designations have been printed in caps or initial caps.

While every precaution has been taken in the preparation of this book, the publisher and authors assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.

Table of Contents

Part I. Data Science for Business

Introduction: Data-Analytic Thinking
  The Ubiquity of Data Opportunities
  Example: Hurricane Frances
  Example: Predicting Customer Churn
  Data Science, Engineering, and Data-Driven Decision Making
  Data Processing and "Big Data"
  From Big Data 1.0 to Big Data 2.0
  Data and Data Science Capability as a Strategic Asset
  Data-Analytic Thinking
  This Book
  Data Mining and Data Science, Revisited
  Chemistry Is Not About Test Tubes: Data Science Versus the Work of the Data Scientist
  Summary

Business Problems and Data Science Solutions
  From Business Problems to Data Mining Tasks
  Supervised Versus Unsupervised Methods
  Data Mining and Its Results
  The Data Mining Process
  Business Understanding
  Data Understanding
  Data Preparation
  Modeling
  Evaluation
  Deployment
  Implications for Managing the Data Science Team
  Other Analytics Techniques and Technologies
  Statistics
  Database Querying
  Data Warehousing
  Regression Analysis
  Machine Learning and Data Mining
  Answering Business Questions with These Techniques
  Summary

Part II. Bad Data Handbook

Detecting Liars and the Confused in Contradictory Online Reviews
  Weotta
  Getting Reviews
  Sentiment Classification
  Polarized Language
  Corpus Creation
  Training a Classifier
  Validating the Classifier
  Designing with Data
  Lessons Learned
  Summary
  Resources

Part III. Mining the Social Web

Twitter: The Tweet, the Whole Tweet, and Nothing but the Tweet
  Pen : Sword :: Tweet : Machine Gun (?!?)
  Analyzing Tweets (One Entity at a Time)
  Tapping (Tim's) Tweets
  Who Does Tim Retweet Most Often?
  What's Tim's Influence?
  How Many of Tim's Tweets Contain Hashtags?
  Juxtaposing Latent Social Networks (or #JustinBieber Versus #TeaParty)
  What Entities Co-Occur Most Often with #JustinBieber and #TeaParty Tweets?
  On Average, Do #JustinBieber or #TeaParty Tweets Have More Hashtags?
  Which Gets Retweeted More Often: #JustinBieber or #TeaParty?
  How Much Overlap Exists Between the Entities of #TeaParty and #JustinBieber Tweets?
  Visualizing Tons of Tweets
  Visualizing Tweets with Tricked-Out Tag Clouds
  Visualizing Community Structures in Twitter Search Results
  Closing Remarks

Part IV. Planning for Big Data

Introduction

The Feedback Economy
  Data-Obese, Digital-Fast
  The Big Data Supply Chain
  Data collection
  Ingesting and cleaning
  Hardware
  Platforms
  Machine learning
  Human exploration
  Storage
  Sharing and acting
  Measuring and collecting feedback
  Replacing Everything with Data
  A Feedback Economy

What Is Big Data?
  What Does Big Data Look Like?
  Volume
  Velocity
  Variety
  In Practice
  Cloud or in-house?
  Big data is big
  Big data is messy
  Culture
  Know where you want to go

Apache Hadoop
  The Core of Hadoop: MapReduce
  Hadoop's Lower Levels: HDFS and MapReduce
  Improving Programmability: Pig and Hive
  Improving Data Access: HBase, Sqoop, and Flume
  Getting data in and out
  Coordination and Workflow: Zookeeper and Oozie
  Management and Deployment: Ambari and Whirr
  Machine Learning: Mahout
  Using Hadoop

Big Data Market Survey
  Just Hadoop?
  Integrated Hadoop Systems
  EMC Greenplum
  IBM
  Microsoft
  Oracle
  Availability
  Analytical Databases with Hadoop Connectivity
  Quick facts
  Hadoop-Centered Companies
  Cloudera
  Hortonworks
  An overview of Hadoop distributions (part 1)
  An overview of Hadoop distributions (part 2)
  Notes

Microsoft's Plan for Big Data
  Microsoft's Hadoop Distribution
  Developers, Developers, Developers
  Streaming Data and NoSQL
  Toward an Integrated Environment
  The Data Marketplace
  Summary

Big Data in the Cloud
  IaaS and Private Clouds
  Platform solutions
  Amazon Web Services
  Google
  Microsoft
  Big data cloud platforms compared
  Conclusion
  Notes

Data Marketplaces
  What Do Marketplaces Do?
  Infochimps
  Factual
  Windows Azure Data Marketplace
  DataMarket
  Data Markets Compared
  Other Data Suppliers

The NoSQL Movement
  Size, Response, Availability
  Changing Data and Cheap Lunches
  The Sacred Cows
  Other features
  In the End

Why Visualization Matters
  A Picture Is Worth 1000 Rows
  Types of Visualization
  Explaining and exploring
  Your Customers Make Decisions, Too
  Do Yourself a Favor and Hire a Designer

The Future of Big Data
  More Powerful and Expressive Tools for Analysis
  Streaming Data Processing
  Rise of Data Marketplaces
  Development of Data Science Workflows and Tools
  Increased Understanding of and Demand for Visualization

Recommended Reading

PART I. Data Science for Business

Dream no small dreams for they have no power to move the hearts of men.
—Johann Wolfgang von Goethe

Introduction: Data-Analytic Thinking

The past fifteen years have seen extensive investments in business infrastructure, which have improved the ability to collect data throughout the enterprise. Virtually every aspect of business is now open to data collection and often even instrumented for data collection: operations, manufacturing, supply-chain management, customer behavior, marketing campaign performance, workflow procedures, and so on.
At the same time, information is now widely available on external events such as market trends, industry news, and competitors' movements. This broad availability of data has led to increasing interest in methods for extracting useful information and knowledge from data—the realm of data science.

The Ubiquity of Data Opportunities

With vast amounts of data now available, companies in almost every industry are focused on exploiting data for competitive advantage. In the past, firms could employ teams of statisticians, modelers, and analysts to explore datasets manually, but the volume and variety of data have far outstripped the capacity of manual analysis. At the same time, computers have become far more powerful, networking has become ubiquitous, and algorithms have been developed that can connect datasets to enable broader and deeper analyses than previously possible. The convergence of these phenomena has given rise to the increasingly widespread business application of data science principles and data-mining techniques.

Probably the widest applications of data-mining techniques are in marketing for tasks such as targeted marketing, online advertising, and recommendations for cross-selling. Data mining is used for general customer relationship management to analyze customer behavior in order to manage attrition and maximize expected customer value. The finance industry uses data mining for credit scoring and trading, and in operations via fraud detection and workforce management. Major retailers from Walmart to Amazon apply data mining throughout their businesses, from marketing to supply-chain management. Many firms have differentiated themselves strategically with data science, sometimes to the point of evolving into data mining companies.

The primary goals of this book are to help you view business problems from a data perspective and understand principles of extracting useful knowledge from data.
There is a fundamental structure to data-analytic thinking, and basic principles that should be understood. There are also particular areas where intuition, creativity, common sense, and domain knowledge must be brought to bear. A data perspective will provide you with structure and principles, and this will give you a framework to systematically analyze such problems. As you get better at data-analytic thinking you will develop intuition as to how and where to apply creativity and domain knowledge.

Throughout the first two chapters of this book, we will discuss in detail various topics and techniques related to data science and data mining. The terms "data science" and "data mining" often are used interchangeably, and the former has taken on a life of its own as various individuals and organizations try to capitalize on the current hype surrounding it. At a high level, data science is a set of fundamental principles that guide the extraction of knowledge from data. Data mining is the extraction of knowledge from data, via technologies that incorporate these principles. As a term, "data science" often is applied more broadly than the traditional use of "data mining," but data mining techniques provide some of the clearest illustrations of the principles of data science.

It is important to understand data science even if you never intend to apply it yourself. Data-analytic thinking enables you to evaluate proposals for data mining projects. For example, if an employee, a consultant, or a potential investment target proposes to improve a particular business application by extracting knowledge from data, you should be able to assess the proposal systematically and decide whether it is sound or flawed.
This does not mean that you will be able to tell whether it will actually succeed—for data mining projects, that often requires trying—but you should be able to spot obvious flaws, unrealistic assumptions, and missing pieces.

Throughout the book we will describe a number of fundamental data science principles, and will illustrate each with at least one data mining technique that embodies the principle. For each principle there are usually many specific techniques that embody it, so in this book we have chosen to emphasize the basic principles in preference to specific techniques. That said, we will not make a big deal about the difference between data science and data mining, except where it will have a substantial effect on understanding the actual concepts. Let's examine two brief case studies of analyzing data to extract predictive patterns.

Example: Hurricane Frances

Consider an example from a New York Times story from 2004:

Hurricane Frances was on its way, barreling across the Caribbean, threatening a direct hit on Florida's Atlantic coast. Residents made for higher ground, but far away, in Bentonville, Ark., executives at Wal-Mart Stores decided that the situation offered a great opportunity for one of their newest data-driven weapons … predictive technology. A week ahead of the storm's landfall, Linda M. Dillman, Wal-Mart's chief information officer, pressed her staff to come up with forecasts based on what had happened when Hurricane Charley struck several weeks earlier. Backed by the trillions of bytes' worth of shopper history that is stored in Wal-Mart's data warehouse, she felt that the company could 'start predicting what's going to happen, instead of waiting for it to happen,' as she put it. (Hays, 2004)

Consider why data-driven prediction might be useful in this scenario. It might be useful to predict that people in the path of the hurricane would buy more bottled water.
Maybe, but this point seems a bit obvious, and why would we need data science to discover it? It might be useful to project the amount of increase in sales due to the hurricane, to ensure that local Wal-Marts are properly stocked. Perhaps mining the data could reveal that a particular DVD sold out in the hurricane's path—but maybe it sold out that week at Wal-Marts across the country, not just where the hurricane landing was imminent. The prediction could be somewhat useful, but is probably more general than Ms. Dillman was intending.

It would be more valuable to discover patterns due to the hurricane that were not obvious. To do this, analysts might examine the huge volume of Wal-Mart data from prior, similar situations (such as Hurricane Charley) to identify unusual local demand for products. From such patterns, the company might be able to anticipate unusual demand for products and rush stock to the stores ahead of the hurricane's landfall.

Indeed, that is what happened. The New York Times (Hays, 2004) reported that: "… the experts mined the data and found that the stores would indeed need certain products—and not just the usual flashlights. 'We didn't know in the past that strawberry Pop-Tarts increase in sales, like seven times their normal sales rate, ahead of a hurricane,' Ms. Dillman said in a recent interview. 'And the pre-hurricane top-selling item was beer.'"1

1. Of course! What goes better with strawberry Pop-Tarts than a nice cold beer?

Example: Predicting Customer Churn

How are such data analyses performed? Consider a second, more typical business scenario and how it might be treated from a data perspective. This problem will serve as a running example that will illuminate many of the issues raised in this book and provide a common frame of reference.

Assume you just landed a great analytical job with MegaTelCo, one of the largest telecommunication firms in the United States.
They are having a major problem with customer retention in their wireless business. In the mid-Atlantic region, 20% of cell phone customers leave when their contracts expire, and it is getting increasingly difficult to acquire new customers. Since the cell phone market is now saturated, the huge growth in the wireless market has tapered off. Communications companies are now engaged in battles to attract each other's customers while retaining their own. Customers switching from one company to another is called churn, and it is expensive all around: one company must spend on incentives to attract a customer while another company loses revenue when the customer departs.

You have been called in to help understand the problem and to devise a solution. Attracting new customers is much more expensive than retaining existing ones, so a good deal of marketing budget is allocated to prevent churn. Marketing has already designed a special retention offer. Your task is to devise a precise, step-by-step plan for how the data science team should use MegaTelCo's vast data resources to decide which customers should be offered the special retention deal prior to the expiration of their contracts. Think carefully about what data you might use and how they would be used. Specifically, how should MegaTelCo choose a set of customers to receive their offer in order to best reduce churn for a particular incentive budget? Answering this question is much more complicated than it may seem initially. We will return to this problem repeatedly through the book, adding sophistication to our solution as we develop an understanding of the fundamental data science concepts.

In reality, customer retention has been a major use of data mining technologies—especially in telecommunications and finance businesses. These more generally were some of the earliest and widest adopters of data mining technologies, for reasons discussed later.
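The targeting question above can be made concrete with a small sketch. Everything here is invented for illustration (the customer list, the estimated churn probabilities, the value retained if churn is averted, the offer cost, and the assumed offer effectiveness); it simply shows one way to spend a fixed incentive budget on the customers with the highest expected payoff:

```python
# Hypothetical sketch: rank customers by the expected value of making a
# retention offer, then target as many as the incentive budget allows.
# All numbers below are made up for illustration.

OFFER_COST = 50  # assumed incentive cost per targeted customer

customers = [
    # (id, estimated churn probability, value retained if churn is averted)
    ("A", 0.60, 400),
    ("B", 0.10, 900),
    ("C", 0.45, 300),
    ("D", 0.80, 120),
]

def expected_offer_value(p_churn, value, effectiveness=0.35):
    """Expected gain from the offer: probability the customer would churn,
    times the assumed chance the offer retains them, times the value
    retained, minus the incentive cost."""
    return p_churn * effectiveness * value - OFFER_COST

ranked = sorted(customers,
                key=lambda c: expected_offer_value(c[1], c[2]),
                reverse=True)

budget = 100  # total incentive budget
targeted = [c[0] for c in ranked[: budget // OFFER_COST]]
print(targeted)  # ['A', 'C']
```

Ranking by expected value rather than by raw churn probability is the point of the sketch: customer D is the most likely to churn, but retains too little value to justify the incentive.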
Data Science, Engineering, and Data-Driven Decision Making

Data science involves principles, processes, and techniques for understanding phenomena via the (automated) analysis of data. In this book, we will view the ultimate goal of data science as improving decision making, as this generally is of direct interest to business.

Figure 1-1 places data science in the context of various other closely related and data-related processes in the organization. It distinguishes data science from other aspects of data processing that are gaining increasing attention in business. Let's start at the top.

[Figure 1-1. Data science in the context of various data-related processes in the organization.]

Data-driven decision-making (DDD) refers to the practice of basing decisions on the analysis of data, rather than purely on intuition. For example, a marketer could select advertisements based purely on her long experience in the field and her eye for what will work. Or, she could base her selection on the analysis of data regarding how consumers react to different ads. She could also use a combination of these approaches. DDD is not an all-or-nothing practice, and different firms engage in DDD to greater or lesser degrees.

The benefits of data-driven decision-making have been demonstrated conclusively. Economist Erik Brynjolfsson and his colleagues from MIT and Penn's Wharton School conducted a study of how DDD affects firm performance (Brynjolfsson, Hitt, & Kim, 2011). They developed a measure of DDD that rates firms as to how strongly they use data to make decisions across the company. They show that statistically, the more data-driven a firm is, the more productive it is—even controlling for a wide range of possible confounding factors. And the differences are not small. One standard deviation higher on the DDD scale is associated with a 4%–6% increase in productivity.
DDD also is correlated with higher return on assets, return on equity, asset utilization, and market value, and the relationship seems to be causal.

The sort of decisions we will be interested in in this book mainly fall into two types: (1) decisions for which "discoveries" need to be made within data, and (2) decisions that repeat, especially at massive scale, and so decision-making can benefit from even small increases in decision-making accuracy based on data analysis. The Walmart example above illustrates a type 1 problem: Linda Dillman would like to discover knowledge that will help Walmart prepare for Hurricane Frances's imminent arrival.

In 2012, Walmart's competitor Target was in the news for a data-driven decision-making case of its own, also a type 1 problem (Duhigg, 2012). Like most retailers, Target cares about consumers' shopping habits, what drives them, and what can influence them. Consumers tend to have inertia in their habits and getting them to change is very difficult. Decision makers at Target knew, however, that the arrival of a new baby in a family is one point where people do change their shopping habits significantly. In the Target analyst's words, "As soon as we get them buying diapers from us, they're going to start buying everything else too." Most retailers know this and so they compete with each other trying to sell baby-related products to new parents. Since most birth records are public, retailers obtain information on births and send out special offers to the new parents. However, Target wanted to get a jump on their competition. They were interested in whether they could predict that people are expecting a baby. If they could, they would gain an advantage by making offers before their competitors.
Using techniques of data science, Target analyzed historical data on customers who later were revealed to have been pregnant, and were able to extract information that could predict which consumers were pregnant. For example, pregnant mothers often change their diets, their wardrobes, their vitamin regimens, and so on. These indicators could be extracted from historical data, assembled into predictive models, and then deployed in marketing campaigns. We will discuss predictive models in much detail as we go through the book. For the time being, it is sufficient to understand that a predictive model abstracts away most of the complexity of the world, focusing in on a particular set of indicators that correlate in some way with a quantity of interest (who will churn, or who will purchase, who is pregnant, etc.). Importantly, in both the Walmart and the Target examples, the data analysis was not testing a simple hypothesis. Instead, the data were explored with the hope that something useful would be discovered.2

2. Target was successful enough that this case raised ethical questions on the deployment of such techniques. Concerns of ethics and privacy are interesting and very important, but we leave their discussion for another time and place.

Our churn example illustrates a type 2 DDD problem. MegaTelCo has hundreds of millions of customers, each a candidate for defection. Tens of millions of customers have contracts expiring each month, so each one of them has an increased likelihood of defection in the near future. If we can improve our ability to estimate, for a given customer, how profitable it would be for us to focus on her, we can potentially reap large benefits by applying this ability to the millions of customers in the population.
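As a toy illustration of what a predictive model built from such indicators might look like, the sketch below scores a shopping basket with a logistic model over a few indicator features. The feature names, weights, and bias are invented for illustration; a real system would estimate them from historical data rather than set them by hand:

```python
# Minimal sketch of an indicator-based predictive model: a linear score
# over observed behaviors, squashed into a probability. Weights and
# features are hypothetical, not from any real deployment.
import math

WEIGHTS = {
    "unscented_lotion": 1.2,     # assumed indicator
    "vitamin_supplements": 0.9,  # assumed indicator
    "large_tote_bags": 0.6,      # assumed indicator
}
BIAS = -2.0  # assumed base rate term

def score(basket):
    """Logistic score: sigmoid of bias plus the weights of observed indicators."""
    z = BIAS + sum(WEIGHTS[f] for f in basket if f in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))

low = score(["large_tote_bags"])
high = score(["unscented_lotion", "vitamin_supplements", "large_tote_bags"])
print(round(low, 2), round(high, 2))  # 0.2 0.67
```

The point is only the shape of the model: a handful of indicators, each contributing weighted evidence toward a quantity of interest, with more indicators pushing the estimated probability higher.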
This same logic applies to many of the areas where we have seen the most intense application of data science and data mining: direct marketing, online advertising, credit scoring, financial trading, help-desk management, fraud detection, search ranking, product recommendation, and so on.

The diagram in Figure 1-1 shows data science supporting data-driven decision-making, but also overlapping with data-driven decision-making. This highlights the often overlooked fact that, increasingly, business decisions are being made automatically by computer systems. Different industries have adopted automatic decision-making at different rates. The finance and telecommunications industries were early adopters, largely because of their precocious development of data networks and implementation of massive-scale computing, which allowed the aggregation and modeling of data at a large scale, as well as the application of the resultant models to decision-making.

In the 1990s, automated decision-making changed the banking and consumer credit industries dramatically. In the 1990s, banks and telecommunications companies also implemented massive-scale systems for managing data-driven fraud control decisions. As retail systems were increasingly computerized, merchandising decisions were automated. Famous examples include Harrah's casinos' reward programs and the automated recommendations of Amazon and Netflix. Currently we are seeing a revolution in advertising, due in large part to a huge increase in the amount of time consumers are spending online, and the ability online to make (literally) split-second advertising decisions.

Data Processing and "Big Data"

It is important to digress here to address another point. There is a lot to data processing that is not data science—despite the impression one might get from the media. Data engineering and processing are critical to support data science, but they are more general.
For example, these days many data processing skills, systems, and technologies often are mistakenly cast as data science. To understand data science and data-driven businesses it is important to understand the differences. Data science needs access to data and it often benefits from sophisticated data engineering that data processing technologies may facilitate, but these technologies are not data science technologies per se. They support data science, as shown in Figure 1-1, but they are useful for much more. Data processing technologies are very important for many data-oriented business tasks that do not involve extracting knowledge or data-driven decision-making, such as efficient transaction processing, modern web system processing, and online advertising campaign management.

"Big data" technologies (such as Hadoop, HBase, and MongoDB) have received considerable media attention recently. Big data essentially means datasets that are too large for traditional data processing systems, and therefore require new processing technologies. As with the traditional technologies, big data technologies are used for many tasks, including data engineering. Occasionally, big data technologies are actually used for implementing data mining techniques. However, much more often the well-known big data technologies are used for data processing in support of the data mining techniques and other data science activities, as represented in Figure 1-1.

Previously, we discussed Brynjolfsson's study demonstrating the benefits of data-driven decision-making. A separate study, conducted by economist Prasanna Tambe of NYU's Stern School, examined the extent to which big data technologies seem to help firms (Tambe, 2012). He finds that, after controlling for various possible confounding factors, using big data technologies is associated with significant additional productivity growth.
Specifically, one standard deviation higher utilization of big data technologies is associated with 1%–3% higher productivity than the average firm; one standard deviation lower in terms of big data utilization is associated with 1%–3% lower productivity. This leads to potentially very large productivity differences between the firms at the extremes.

From Big Data 1.0 to Big Data 2.0

One way to think about the state of big data technologies is to draw an analogy with the business adoption of Internet technologies. In Web 1.0, businesses busied themselves with getting the basic internet technologies in place, so that they could establish a web presence, build electronic commerce capability, and improve the efficiency of their operations. We can think of ourselves as being in the era of Big Data 1.0. Firms are busying themselves with building the capabilities to process large data, largely in support of their current operations—for example, to improve efficiency.

Once firms had incorporated Web 1.0 technologies thoroughly (and in the process had driven down prices of the underlying technology) they started to look further. They began to ask what the Web could do for them, and how it could improve things they'd always done—and we entered the era of Web 2.0, where new systems and companies began taking advantage of the interactive nature of the Web. The changes brought on by this shift in thinking are pervasive; the most obvious are the incorporation of social-networking components, and the rise of the "voice" of the individual consumer (and citizen).

We should expect a Big Data 2.0 phase to follow Big Data 1.0. Once firms have become capable of processing massive data in a flexible fashion, they should begin asking: "What can I now do that I couldn't do before, or do better than I could do before?" This is likely to be the golden era of data science.
The principles and techniques we introduce in this book will be applied far more broadly and deeply than they are today.

It is important to note that in the Web 1.0 era some precocious companies began applying Web 2.0 ideas far ahead of the mainstream. Amazon is a prime example, incorporating the consumer's "voice" early on, in the rating of products, in product reviews (and deeper, in the rating of product reviews). Similarly, we see some companies already applying Big Data 2.0. Amazon again is a company at the forefront, providing data-driven recommendations from massive data. There are other examples as well. Online advertisers must process extremely large volumes of data (billions of ad impressions per day is not unusual) and maintain a very high throughput (real-time bidding systems make decisions in tens of milliseconds). We should look to these and similar industries for hints at advances in big data and data science that subsequently will be adopted by other industries.

Data and Data Science Capability as a Strategic Asset

The prior sections suggest one of the fundamental principles of data science: data, and the capability to extract useful knowledge from data, should be regarded as key strategic assets. Too many businesses regard data analytics as pertaining mainly to realizing value from some existing data, and often without careful regard to whether the business has the appropriate analytical talent. Viewing these as assets allows us to think explicitly about the extent to which one should invest in them. Often, we don't have exactly the right data to best make decisions and/or the right talent to best support making decisions from the data. Further, thinking of these as assets should lead us to the realization that they are complementary.
The best data science team can yield little value without the appropriate data; the right data often cannot substantially improve decisions without suitable data science talent. As with all assets, it is often necessary to make investments. Building a top-notch data science team is a nontrivial undertaking, but can make a huge difference for decision-making. We will discuss strategic considerations involving data science in detail in Chapter 13. Our next case study will introduce the idea that thinking explicitly about how to invest in data assets very often pays off handsomely.

The classic story of little Signet Bank from the 1990s provides a case in point. Previously, in the 1980s, data science had transformed the business of consumer credit. Modeling the probability of default had changed the industry from personal assessment of the likelihood of default to strategies of massive scale and market share, which brought along concomitant economies of scale. It may seem strange now, but at the time, credit cards essentially had uniform pricing, for two reasons: (1) the companies did not have adequate information systems to deal with differential pricing at massive scale, and (2) bank management believed customers would not stand for price discrimination.

Around 1990, two strategic visionaries (Richard Fairbanks and Nigel Morris) realized that information technology was powerful enough that they could do more sophisticated predictive modeling—using the sort of techniques that we discuss throughout this book—and offer different terms (nowadays: pricing, credit limits, low-initial-rate balance transfers, cash back, loyalty points, and so on). These two men had no success persuading the big banks to take them on as consultants and let them try. Finally, after running out of big banks, they succeeded in garnering the interest of a small regional Virginia bank: Signet Bank.
Signet Bank’s management was convinced that modeling profitability, not just default probability, was the right strategy. They knew that a small proportion of customers actually account for more than 100% of a bank’s profit from credit card operations (because the rest are break-even or money-losing). If they could model profitability, they could make better offers to the best customers and “skim the cream” of the big banks’ clientele. But Signet Bank had one really big problem in implementing this strategy. They did not have the appropriate data to model profitability with the goal of offering different terms to different customers. No one did. Since banks were offering credit with a specific set of terms and a specific default model, they had the data to model profitability (1) for the terms they actually have offered in the past, and (2) for the sort of customer who was actually offered credit (that is, those who were deemed worthy of credit by the existing model).

What could Signet Bank do? They brought into play a fundamental strategy of data science: acquire the necessary data at a cost. Once we view data as a business asset, we should think about whether and how much we are willing to invest. In Signet’s case, data could be generated on the profitability of customers given different credit terms by conducting experiments. Different terms were offered at random to different customers. This may seem foolish outside the context of data-analytic thinking: you’re likely to lose money! This is true. In this case, losses are the cost of data acquisition. The data-analytic thinker needs to consider whether she expects the data to have sufficient value to justify the investment.

So what happened with Signet Bank? As you might expect, when Signet began randomly offering terms to customers for data acquisition, the number of bad accounts soared. Signet went from an industry-leading “charge-off” rate (2.9% of balances went unpaid) to almost 6% charge-offs.
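The randomized data-acquisition strategy can be sketched in a few lines. The APRs, credit limits, and cohort size below are all hypothetical, and the outcome of each account (its profitability, observed over time) is left as a placeholder; what matters is that random assignment populates every combination of terms with customers, including combinations never offered under the old uniform-pricing policy.

```python
import random

# Sketch of the data-acquisition experiment: offer terms at random so that
# the resulting data cover term combinations the bank never tried under
# uniform pricing. APRs, limits, and the cohort size are invented.

random.seed(0)
APRS = [0.119, 0.149, 0.199]
LIMITS = [1500, 5000, 10000]

def random_offer():
    return {"apr": random.choice(APRS), "limit": random.choice(LIMITS)}

experiment = []
for customer_id in range(1000):
    offer = random_offer()
    # Profitability is observed later, over the life of the account;
    # here it is just a placeholder to be filled in as data arrive.
    experiment.append({"customer": customer_id, **offer, "profit": None})

# Every (apr, limit) cell receives customers, so profitability can later
# be modeled even for terms never offered under the old policy.
covered_cells = {(row["apr"], row["limit"]) for row in experiment}
```

The losses come from the cells containing offers a risk-averse bank would never make deliberately; those same cells are exactly where the new information lives.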
Losses continued for a few years while the data scientists worked to build predictive models from the data, evaluate them, and deploy them to improve profit. Because the firm viewed these losses as investments in data, they persisted despite complaints from stakeholders. Eventually, Signet’s credit card operation turned around and became so profitable that it was spun off to separate it from the bank’s other operations, which now were overshadowing the consumer credit success.

Fairbanks and Morris became Chairman and CEO and President and COO, and proceeded to apply data science principles throughout the business—not just customer acquisition but retention as well. When a customer calls looking for a better offer, data-driven models calculate the potential profitability of various possible actions (different offers, including sticking with the status quo), and the customer service representative’s computer presents the best offers to make.

You may not have heard of little Signet Bank, but if you’re reading this book you’ve probably heard of the spin-off: Capital One. Fairbanks and Morris’s new company grew to be one of the largest credit card issuers in the industry with one of the lowest charge-off rates. In 2000, the bank was reported to be carrying out 45,000 of these “scientific tests” as they called them (you can read more about Capital One’s story in Clemons & Thatcher, 1998, and McNamee, 2001).

Studies giving clear quantitative demonstrations of the value of a data asset are hard to find, primarily because firms are hesitant to divulge results of strategic value. One exception is a study by Martens and Provost (2011) assessing whether data on the specific transactions of a bank’s consumers can improve models for deciding what product offers to make. The bank built models from data to decide whom to target with offers for different products.
The investigation examined a number of different types of data and their effects on predictive performance. Sociodemographic data provide a substantial ability to model the sort of consumers that are more likely to purchase one product or another. However, sociodemographic data only go so far; after a certain volume of data, no additional advantage is conferred. In contrast, detailed data on customers’ individual (anonymized) transactions improve performance substantially over just using sociodemographic data. The relationship is clear and striking and—significantly, for the point here—the predictive performance continues to improve as more data are used, increasing throughout the range investigated by Martens and Provost with no sign of abating. This has an important implication: banks with bigger data assets may have an important strategic advantage over their smaller competitors. If these trends generalize, and the banks are able to apply sophisticated analytics, banks with bigger data assets should be better able to identify the best customers for individual products. The net result will be either increased adoption of the bank’s products, decreased cost of customer acquisition, or both.

The idea of data as a strategic asset is certainly not limited to Capital One, nor even to the banking industry. Amazon was able to gather data early on online customers, which has created significant switching costs: consumers find value in the rankings and recommendations that Amazon provides. Amazon therefore can retain customers more easily, and can even charge a premium (Brynjolfsson & Smith, 2000). Harrah’s casinos famously invested in gathering and mining data on gamblers, and moved itself from a small player in the casino business in the mid-1990s to the acquisition of Caesar’s Entertainment in 2005 to become the world’s largest gambling company.
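The learning-curve effect described in the Martens and Provost study (holdout performance improving as more training data are used) can be sketched with a toy experiment. The synthetic data generator and the nearest-centroid model below are invented stand-ins, not the study’s actual bank data or methods; they show only the mechanics of tracing accuracy against training-set size.

```python
import random

# Toy learning curve: train a simple model on increasing amounts of data
# and measure accuracy on a fixed holdout set. Data and model are invented
# stand-ins for illustration only.

random.seed(1)
DIM = 5

def make_example():
    x = [random.gauss(0, 1) for _ in range(DIM)]
    return x, 1 if sum(x) > 0 else 0       # label depends on all features

def centroid_model(train):
    """Average each class's features; predict the class of the nearer centroid."""
    centroids = {}
    for label in (0, 1):
        rows = [x for x, y in train if y == label] or [[0.0] * DIM]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    def predict(x):
        def sq_dist(c):
            return sum((a - b) ** 2 for a, b in zip(x, c))
        return min(centroids, key=lambda label: sq_dist(centroids[label]))
    return predict

holdout = [make_example() for _ in range(2000)]

def holdout_accuracy(n_train):
    predict = centroid_model([make_example() for _ in range(n_train)])
    return sum(predict(x) == y for x, y in holdout) / len(holdout)

curve = {n: holdout_accuracy(n) for n in (10, 100, 1000)}
```

Whether such a curve keeps rising or flattens out at large sizes is exactly the strategic question: a plateau means extra data confer no advantage, while a still-rising curve favors the firm with the bigger data asset.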
The huge valuation of Facebook has been credited to its vast and unique data assets (Sengupta, 2012), including both information about individuals and their likes, as well as information about the structure of the social network. Information about network structure has been shown to be important for prediction, and remarkably helpful in building models of who will buy certain products (Hill, Provost, & Volinsky, 2006). It is clear that Facebook has a remarkable data asset; whether they have the right data science strategies to take full advantage of it is an open question. In the book we will discuss in more detail many of the fundamental concepts behind these success stories, in exploring the principles of data mining and data-analytic thinking.

Data-Analytic Thinking

Analyzing case studies such as the churn problem improves our ability to approach problems “data-analytically.” Promoting such a perspective is a primary goal of this book. When faced with a business problem, you should be able to assess whether and how data can improve performance. We will discuss a set of fundamental concepts and principles that facilitate careful thinking. We will develop frameworks to structure the analysis so that it can be done systematically.

As mentioned above, it is important to understand data science even if you never intend to do it yourself, because data analysis is now so critical to business strategy. Businesses increasingly are driven by data analytics (this is not a new phenomenon, of course: Amazon and Google are well-established companies that get tremendous value from their data assets), so there is great professional advantage in being able to interact competently with and within such businesses.
Understanding the fundamental concepts, and having frameworks for organizing data-analytic thinking, not only will allow one to interact competently, but will help to envision opportunities for improving data-driven decision-making, or to see data-oriented competitive threats. Firms in many traditional industries are exploiting new and existing data resources for competitive advantage. They employ data science teams to bring advanced technologies to bear to increase revenue and to decrease costs. In addition, many new companies are being developed with data mining as a key strategic component. Facebook and Twitter, along with many other “Digital 100” companies (Business Insider, 2012), have high valuations due primarily to data assets they are committed to capturing or creating. Increasingly, managers need to oversee analytics teams and analysis projects, marketers have to organize and understand data-driven campaigns, venture capitalists must be able to invest wisely in businesses with substantial data assets, and business strategists must be able to devise plans that exploit data.

As a few examples: if a consultant presents a proposal to mine a data asset to improve your business, you should be able to assess whether the proposal makes sense. If a competitor announces a new data partnership, you should recognize when it may put you at a strategic disadvantage. Or, let’s say you take a position with a venture firm and your first project is to assess the potential for investing in an advertising company. The founders present a convincing argument that they will realize significant value from a unique body of data they will collect, and on that basis are arguing for a substantially higher valuation. Is this reasonable? With an understanding of the fundamentals of data science you should be able to devise a few probing questions to determine whether their valuation arguments are plausible.
On a scale less grand, but probably more common, data analytics projects reach into all business units. Employees throughout these units must interact with the data science team. If these employees do not have a fundamental grounding in the principles of data-analytic thinking, they will not really understand what is happening in the business. This lack of understanding is much more damaging in data science projects than in other technical projects, because the data science is supporting improved decision-making. As we will describe in the next chapter, this requires a close interaction between the data scientists and the business people responsible for the decision-making. Firms where the business people do not understand what the data scientists are doing are at a substantial disadvantage, because they waste time and effort or, worse, because they ultimately make wrong decisions.

The need for managers with data-analytic skills

The consulting firm McKinsey and Company estimates that “there will be a shortage of talent necessary for organizations to take advantage of big data. By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions” (Manyika, 2011). Why 10 times as many managers and analysts as those with deep analytical skills? Surely data scientists aren’t so difficult to manage that they need 10 managers! The reason is that a business can get leverage from a data science team for making better decisions in multiple areas of the business. However, as McKinsey points out, the managers in those areas need to understand the fundamentals of data science to effectively get that leverage.

This Book

This book concentrates on the fundamentals of data science and data mining.
These are a set of principles, concepts, and techniques that structure thinking and analysis. They allow us to understand data science processes and methods surprisingly deeply, without needing to focus in depth on the large number of specific data mining algorithms.

There are many good books covering data mining algorithms and techniques, from practical guides to mathematical and statistical treatments. This book instead focuses on the fundamental concepts and how they help us to think about problems where data mining may be brought to bear. That doesn’t mean that we will ignore the data mining techniques; many algorithms are exactly the embodiment of the basic concepts. But with only a few exceptions we will not concentrate on the deep technical details of how the techniques actually work; we will try to provide just enough detail so that you will understand what the techniques do, and how they are based on the fundamental principles.

Data Mining and Data Science, Revisited

This book devotes a good deal of attention to the extraction of useful (nontrivial, hopefully actionable) patterns or models from large bodies of data (Fayyad, Piatetsky-Shapiro, & Smyth, 1996), and to the fundamental data science principles underlying such data mining. In our churn-prediction example, we would like to take the data on prior churn and extract patterns, for example patterns of behavior, that are useful—that can help us to predict those customers who are more likely to leave in the future, or that can help us to design better services.

The fundamental concepts of data science are drawn from many fields that study data analytics. We introduce these concepts throughout the book, but let’s briefly discuss a few now to get the basic flavor. We will elaborate on all of these and more in later chapters.
Fundamental concept: Extracting useful knowledge from data to solve business problems can be treated systematically by following a process with reasonably well-defined stages. The Cross Industry Standard Process for Data Mining, abbreviated CRISP-DM (CRISP-DM Project, 2000), is one codification of this process. Keeping such a process in mind provides a framework to structure our thinking about data analytics problems. For example, in actual practice one repeatedly sees analytical “solutions” that are not based on careful analysis of the problem or are not carefully evaluated. Structured thinking about analytics emphasizes these often under-appreciated aspects of supporting decision-making with data. Such structured thinking also contrasts critical points where human creativity is necessary versus points where high-powered analytical tools can be brought to bear.

Fundamental concept: From a large mass of data, information technology can be used to find informative descriptive attributes of entities of interest. In our churn example, a customer would be an entity of interest, and each customer might be described by a large number of attributes, such as usage, customer service history, and many other factors. Which of these actually gives us information on the customer’s likelihood of leaving the company when her contract expires? How much information? Sometimes this process is referred to roughly as finding variables that “correlate” with churn (we will discuss this notion precisely). A business analyst may be able to hypothesize some and test them, and there are tools to help facilitate this experimentation (see “Other Analytics Techniques and Technologies” on page 43). Alternatively, the analyst could apply information technology to automatically discover informative attributes—essentially doing large-scale automated experimentation.
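One standard way to automatically rank attributes by informativeness is entropy reduction, commonly called information gain. A minimal sketch, over an invented six-customer table chosen so the effect is easy to verify by hand (the attribute names are hypothetical, not from any real churn dataset):

```python
from collections import Counter
from math import log2

# Sketch of automatically ranking attributes by how much information they
# carry about churn, using entropy reduction (information gain). The tiny
# customer table is invented so the result can be checked by hand.

customers = [
    {"heavy_user": True,  "had_complaint": False, "churned": False},
    {"heavy_user": True,  "had_complaint": True,  "churned": False},
    {"heavy_user": False, "had_complaint": True,  "churned": True},
    {"heavy_user": False, "had_complaint": False, "churned": True},
    {"heavy_user": True,  "had_complaint": True,  "churned": False},
    {"heavy_user": False, "had_complaint": True,  "churned": True},
]

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((n / total) * log2(n / total)
                for n in Counter(labels).values())

def information_gain(attribute):
    """How much knowing `attribute` reduces our uncertainty about churn."""
    base = entropy([c["churned"] for c in customers])
    remainder = 0.0
    for value in {c[attribute] for c in customers}:
        subset = [c["churned"] for c in customers if c[attribute] == value]
        remainder += len(subset) / len(customers) * entropy(subset)
    return base - remainder

gains = {a: information_gain(a) for a in ("heavy_user", "had_complaint")}
```

In this contrived table, `heavy_user` perfectly separates churners from non-churners while `had_complaint` tells us nothing; ranking attributes this way over many candidates is the “large-scale automated experimentation” in miniature.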
Further, as we will see, this concept can be applied recursively to build models to predict churn based on multiple attributes.

Fundamental concept: If you look too hard at a set of data, you will find something—but it might not generalize beyond the data you’re looking at. This is referred to as overfitting a dataset. Data mining techniques can be very powerful, and the need to detect and avoid overfitting is one of the most important concepts to grasp when applying data mining to real problems. The concept of overfitting and its avoidance permeates data science processes, algorithms, and evaluation methods.

Fundamental concept: Formulating data mining solutions and evaluating the results involves thinking carefully about the context in which they will be used. If our goal is the extraction of potentially useful knowledge, how can we formulate what is useful? It depends critically on the application in question. For our churn-management example, how exactly are we going to use the patterns extracted from historical data? Should the value of the customer be taken into account in addition to the likelihood of leaving? More generally, does the pattern lead to better decisions than some reasonable alternative? How well would one have done by chance? How well would one do with a smart “default” alternative?

These are just four of the fundamental concepts of data science that we will explore. By the end of the book, we will have discussed a dozen such fundamental concepts in detail, and will have illustrated how they help us to structure data-analytic thinking and to understand data mining techniques and algorithms, as well as data science applications, quite generally.

Chemistry Is Not About Test Tubes: Data Science Versus the Work of the Data Scientist

Before proceeding, we should briefly revisit the engineering side of data science.
At the time of this writing, discussions of data science commonly mention not just analytical skills and techniques for understanding data but popular tools used. Definitions of data scientists (and advertisements for positions) specify not just areas of expertise but also specific programming languages and tools. It is common to see job advertisements mentioning data mining techniques (e.g., random forests, support vector machines), specific application areas (recommendation systems, ad placement optimization), alongside popular software tools for processing big data (Hadoop, MongoDB). There is often little distinction between the science and the technology for dealing with large datasets.

We must point out that data science, like computer science, is a young field. The particular concerns of data science are fairly new and general principles are just beginning to emerge. The state of data science may be likened to that of chemistry in the mid-19th century, when theories and general principles were being formulated and the field was largely experimental. Every good chemist had to be a competent lab technician. Similarly, it is hard to imagine a working data scientist who is not proficient with certain sorts of software tools.

Having said this, this book focuses on the science and not on the technology. You will not find instructions here on how best to run massive data mining jobs on Hadoop clusters, or even what Hadoop is or why you might want to learn about it. (OK: Hadoop is a widely used open source architecture for doing highly parallelizable computations. It is one of the current “big data” technologies for processing massive datasets that exceed the capacity of relational database systems. Hadoop is based on the MapReduce parallel processing framework introduced by Google.) We focus here on the general principles of data science that have emerged. In 10 years’ time the predominant technologies will likely have changed or advanced enough that a discussion here would be obsolete, while the general principles are the same as they were 20 years ago, and likely will change little over the coming decades.

Summary

This book is about the extraction of useful information and knowledge from large volumes of data, in order to improve business decision-making. As the massive collection of data has spread through just about every industry sector and business unit, so have the opportunities for mining the data. Underlying the extensive body of techniques for mining data is a much smaller set of fundamental concepts comprising data science. These concepts are general and encapsulate much of the essence of data mining and business analytics.

Success in today’s data-oriented business environment requires being able to think about how these fundamental concepts apply to particular business problems—to think data-analytically. For example, in this chapter we discussed the principle that data should be thought of as a business asset, and once we are thinking in this direction we start to ask whether (and how much) we should invest in data. Thus, an understanding of these fundamental concepts is important not only for data scientists themselves, but for anyone working with data scientists, employing data scientists, investing in data-heavy ventures, or directing the application of analytics in an organization.

Thinking data-analytically is aided by conceptual frameworks discussed throughout the book. For example, the automated extraction of patterns from data is a process with well-defined stages, which are the subject of the next chapter. Understanding the process and the stages helps to structure our data-analytic thinking, and to make it more systematic and therefore less prone to errors and omissions.
There is convincing evidence that data-driven decision-making and big data technologies substantially improve business performance. Data science supports data-driven decision-making—and sometimes conducts such decision-making automatically—and depends upon technologies for “big data” storage and engineering, but its principles are separate. The data science principles we discuss in this book also differ from, and are complementary to, other important technologies, such as statistical hypothesis testing and database querying (which have their own books and classes). The next chapter describes some of these differences in more detail.

Business Problems and Data Science Solutions

Fundamental concepts: A set of canonical data mining tasks; The data mining process; Supervised versus unsupervised data mining.

An important principle of data science is that data mining is a process with fairly well-understood stages. Some involve the application of information technology, such as the automated discovery and evaluation of patterns from data, while others mostly require an analyst’s creativity, business knowledge, and common sense. Understanding the whole process helps to structure data mining projects, so they are closer to systematic analyses rather than heroic endeavors driven by chance and individual acumen.

Since the data mining process breaks up the overall task of finding patterns from data into a set of well-defined subtasks, it is also useful for structuring discussions about data science. In this book, we will use the process as an overarching framework for our discussion. This chapter introduces the data mining process, but first we provide additional context by discussing common types of data mining tasks. Introducing these allows us to be more concrete when presenting the overall process, as well as when introducing other concepts in subsequent chapters.
We close the chapter by discussing a set of important business analytics subjects that are not the focus of this book (but for which there are many other helpful books), such as databases, data warehousing, and basic statistics.

From Business Problems to Data Mining Tasks

Each data-driven business decision-making problem is unique, comprising its own combination of goals, desires, constraints, and even personalities. As with much engineering, though, there are sets of common tasks that underlie the business problems. In collaboration with business stakeholders, data scientists decompose a business problem into subtasks. The solutions to the subtasks can then be composed to solve the overall problem. Some of these subtasks are unique to the particular business problem, but others are common data mining tasks. For example, our telecommunications churn problem is unique to MegaTelCo: there are specifics of the problem that are different from churn problems of any other telecommunications firm. However, a subtask that will likely be part of the solution to any churn problem is to estimate from historical data the probability of a customer terminating her contract shortly after it has expired. Once the idiosyncratic MegaTelCo data have been assembled into a particular format (described in the next chapter), this probability estimation fits the mold of one very common data mining task. We know a lot about solving the common data mining tasks, both scientifically and practically. In later chapters, we also will provide data science frameworks to help with the decomposition of business problems and with the re-composition of the solutions to the subtasks.

A critical skill in data science is the ability to decompose a data-analytics problem into pieces such that each piece matches a known task for which tools are available. Recognizing familiar problems and their solutions avoids wasting time and resources reinventing the wheel.
It also allows people to focus attention on more interesting parts of the process that require human involvement—parts that have not been automated, so human creativity and intelligence must come into play.

Despite the large number of specific data mining algorithms developed over the years, there are only a handful of fundamentally different types of tasks these algorithms address. It is worth defining these tasks clearly. The next several chapters will use the first two (classification and regression) to illustrate several fundamental concepts. In what follows, the term “an individual” will refer to an entity about which we have data, such as a customer or a consumer, or it could be an inanimate entity such as a business. We will make this notion more precise in Chapter 3. In many business analytics projects, we want to find “correlations” between a particular variable describing an individual and other variables. For example, in historical data we may know which customers left the company after their contracts expired. We may want to find out which other variables correlate with a customer leaving in the near future. Finding such correlations is the most basic example of classification and regression tasks.

1. Classification and class probability estimation attempt to predict, for each individual in a population, which of a (small) set of classes this individual belongs to. Usually the classes are mutually exclusive. An example classification question would be: “Among all the customers of MegaTelCo, which are likely to respond to a given offer?” In this example the two classes could be called will respond and will not respond. For a classification task, a data mining procedure produces a model that, given a new individual, determines which class that individual belongs to. A closely related task is scoring or class probability estimation.
A scoring model applied to an individual produces, instead of a class prediction, a score representing the probability (or some other quantification of likelihood) that that individual belongs to each class. In our customer response scenario, a scoring model would be able to evaluate each individual customer and produce a score of how likely each is to respond to the offer. Classification and scoring are very closely related; as we shall see, a model that can do one can usually be modified to do the other.

2. Regression (“value estimation”) attempts to estimate or predict, for each individual, the numerical value of some variable for that individual. An example regression question would be: “How much will a given customer use the service?” The property (variable) to be predicted here is service usage, and a model could be generated by looking at other, similar individuals in the population and their historical usage. A regression procedure produces a model that, given an individual, estimates the value of the particular variable specific to that individual. Regression is related to classification, but the two are different. Informally, classification predicts whether something will happen, whereas regression predicts how much something will happen. The difference will become clearer as the book progresses.

3. Similarity matching attempts to identify similar individuals based on data known about them. Similarity matching can be used directly to find similar entities. For example, IBM is interested in finding companies similar to their best business customers, in order to focus their sales force on the best opportunities. They use similarity matching based on “firmographic” data describing characteristics of the companies.
Similarity matching is the basis for one of the most popular methods for making product recommendations (finding people who are similar to you in terms of the products they have liked or have purchased). Similarity measures underlie certain solutions to other data mining tasks, such as classification, regression, and clustering. We discuss similarity and its uses at length in Chapter 6.

4. Clustering attempts to group individuals in a population together by their similarity, but not driven by any specific purpose. An example clustering question would be: “Do our customers form natural groups or segments?” Clustering is useful in preliminary domain exploration to see which natural groups exist because these groups in turn may suggest other data mining tasks or approaches. Clustering also is used as input to decision-making processes focusing on questions such as: What products should we offer or develop? How should our customer care teams (or sales teams) be structured? We discuss clustering in depth in Chapter 6.

5. Co-occurrence grouping (also known as frequent itemset mining, association rule discovery, and market-basket analysis) attempts to find associations between entities based on transactions involving them. An example co-occurrence question would be: What items are commonly purchased together? While clustering looks at similarity between objects based on the objects’ attributes, co-occurrence grouping considers similarity of objects based on their appearing together in transactions. For example, analyzing purchase records from a supermarket may uncover that ground meat is purchased together with hot sauce much more frequently than we might expect. Deciding how to act upon this discovery might require some creativity, but it could suggest a special promotion, product display, or combination offer. Co-occurrence of products in purchases is a common type of grouping known as market-basket analysis.
Some recommendation systems also perform a type of affinity grouping by finding, for example, pairs of books that are purchased frequently by the same people (“people who bought X also bought Y”). The result of co-occurrence grouping is a description of items that occur together. These descriptions usually include statistics on the frequency of the co-occurrence and an estimate of how surprising it is.

6. Profiling (also known as behavior description) attempts to characterize the typical behavior of an individual, group, or population. An example profiling question would be: “What is the typical cell phone usage of this customer segment?” Behavior may not have a simple description; profiling cell phone usage might require a complex description of night and weekend airtime averages, international usage, roaming charges, text minutes, and so on. Behavior can be described generally over an entire population, or down to the level of small groups or even individuals. Profiling is often used to establish behavioral norms for anomaly detection applications such as fraud detection and monitoring for intrusions to computer systems (such as someone breaking into your iTunes account). For example, if we know what kind of purchases a person typically makes on a credit card, we can determine whether a new charge on the card fits that profile or not. We can use the degree of mismatch as a suspicion score and issue an alarm if it is too high.

7. Link prediction attempts to predict connections between data items, usually by suggesting that a link should exist, and possibly also estimating the strength of the link. Link prediction is common in social networking systems: “Since you and Karen share 10 friends, maybe you’d like to be Karen’s friend?” Link prediction can also estimate the strength of a link.
For example, for recommending movies to customers one can think of a graph between customers and the movies they’ve watched or rated. Within the graph, we search for links that do not exist between customers and movies, but that we predict should exist and should be strong. These links form the basis for recommendations.

8. Data reduction attempts to take a large set of data and replace it with a smaller set of data that contains much of the important information in the larger set. The smaller dataset may be easier to deal with or to process. Moreover, the smaller dataset may better reveal the information. For example, a massive dataset on consumer movie-viewing preferences may be reduced to a much smaller dataset revealing the consumer taste preferences that are latent in the viewing data (for example, viewer genre preferences). Data reduction usually involves loss of information. What is important is the trade-off for improved insight.

9. Causal modeling attempts to help us understand what events or actions actually influence others. For example, consider that we use predictive modeling to target advertisements to consumers, and we observe that indeed the targeted consumers purchase at a higher rate subsequent to having been targeted. Was this because the advertisements influenced the consumers to purchase? Or did the predictive models simply do a good job of identifying those consumers who would have purchased anyway? Techniques for causal modeling include those involving a substantial investment in data, such as randomized controlled experiments (e.g., so-called “A/B tests”), as well as sophisticated methods for drawing causal conclusions from observational data.
Both experimental and observational methods for causal modeling generally can be viewed as “counterfactual” analysis: they attempt to understand what would be the difference between the situations—which cannot both happen—where the “treatment” event (e.g., showing an advertisement to a particular individual) were to happen, and were not to happen. In all cases, a careful data scientist should always include with a causal conclusion the exact assumptions that must be made in order for the causal conclusion to hold (there always are such assumptions—always ask). When undertaking causal modeling, a business needs to weigh the trade-off of increasing investment to reduce the assumptions made, versus deciding that the conclusions are good enough given the assumptions. Even in the most careful randomized, controlled experimentation, assumptions are made that could render the causal conclusions invalid. The discovery of the “placebo effect” in medicine illustrates a notorious situation where an assumption was overlooked in carefully designed randomized experimentation.

Discussing all of these tasks in detail would fill multiple books. In this book, we present a collection of the most fundamental data science principles—principles that together underlie all of these types of tasks. We will illustrate the principles mainly using classification, regression, similarity matching, and clustering, and will discuss others when they provide important illustrations of the fundamental principles (toward the end of the book).

Consider which of these types of tasks might fit our churn-prediction problem. Often, practitioners formulate churn prediction as a problem of finding segments of customers who are more or less likely to leave. This segmentation problem sounds like a classification problem, or possibly clustering, or even regression.
To decide the best formulation, we first need to introduce some important distinctions.

Supervised Versus Unsupervised Methods

Consider two similar questions we might ask about a customer population. The first is: “Do our customers naturally fall into different groups?” Here no specific purpose or target has been specified for the grouping. When there is no such target, the data mining problem is referred to as unsupervised. Contrast this with a slightly different question: “Can we find groups of customers who have particularly high likelihoods of canceling their service soon after their contracts expire?” Here there is a specific target defined: will a customer leave when her contract expires? In this case, segmentation is being done for a specific reason: to take action based on likelihood of churn. This is called a supervised data mining problem.

A note on the terms: Supervised and unsupervised learning

The terms supervised and unsupervised were inherited from the field of machine learning. Metaphorically, a teacher “supervises” the learner by carefully providing target information along with a set of examples. An unsupervised learning task might involve the same set of examples but would not include the target information. The learner would be given no information about the purpose of the learning, but would be left to form its own conclusions about what the examples have in common.

The difference between these questions is subtle but important. If a specific target can be provided, the problem can be phrased as a supervised one. Supervised tasks require different techniques than unsupervised tasks do, and the results often are much more useful. A supervised technique is given a specific purpose for the grouping—predicting the target.
Clustering, an unsupervised task, produces groupings based on similarities, but there is no guarantee that these similarities are meaningful or will be useful for any particular purpose.

Technically, another condition must be met for supervised data mining: there must be data on the target. It is not enough that the target information exist in principle; it must also exist in the data. For example, it might be useful to know whether a given customer will stay for at least six months, but if in historical data this retention information is missing or incomplete (if, say, the data are only retained for two months) the target values cannot be provided. Acquiring data on the target often is a key data science investment. The value for the target variable for an individual is often called the individual’s label, emphasizing that often (not always) one must incur expense to actively label the data.

Classification, regression, and causal modeling generally are solved with supervised methods. Similarity matching, link prediction, and data reduction could be either. Clustering, co-occurrence grouping, and profiling generally are unsupervised. The fundamental principles of data mining that we will present underlie all these types of technique.

Two main subclasses of supervised data mining, classification and regression, are distinguished by the type of target. Regression involves a numeric target while classification involves a categorical (often binary) target. Consider these similar questions we might address with supervised data mining:

“Will this customer purchase service S1 if given incentive I?” This is a classification problem because it has a binary target (the customer either purchases or does not).

“Which service package (S1, S2, or none) will a customer likely purchase if given incentive I?” This is also a classification problem, with a three-valued target.
“How much will this customer use the service?” This is a regression problem because it has a numeric target. The target variable is the amount of usage (actual or predicted) per customer.

There are subtleties among these questions that should be brought out. For business applications we often want a numerical prediction over a categorical target. In the churn example, a basic yes/no prediction of whether a customer is likely to continue to subscribe to the service may not be sufficient; we want to model the probability that the customer will continue. This is still considered classification modeling rather than regression because the underlying target is categorical. Where necessary for clarity, this is called “class probability estimation.”

Figure 2-1. Data mining versus the use of data mining results. The upper half of the figure illustrates the mining of historical data to produce a model. Importantly, the historical data have the target (“class”) value specified. The bottom half shows the result of the data mining in use, where the model is applied to new data for which we do not know the class value. The model predicts both the class value and the probability that the class variable will take on that value.

A vital part in the early stages of the data mining process is (i) to decide whether the line of attack will be supervised or unsupervised, and (ii) if supervised, to produce a precise definition of a target variable. This variable must be a specific quantity that will be the focus of the data mining (and for which we can obtain values for some example data). We will return to this in Chapter 3.
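The contrast between a hard yes/no prediction and class probability estimation can be sketched in a few lines. The segments, records, and threshold below are invented for illustration, and the “model” is just an empirical frequency per segment, not a technique the text prescribes.

```python
from collections import defaultdict

# Hypothetical historical records: (customer segment, did the customer churn?)
history = [
    ("high-usage", False), ("high-usage", False), ("high-usage", True),
    ("low-usage", True), ("low-usage", True), ("low-usage", False),
    ("low-usage", True),
]

# Tally churners and totals per segment.
counts = defaultdict(lambda: [0, 0])  # segment -> [churned, total]
for segment, churned in history:
    counts[segment][0] += churned
    counts[segment][1] += 1

def churn_probability(segment):
    """Class probability estimate: empirical P(churn | segment)."""
    churned, total = counts[segment]
    return churned / total

def churn_label(segment, threshold=0.5):
    """A hard yes/no prediction discards the probability information."""
    return churn_probability(segment) >= threshold

print(churn_probability("low-usage"))  # 0.75
print(churn_label("high-usage"))       # False: P = 1/3 falls below the threshold
```

The underlying target is still categorical (churn or not), so this remains classification modeling even though the useful output is a number.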
Data Mining and Its Results

There is another important distinction pertaining to mining data: the difference between (1) mining the data to find patterns and build models, and (2) using the results of data mining. Students often confuse these two processes when studying data science, and managers sometimes confuse them when discussing business analytics. The use of data mining results should influence and inform the data mining process itself, but the two should be kept distinct.

In our churn example, consider the deployment scenario in which the results will be used. We want to use the model to predict which of our customers will leave. Specifically, assume that data mining has created a class probability estimation model M. Given each existing customer, described using a set of characteristics, M takes these characteristics as input and produces a score or probability estimate of attrition. This is the use of the results of data mining. The data mining produces the model M from some other, often historical, data.

Figure 2-1 illustrates these two phases. Data mining produces the probability estimation model, as shown in the top half of the figure. In the use phase (bottom half), the model is applied to a new, unseen case and it generates a probability estimate for it.

The Data Mining Process

Data mining is a craft. It involves the application of a substantial amount of science and technology, but the proper application still involves art as well. But as with many mature crafts, there is a well-understood process that places a structure on the problem, allowing reasonable consistency, repeatability, and objectiveness. A useful codification of the data mining process is given by the Cross Industry Standard Process for Data Mining (CRISP-DM; Shearer, 2000), illustrated in Figure 2-2. (See also the Wikipedia page on the CRISP-DM process model.)

Figure 2-2. The CRISP data mining process.
This process diagram makes explicit the fact that iteration is the rule rather than the exception. Going through the process once without having solved the problem is, generally speaking, not a failure. Often the entire process is an exploration of the data, and after the first iteration the data science team knows much more. The next iteration can be much more well-informed. Let’s now discuss the steps in detail.

Business Understanding

Initially, it is vital to understand the problem to be solved. This may seem obvious, but business projects seldom come pre-packaged as clear and unambiguous data mining problems. Often recasting the problem and designing a solution is an iterative process of discovery. The diagram shown in Figure 2-2 represents this as cycles within a cycle, rather than as a simple linear process. The initial formulation may not be complete or optimal so multiple iterations may be necessary for an acceptable solution formulation to appear.

The Business Understanding stage represents a part of the craft where the analysts’ creativity plays a large role. Data science has some things to say, as we will describe, but often the key to a great success is a creative problem formulation by some analyst regarding how to cast the business problem as one or more data science problems. High-level knowledge of the fundamentals helps creative business analysts see novel formulations.

We have a set of powerful tools to solve particular data mining problems: the basic data mining tasks discussed in “From Business Problems to Data Mining Tasks” on page 24. Typically, the early stages of the endeavor involve designing a solution that takes advantage of these tools. This can mean structuring (engineering) the problem such that one or more subproblems involve building models for classification, regression, probability estimation, and so on.

In this first stage, the design team should think carefully about the use scenario.
This itself is one of the most important concepts of data science, to which we have devoted two entire chapters (Chapters 7 and 11). What exactly do we want to do? How exactly would we do it? What parts of this use scenario constitute possible data mining models? In discussing this in more detail, we will begin with a simplified view of the use scenario, but as we go forward we will loop back and realize that often the use scenario must be adjusted to better reflect the actual business need. We will present conceptual tools to help our thinking here; for example, framing a business problem in terms of expected value can allow us to systematically decompose it into data mining tasks.

Data Understanding

If solving the business problem is the goal, the data comprise the available raw material from which the solution will be built. It is important to understand the strengths and limitations of the data because rarely is there an exact match with the problem. Historical data often are collected for purposes unrelated to the current business problem, or for no explicit purpose at all. A customer database, a transaction database, and a marketing response database contain different information, may cover different intersecting populations, and may have varying degrees of reliability.

It is also common for the costs of data to vary. Some data will be available virtually for free while others will require effort to obtain. Some data may be purchased. Still other data simply won’t exist and will require entire ancillary projects to arrange their collection. A critical part of the data understanding phase is estimating the costs and benefits of each data source and deciding whether further investment is merited. Even after all datasets are acquired, collating them may require additional effort. For example, customer records and product identifiers are notoriously variable and noisy.
Cleaning and matching customer records to ensure only one record per customer is itself a complicated analytics problem (Hernández & Stolfo, 1995; Elmagarmid, Ipeirotis, & Verykios, 2007).

As data understanding progresses, solution paths may change direction in response, and team efforts may even fork. Fraud detection provides an illustration of this. Data mining has been used extensively for fraud detection, and many fraud detection problems involve classic supervised data mining tasks. Consider the task of catching credit card fraud. Charges show up on each customer’s account, so fraudulent charges are usually caught—if not initially by the company, then later by the customer when account activity is reviewed. We can assume that nearly all fraud is identified and reliably labeled, since the legitimate customer and the person perpetrating the fraud are different people and have opposite goals. Thus credit card transactions have reliable labels (fraud and legitimate) that may serve as targets for a supervised technique.

Now consider the related problem of catching Medicare fraud. This is a huge problem in the United States costing billions of dollars annually. Though this may seem like a conventional fraud detection problem, as we consider the relationship of the business problem to the data, we realize that the problem is significantly different. The perpetrators of fraud—medical providers who submit false claims, and sometimes their patients—are also legitimate service providers and users of the billing system. Those who commit fraud are a subset of the legitimate users; there is no separate disinterested party who will declare exactly what the “correct” charges should be. Consequently the Medicare billing data have no reliable target variable indicating fraud, and a supervised learning approach that could work for credit card fraud is not applicable.
Such a problem usually requires unsupervised approaches such as profiling, clustering, anomaly detection, and co-occurrence grouping.

The fact that both of these are fraud detection problems is a superficial similarity that is actually misleading. In data understanding we need to dig beneath the surface to uncover the structure of the business problem and the data that are available, and then match them to one or more data mining tasks for which we may have substantial science and technology to apply. It is not unusual for a business problem to contain several data mining tasks, often of different types, and combining their solutions will be necessary (see Chapter 11).

Data Preparation

The analytic technologies that we can bring to bear are powerful but they impose certain requirements on the data they use. They often require data to be in a form different from how the data are provided naturally, and some conversion will be necessary. Therefore a data preparation phase often proceeds along with data understanding, in which the data are manipulated and converted into forms that yield better results.

Typical examples of data preparation are converting data to tabular format, removing or inferring missing values, and converting data to different types. Some data mining techniques are designed for symbolic and categorical data, while others handle only numeric values. In addition, numerical values must often be normalized or scaled so that they are comparable. Standard techniques and rules of thumb are available for doing such conversions. Chapter 3 discusses the most typical format for mining data in some detail.

In general, though, this book will not focus on data preparation techniques, which could be the topic of a book by themselves (Pyle, 1999).
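One of the standard normalization rules of thumb mentioned above is min-max scaling, which maps each variable onto the same [0, 1] range so that differently scaled variables become comparable. A minimal sketch, with invented values:

```python
def min_max_scale(values):
    """Rescale numeric values to [0, 1] so differently-ranged variables compare."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # a constant variable carries no information
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical variables on very different natural scales.
incomes = [20_000, 50_000, 80_000]
ages = [20, 45, 70]
print(min_max_scale(incomes))  # [0.0, 0.5, 1.0]
print(min_max_scale(ages))     # [0.0, 0.5, 1.0]
```

After scaling, a distance or similarity computation no longer lets the variable with the larger raw range (income) dominate the one with the smaller range (age).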
We will define basic data formats in following chapters, and will only be concerned with data preparation details when they shed light on some fundamental principle of data science or are necessary to present a concrete example.

More generally, data scientists may spend considerable time early in the process defining the variables used later in the process. This is one of the main points at which human creativity, common sense, and business knowledge come into play. Often the quality of the data mining solution rests on how well the analysts structure the problems and craft the variables (and sometimes it can be surprisingly hard for them to admit it).

One very general and important concern during data preparation is to beware of “leaks” (Kaufman et al., 2012). A leak is a situation where a variable collected in historical data gives information on the target variable—information that appears in historical data but is not actually available when the decision has to be made. As an example, when predicting whether at a particular point in time a website visitor would end her session or continue surfing to another page, the variable “total number of webpages visited in the session” is predictive. However, the total number of webpages visited in the session would not be known until after the session was over (Kohavi et al., 2000)—at which point one would know the value for the target variable! As another illustrative example, consider predicting whether a customer will be a “big spender”; knowing the categories of the items purchased (or worse, the amount of tax paid) is very predictive, but is not known at decision-making time (Kohavi & Parekh, 2003). Leakage must be considered carefully during data preparation, because data preparation typically is performed after the fact—from historical data. We present a more detailed example of a real leak that was challenging to find in Chapter 14.
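The session-length leak can be demonstrated in a few lines. The records and field names below are invented; the point is that a rule built on an after-the-fact variable looks perfect on historical data while being useless at decision time.

```python
# Hypothetical session records assembled *after the fact* from historical logs.
# "pages_so_far" is known at decision time; "total_pages" only once the session ends.
sessions = [
    {"pages_so_far": 3, "total_pages": 3, "left_site": True},
    {"pages_so_far": 3, "total_pages": 7, "left_site": False},
    {"pages_so_far": 5, "total_pages": 5, "left_site": True},
    {"pages_so_far": 5, "total_pages": 9, "left_site": False},
]

def leaky_rule(s):
    # Looks perfect in the historical data: the visitor left exactly when the
    # session total equals the pages seen so far...
    return s["total_pages"] == s["pages_so_far"]

# ...but total_pages is unknowable at prediction time, so this "accuracy"
# is an illusion created by the leak.
accuracy = sum(leaky_rule(s) == s["left_site"] for s in sessions) / len(sessions)
print(accuracy)  # 1.0 -- a telltale sign of a leak, not of a good model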
Modeling

Modeling is the subject of the next several chapters and we will not dwell on it here, except to say that the output of modeling is some sort of model or pattern capturing regularities in the data.

The modeling stage is the primary place where data mining techniques are applied to the data. It is important to have some understanding of the fundamental ideas of data mining, including the sorts of techniques and algorithms that exist, because this is the part of the craft where the most science and technology can be brought to bear.

Evaluation

The purpose of the evaluation stage is to assess the data mining results rigorously and to gain confidence that they are valid and reliable before moving on. If we look hard enough at any dataset we will find patterns, but they may not survive careful scrutiny. We would like to have confidence that the models and patterns extracted from the data are true regularities and not just idiosyncrasies or sample anomalies. It is possible to deploy results immediately after data mining but this is inadvisable; it is usually far easier, cheaper, quicker, and safer to test a model first in a controlled laboratory setting.

Equally important, the evaluation stage also serves to help ensure that the model satisfies the original business goals. Recall that the primary goal of data science for business is to support decision making, and that we started the process by focusing on the business problem we would like to solve. Usually a data mining solution is only a piece of the larger solution, and it needs to be evaluated as such.
Further, even if a model passes strict evaluation tests “in the lab,” there may be external considerations that make it impractical. For example, a common flaw with detection solutions (such as fraud detection, spam detection, and intrusion monitoring) is that they produce too many false alarms. A model may be extremely accurate (> 99%) by laboratory standards, but evaluation in the actual business context may reveal that it still produces too many false alarms to be economically feasible. (How much would it cost to provide the staff to deal with all those false alarms? What would be the cost in customer dissatisfaction?)

Evaluating the results of data mining includes both quantitative and qualitative assessments. Various stakeholders have interests in the business decision-making that will be accomplished or supported by the resultant models. In many cases, these stakeholders need to “sign off” on the deployment of the models, and in order to do so need to be satisfied by the quality of the model’s decisions. What that means varies from application to application, but often stakeholders are looking to see whether the model is going to do more good than harm, and especially that the model is unlikely to make catastrophic mistakes. (For example, in one data mining project a model was created to diagnose problems in local phone networks, and to dispatch technicians to the likely site of the problem. Before deployment, a team of phone company stakeholders requested that the model be tweaked so that exceptions were made for hospitals.) To facilitate such qualitative assessment, the data scientist must think about the comprehensibility of the model to stakeholders (not just to the data scientists). And if the model itself is not comprehensible (e.g., maybe the model is a very complex mathematical formula), how can the data scientists work to make the behavior of the model comprehensible?

Finally, a comprehensive evaluation framework is important because getting detailed information on the performance of a deployed model may be difficult or impossible. Often there is only limited access to the deployment environment so making a comprehensive evaluation “in production” is difficult.
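The false-alarm arithmetic above can be made concrete. The rates below are hypothetical, chosen only to show how a rare target class turns a nominally “99% accurate” detector into one whose alarms are mostly false:

```python
def alarm_breakdown(n_transactions, fraud_rate, true_positive_rate, false_positive_rate):
    """How many alarms a detector raises, and what fraction of them are false."""
    n_fraud = n_transactions * fraud_rate
    n_legit = n_transactions - n_fraud
    true_alarms = n_fraud * true_positive_rate
    false_alarms = n_legit * false_positive_rate
    total = true_alarms + false_alarms
    return total, false_alarms / total

# Hypothetical numbers: a 1% false-positive rate ("99% accurate" on legitimate
# traffic) against a 0.1% fraud base rate.
total, false_share = alarm_breakdown(1_000_000, 0.001, 0.99, 0.01)
print(round(total))           # ~10,980 alarms per million transactions
print(round(false_share, 2))  # ~0.91: over 90% of the alarms are false
```

Laboratory accuracy alone says nothing about staffing cost or customer annoyance; the base rate of the target class drives both.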
Deployed systems typically contain many “moving parts,” and assessing the contribution of a single part is difficult. Firms with sophisticated data science teams wisely build testbed environments that mirror production data as closely as possible, in order to get the most realistic evaluations before taking the risk of deployment.

Nonetheless, in some cases we may want to extend evaluation into the development environment, for example by instrumenting a live system to be able to conduct randomized experiments. In our churn example, if we have decided from laboratory tests that a data mined model will give us better churn reduction, we may want to move on to an “in vivo” evaluation, in which a live system randomly applies the model to some customers while keeping other customers as a control group (recall our discussion of causal modeling from “Introduction: Data-Analytic Thinking”). Such experiments must be designed carefully, and the technical details are beyond the scope of this book. The interested reader could start with the lessons-learned articles by Ron Kohavi and his coauthors (Kohavi et al., 2007, 2009, 2012). We may also want to instrument deployed systems for evaluations to make sure that the world is not changing to the detriment of the model’s decision-making. For example, behavior can change—in some cases, like fraud or spam, in direct response to the deployment of models. Additionally, the output of the model is critically dependent on the input data; input data can change in format and in substance, often without any alerting of the data science team. Raeder et al. (2012) present a detailed discussion of system design to help deal with these and other related evaluation-in-deployment issues.

Deployment

In deployment the results of data mining—and increasingly the data mining techniques themselves—are put into real use in order to realize some return on investment.
The clearest cases of deployment involve implementing a predictive model in some information system or business process. In our churn example, a model for predicting the likelihood of churn could be integrated with the business process for churn management—for example, by sending special offers to customers who are predicted to be particularly at risk. (We will discuss this in increasing detail as the book proceeds.) A new fraud detection model may be built into a workforce management information system, to monitor accounts and create “cases” for fraud analysts to examine.

Increasingly, the data mining techniques themselves are deployed. For example, for targeting online advertisements, systems are deployed that automatically build (and test) models in production when a new advertising campaign is presented. Two main reasons for deploying the data mining system itself rather than the models produced by a data mining system are (i) the world may change faster than the data science team can adapt, as with fraud and intrusion detection, and (ii) a business has too many modeling tasks for their data science team to manually curate each model individually. In these cases, it may be best to deploy the data mining phase into production. In doing so, it is critical to instrument the process to alert the data science team of any seeming anomalies and to provide fail-safe operation (Raeder et al., 2012).

Deployment can also be much less “technical.” In a celebrated case, data mining discovered a set of rules that could help to quickly diagnose and fix a common error in industrial printing. The deployment succeeded simply by taping a sheet of paper containing the rules to the side of the printers (Evans & Fisher, 2002). Deployment can also be much more subtle, such as a change to data acquisition procedures, or a change to strategy, marketing, or operations resulting from insight gained from mining the data.
Deploying a model into a production system typically requires that the model be re-coded for the production environment, usually for greater speed or compatibility with an existing system. This may incur substantial expense and investment. In many cases, the data science team is responsible for producing a working prototype, along with its evaluation. These are passed to a development team.

Practically speaking, there are risks with “over the wall” transfers from data science to development. It may be helpful to remember the maxim: “Your model is not what the data scientists design, it’s what the engineers build.” From a management perspective, it is advisable to have members of the development team involved early on in the data science project. They can begin as advisors, providing critical insight to the data science team. Increasingly in practice, these particular developers are “data science engineers”—software engineers who have particular expertise both in the production systems and in data science. These developers gradually assume more responsibility as the project matures. At some point the developers will take the lead and assume ownership of the product. Generally, the data scientists should still remain involved in the project into final deployment, as advisors or as developers depending on their skills.

Regardless of whether deployment is successful, the process often returns to the Business Understanding phase. The process of mining data produces a great deal of insight into the business problem and the difficulties of its solution. A second iteration can yield an improved solution.
Just the experience of thinking about the business, the data, and the performance goals often leads to new ideas for improving business performance, and even new lines of business or new ventures.

Note that it is not necessary to fail in deployment to start the cycle again. The Evaluation stage may reveal that results are not good enough to deploy, and we need to adjust the problem definition or get different data. This is represented by the “shortcut” link from Evaluation back to Business Understanding in the process diagram. In practice, there should be shortcuts back from each stage to each prior one because the process always retains some exploratory aspects, and a project should be flexible enough to revisit prior steps based on discoveries made. (Software professionals may recognize the similarity to the philosophy of “Fail faster to succeed sooner”; Muoio, 1997.)

Implications for Managing the Data Science Team

It is tempting—but usually a mistake—to view the data mining process as a software development cycle. Indeed, data mining projects are often treated and managed as engineering projects, which is understandable when they are initiated by software departments, with data generated by a large software system and analytics results fed back into it. Managers are usually familiar with software technologies and are comfortable managing software projects. Milestones can be agreed upon and success is usually unambiguous. Software managers might look at the CRISP data mining cycle (Figure 2-2) and think it looks comfortably similar to a software development cycle, so they should be right at home managing an analytics project the same way.

This can be a mistake because data mining is an exploratory undertaking closer to research and development than it is to engineering. The CRISP cycle is based around exploration; it iterates on approaches and strategy rather than on software designs. Outcomes are far less certain, and the results of a given step may change the fundamental understanding of the problem.
Engineering a data mining solution directly for deployment can be an expensive premature commitment. Instead, analytics projects should prepare to invest in information to reduce uncertainty in various ways. Small investments can be made via pilot studies and throwaway prototypes. Data scientists should review the literature to see what else has been done and how it has worked. On a larger scale, a team can invest substantially in building experimental testbeds to allow extensive agile experimentation. If you're a software manager, this will look more like research and exploration than you're used to, and maybe more than you're comfortable with.

Software skills versus analytics skills

Although data mining involves software, it also requires skills that may not be common among programmers. In software engineering, the ability to write efficient, high-quality code from requirements may be paramount. Team members may be evaluated using software metrics such as the amount of code written or number of bug tickets closed. In analytics, it's more important for individuals to be able to formulate problems well, to prototype solutions quickly, to make reasonable assumptions in the face of ill-structured problems, to design experiments that represent good investments, and to analyze results. In building a data science team, these qualities, rather than traditional software engineering expertise, are skills that should be sought.

4. It is important to keep in mind that it is rare for the discovery to be completely automated. The important factor is that data mining automates at least partially the search and discovery process, rather than providing technical support for manual search and discovery.

Other Analytics Techniques and Technologies

Business analytics involves the application of various technologies to the analysis of data.
Many of these go beyond this book's focus on data-analytic thinking and the principles of extracting useful patterns from data. Nonetheless, it is important to be acquainted with these related techniques, to understand what their goals are, what role they play, and when it may be beneficial to consult experts in them.

To this end, we present six groups of related analytic techniques. Where appropriate we draw comparisons and contrasts with data mining. The main difference is that data mining focuses on the automated search for knowledge, patterns, or regularities from data.4 An important skill for a business analyst is to be able to recognize what sort of analytic technique is appropriate for addressing a particular problem.

Statistics

The term "statistics" has two different uses in business analytics. First, it is used as a catchall term for the computation of particular numeric values of interest from data (e.g., "We need to gather some statistics on our customers' usage to determine what's going wrong here."). These values often include sums, averages, rates, and so on. Let's call these "summary statistics." Often we want to dig deeper, and calculate summary statistics conditionally on one or more subsets of the population (e.g., "Does the churn rate differ between male and female customers?" and "What about high-income customers in the Northeast, a region of the USA?"). Summary statistics are the basic building blocks of much data science theory and practice.

Summary statistics should be chosen with close attention to the business problem to be solved (one of the fundamental principles we will present later), and also with attention to the distribution of the data they are summarizing. For example, the average (mean) income in the United States according to the 2004 Census Bureau Economic Survey was over $60,000.
If we were to use that as a measure of the average income in order to make policy decisions, we would be misleading ourselves. The distribution of incomes in the U.S. is highly skewed, with many people making relatively little and some people making fantastically much. In such cases, the arithmetic mean tells us relatively little about how much people are making. Instead, we should use a different measure of "average" income, such as the median. The median income—that amount where half the population makes more and half makes less—in the U.S. in the 2004 Census study was only $44,389, considerably less than the mean. This example may seem obvious because we are so accustomed to hearing about the "median income," but the same reasoning applies to any computation of summary statistics: have you thought about the problem you would like to solve or the question you would like to answer? Have you considered the distribution of the data, and whether the chosen statistic is appropriate?

The other use of the term "statistics" is to denote the field of study that goes by that name, which we might differentiate by using the proper name, Statistics. The field of Statistics provides us with a huge amount of knowledge that underlies analytics, and can be thought of as a component of the larger field of Data Science. For example, Statistics helps us to understand different data distributions and what statistics are appropriate to summarize each. Statistics helps us understand how to use data to test hypotheses and to estimate the uncertainty of conclusions. In relation to data mining, hypothesis testing can help determine whether an observed pattern is likely to be a valid, general regularity as opposed to a chance occurrence in some particular dataset. Most relevant to this book, many of the techniques for extracting models or patterns from data have their roots in Statistics.
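Two of the points above (computing summary statistics conditionally on subgroups, and preferring the median to the mean for skewed data) can be sketched in a few lines of Python. All numbers here are invented for illustration; they are not the Census figures quoted in the text.

```python
import statistics
from collections import defaultdict

# Hypothetical skewed income sample (dollars): a couple of very large
# values pull the mean far above what a "typical" person makes.
incomes = [28_000, 31_000, 35_000, 40_000, 44_000,
           48_000, 52_000, 60_000, 250_000, 1_200_000]

mean_income = statistics.mean(incomes)      # inflated by the outliers
median_income = statistics.median(incomes)  # half make more, half make less

# Conditional summary statistic: churn rate by gender subgroup,
# over hypothetical (gender, churned) customer records.
customers = [("M", True), ("M", False), ("M", False), ("M", True),
             ("F", False), ("F", False), ("F", True), ("F", False)]
counts = defaultdict(lambda: [0, 0])        # gender -> [churned, total]
for gender, churned in customers:
    counts[gender][0] += churned
    counts[gender][1] += 1
churn_rate = {g: c / n for g, (c, n) in counts.items()}

print(mean_income, median_income)  # 178800.0 46000.0
print(churn_rate)                  # {'M': 0.5, 'F': 0.25}
```

With this toy sample the mean is nearly four times the median, which is exactly the kind of gap that makes the mean a misleading "average" for skewed distributions.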
For example, a preliminary study may suggest that customers in the Northeast have a churn rate of 22.5%, whereas the nationwide average churn rate is only 15%. This may be just a chance fluctuation, since the churn rate is not constant; it varies over regions and over time, so differences are to be expected. But the Northeast rate is one and a half times the U.S. average, which seems unusually high. What is the chance that this is due to random variation? Statistical hypothesis testing is used to answer such questions.

Closely related is the quantification of uncertainty into confidence intervals. The overall churn rate is 15%, but there is some variation; traditional statistical analysis may reveal that 95% of the time the churn rate is expected to fall between 13% and 17%.

This contrasts with the (complementary) process of data mining, which may be seen as hypothesis generation. Can we find patterns in data in the first place? Hypothesis generation should then be followed by careful hypothesis testing (generally on different data; see Chapter 5). In addition, data mining procedures may produce numerical estimates, and we often also want to provide confidence intervals on these estimates. We will return to this when we discuss the evaluation of the results of data mining.

In this book we are not going to spend more time discussing these basic statistical concepts. There are plenty of introductory books on statistics and statistics for business, and any treatment we would try to squeeze in would be either very narrow or superficial.
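A back-of-the-envelope version of the churn example above can be done with the normal approximation for a proportion. The sample sizes below are assumed, since the text does not give them; this is a sketch of the reasoning, not the book's analysis.

```python
import math

p0 = 0.15       # nationwide churn rate (the null hypothesis)
p_ne = 0.225    # observed Northeast churn rate
n_ne = 400      # assumed number of Northeast customers observed

# z statistic for a one-sample test of a proportion:
# how many standard errors is the observed rate above the null rate?
se = math.sqrt(p0 * (1 - p0) / n_ne)
z = (p_ne - p0) / se
print(round(z, 2))  # ~4.2, far beyond the 1.96 cutoff for 5% significance

# 95% confidence interval for the overall churn rate, with an assumed n.
n_all = 5000
se_all = math.sqrt(p0 * (1 - p0) / n_all)
ci = (p0 - 1.96 * se_all, p0 + 1.96 * se_all)
print(ci)  # roughly (0.14, 0.16)
```

With these assumed sample sizes the Northeast difference would be very unlikely to be random variation; with a much smaller sample, the same 22.5% could easily be noise, which is the whole point of the test.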
That said, one statistical term that is often heard in the context of business analytics is "correlation." For example, "Are there any indicators that correlate with a customer's later defection?" As with the term "statistics," "correlation" has both a general-purpose meaning (variations in one quantity tell us something about variations in the other) and a specific technical meaning (e.g., linear correlation based on a particular mathematical formula). The notion of correlation will be the jumping-off point for the rest of our discussion of data science for business, starting in the next chapter.

Database Querying

A query is a specific request for a subset of data or for statistics about data, formulated in a technical language and posed to a database system. Many tools are available to answer one-off or repeating queries about data posed by an analyst. These tools are usually frontends to database systems, based on Structured Query Language (SQL) or a tool with a graphical user interface (GUI) to help formulate queries (e.g., query-by-example, or QBE). For example, if the analyst can define "profitable" in operational terms computable from items in the database, then a query tool could answer: "Who are the most profitable customers in the Northeast?" The analyst may then run the query to retrieve a list of the most profitable customers, possibly ranked by profitability. This activity differs fundamentally from data mining in that there is no discovery of patterns or models.

Database queries are appropriate when an analyst already has an idea of what might be an interesting subpopulation of the data, and wants to investigate this population or confirm a hypothesis about it.
For example, if an analyst suspects that middle-aged men living in the Northeast have some particularly interesting churning behavior, she could compose a SQL query:

    SELECT * FROM CUSTOMERS
    WHERE AGE > 45 AND SEX = 'M' AND DOMICILE = 'NE'

If those are the people to be targeted with an offer, a query tool can be used to retrieve all of the information about them ("*") from the CUSTOMERS table in the database.

In contrast, data mining could be used to come up with this query in the first place—as a pattern or regularity in the data. A data mining procedure might examine prior customers who did and did not defect, and determine that this segment (characterized as "AGE is greater than 45 and SEX is male and DOMICILE is Northeast-USA") is predictive with respect to churn rate. After translating this into a SQL query, a query tool could then be used to find the matching records in the database.

Query tools generally have the ability to execute sophisticated logic, including computing summary statistics over subpopulations, sorting, joining together multiple tables with related data, and more. Data scientists often become quite adept at writing queries to extract the data they need.

On-line Analytical Processing (OLAP) provides an easy-to-use GUI to query large data collections, for the purpose of facilitating data exploration. The idea of "on-line" processing is that it is done in real time, so analysts and decision makers can find answers to their queries quickly and efficiently. Unlike the "ad hoc" querying enabled by tools like SQL, for OLAP the dimensions of analysis must be preprogrammed into the OLAP system. If we've foreseen that we would want to explore sales volume by region and time, we could have these three dimensions programmed into the system, and drill down into populations, often simply by clicking and dragging and manipulating dynamic charts.
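The churn-segment query above can be run end to end with Python's built-in sqlite3 module. The table layout matches the chapter's query, but the rows are invented for illustration.

```python
import sqlite3

# Build a tiny in-memory CUSTOMERS table with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CUSTOMERS (NAME TEXT, AGE INTEGER, SEX TEXT, DOMICILE TEXT)")
conn.executemany(
    "INSERT INTO CUSTOMERS VALUES (?, ?, ?, ?)",
    [
        ("Alice", 50, "F", "NE"),
        ("Bob",   52, "M", "NE"),
        ("Carol", 38, "F", "SW"),
        ("Dave",  61, "M", "NE"),
        ("Ed",    30, "M", "NE"),
    ],
)

# The chapter's query: middle-aged men living in the Northeast.
rows = conn.execute(
    "SELECT * FROM CUSTOMERS WHERE AGE > 45 AND SEX = 'M' AND DOMICILE = 'NE'"
).fetchall()
print(rows)  # [('Bob', 52, 'M', 'NE'), ('Dave', 61, 'M', 'NE')]
```

Note that the query only retrieves a subpopulation the analyst already had in mind; nothing here discovers that this segment matters, which is the contrast with data mining drawn in the text.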
OLAP systems are designed to facilitate manual or visual exploration of the data by analysts. OLAP performs no modeling or automatic pattern finding. As an additional contrast, unlike with OLAP, data mining tools generally can incorporate new dimensions of analysis easily as part of the exploration. OLAP tools can be a useful complement to data mining tools for discovery from business data.

Data Warehousing

Data warehouses collect and coalesce data from across an enterprise, often from multiple transaction-processing systems, each with its own database. Analytical systems can access data warehouses. Data warehousing may be seen as a facilitating technology of data mining. It is not always necessary, as most data mining does not access a data warehouse, but firms that decide to invest in data warehouses often can apply data mining more broadly and more deeply in the organization. For example, if a data warehouse integrates records from sales and billing as well as from human resources, it can be used to find characteristic patterns of effective salespeople.

Regression Analysis

Some of the same methods we discuss in this book are at the core of a different set of analytic methods, which often are collected under the rubric regression analysis, and are widely applied in the field of statistics and also in other fields founded on econometric analysis. This book will focus on different issues than usually encountered in a regression analysis book or class. Here we are less interested in explaining a particular dataset than we are in extracting patterns that will generalize to other data, for the purpose of improving some business process. Typically, this will involve estimating or predicting values for cases that are not in the analyzed data set.

5. The interested reader is urged to read the discussion by Shmueli (2010).
So, as an example, in this book we are less interested in digging into the reasons for churn (important as they may be) in a particular historical set of data, and more interested in predicting which customers who have not yet left would be the best to target to reduce future churn. Therefore, we will spend some time talking about testing patterns on new data to evaluate their generality, and about techniques for reducing the tendency to find patterns that are specific to a particular set of data but do not generalize to the population from which the data come.

The topic of explanatory modeling versus predictive modeling can elicit deep-felt debate,5 which goes well beyond our focus. What is important is to realize that there is considerable overlap in the techniques used, but that the lessons learned from explanatory modeling do not all apply to predictive modeling. So a reader with some background in regression analysis may encounter new and even seemingly contradictory lessons.6

6. Those who pursue the study in depth will have the seeming contradictions worked out. Such deep study is not necessary to understand the fundamental principles.

Machine Learning and Data Mining

The collection of methods for extracting (predictive) models from data, now known as machine learning methods, were developed in several fields contemporaneously, most notably Machine Learning, Applied Statistics, and Pattern Recognition. Machine Learning as a field of study arose as a subfield of Artificial Intelligence, which was concerned with methods for improving the knowledge or performance of an intelligent agent over time, in response to the agent's experience in the world. Such improvement often involves analyzing data from the environment and making predictions about unknown quantities, and over the years this data analysis aspect of machine learning has come to play a very large role in the field.
As machine learning methods were deployed broadly, the scientific disciplines of Machine Learning, Applied Statistics, and Pattern Recognition developed close ties, and the separation between the fields has blurred.

The field of Data Mining (or KDD: Knowledge Discovery and Data Mining) started as an offshoot of Machine Learning, and they remain closely linked. Both fields are concerned with the analysis of data to find useful or informative patterns. Techniques and algorithms are shared between the two; indeed, the areas are so closely related that researchers commonly participate in both communities and transition between them seamlessly. Nevertheless, it is worth pointing out some of the differences to give perspective.

Speaking generally, because Machine Learning is concerned with many types of performance improvement, it includes subfields such as robotics and computer vision that are not part of KDD. It also is concerned with issues of agency and cognition—how will an intelligent agent use learned knowledge to reason and act in its environment?—which are not concerns of Data Mining.

Historically, KDD spun off from Machine Learning as a research field focused on concerns raised by examining real-world applications, and a decade and a half later the KDD community remains more concerned with applications than Machine Learning is. As such, research focused on commercial applications and business issues of data analysis tends to gravitate toward the KDD community rather than to Machine Learning. KDD also tends to be more concerned with the entire process of data analytics: data preparation, model learning, evaluation, and so on.

Answering Business Questions with These Techniques

To illustrate how these techniques apply to business analytics, consider a set of questions that may arise and the technologies that would be appropriate for answering them. These questions are all related but each is subtly different.
It is important to understand these differences in order to understand what technologies one needs to employ and what people may be necessary to consult.

1. Who are the most profitable customers?

If "profitable" can be defined clearly based on existing data, this is a straightforward database query. A standard query tool could be used to retrieve a set of customer records from a database. The results could be sorted by cumulative transaction amount, or some other operational indicator of profitability.

2. Is there really a difference between the profitable customers and the average customer?

This is a question about a conjecture or hypothesis (in this case, "There is a difference in value to the company between the profitable customers and the average customer"), and statistical hypothesis testing would be used to confirm or disconfirm it. Statistical analysis could also derive a probability or confidence bound that the difference was real. Typically, the result would be something like: "The value of these profitable customers is significantly different from that of the average customer, with probability < 5% that this is due to random chance."

3. But who really are these customers? Can I characterize them?

We often would like to do more than just list out the profitable customers. We would like to describe common characteristics of profitable customers. The characteristics of individual customers can be extracted from a database using techniques such as database querying, which also can be used to generate summary statistics. A deeper analysis should involve determining what characteristics differentiate profitable customers from unprofitable ones. This is the realm of data science, using data mining techniques for automated pattern finding—which we discuss in depth in the subsequent chapters.

4. Will some particular new customer be profitable? How much revenue should I expect this customer to generate?
These questions could be addressed by data mining techniques that examine historical customer records and produce predictive models of profitability. Such techniques would generate models from historical data that could then be applied to new customers to generate predictions. Again, this is the subject of the following chapters.

Note that these last two questions are subtly different data mining questions. The first, a classification question, may be phrased as a prediction of whether a given new customer will be profitable (yes/no or the probability thereof). The second may be phrased as a prediction of the value (numerical) that the customer will bring to the company. More on that as we proceed.

Summary

Data mining is a craft. As with many crafts, there is a well-defined process that can help to increase the likelihood of a successful result. This process is a crucial conceptual tool for thinking about data science projects. We will refer back to the data mining process repeatedly throughout the book, showing how each fundamental concept fits in. In turn, understanding the fundamentals of data science substantially improves the chances of success as an enterprise invokes the data mining process.

The various fields of study related to data science have developed a set of canonical task types, such as classification, regression, and clustering. Each task type serves a different purpose and has an associated set of solution techniques. A data scientist typically attacks a new project by decomposing it such that one or more of these canonical tasks is revealed, choosing a solution technique for each, then composing the solutions. Doing this expertly may take considerable experience and skill. A successful data mining project involves an intelligent compromise between what the data can do (i.e., what they can predict, and how well) and the project goals.
For this reason it is important to keep in mind how data mining results will be used, and use this to inform the data mining process itself.

Data mining differs from, and is complementary to, important supporting technologies such as statistical hypothesis testing and database querying (which have their own books and classes). Though the boundaries between data mining and related techniques are not always sharp, it is important to know about other techniques' capabilities and strengths to know when they should be used.

To a business manager, the data mining process is useful as a framework for analyzing a data mining project or proposal. The process provides a systematic organization, including a set of questions that can be asked about a project or a proposed project to help understand whether the project is well conceived or is fundamentally flawed. We will return to this after we have discussed in detail some more of the fundamental principles themselves—to which we turn now.

PART II. Bad Data Handbook

Detecting Liars and the Confused in Contradictory Online Reviews
Jacob Perkins

Did you know that people lie for their own selfish reasons? Even if this is totally obvious to you, you may be surprised at how blatant this practice has become online, to the point where some people will explain their reasons for lying immediately after doing so.
I knew unethical people would lie in online reviews in order to inflate ratings or attack competitors, but what I didn't know, and only learned by accident, is that individuals will sometimes write reviews that completely contradict their associated rating, without any regard to how it affects a business's online reputation. And often this is for businesses that an individual likes.

How did I learn this? By using ratings and reviews to create a sentiment corpus, I trained a sentiment analysis classifier that could reliably determine the sentiment of a review. While evaluating this classifier, I discovered that it could also detect discrepancies between the review sentiment and the corresponding rating, thereby finding liars and confused reviewers. Here's the whole story of how I used text classification to identify an unexpected source of bad data…

Weotta

At my company, Weotta,1 we produce applications and APIs for navigating local data in ways that people actually care about, so we can answer questions like: Is there a kid-friendly restaurant nearby? What's the nearest hip yoga studio? What concerts are happening this weekend? To do this, we analyze, aggregate, and organize local data in order to classify it along dimensions that we can use to answer these questions. This classification process enables us to know which restaurants are classy, which bars are divey, and where you should go on a first date.

Online business reviews are one of the major input signals we use to determine these classifications. Reviews can tell us the positive or negative sentiment of the reviewer, as well as what they specifically care about, such as quality of service, ambience, and value. When we aggregate reviews, we can learn what's popular about the place and why people like or dislike it.

2. http://en.wikipedia.org/wiki/Natural_language_processing
3. http://citygrid.com/
We use many other signals besides reviews, but with the proper application of natural language processing,2 reviews are a rich source of significant information.

Getting Reviews

To get reviews, we use APIs where possible, but most reviews are found using good old-fashioned web scraping. If you can use an API like CityGrid3 to get the data you need, it will make your life much easier, because while scraping isn't necessarily difficult, it can be very frustrating. Website HTML can change without notice, and only the simplest or most advanced scraping logic will remain unaffected. But the majority of web scrapers will break on even the smallest of HTML changes, forcing you to continually monitor and maintain your scrapers. This is the dirty secret of web mining: the end result might be nice and polished data, but the process is more akin to janitorial work, where every mess is unique and it never stays clean for long.

4. http://bit.ly/X9sqWR
5. http://nltk.org
6. http://en.wikipedia.org/wiki/Text_classification
7. http://www.cs.cornell.edu/people/pabo/movie-review-data/
8. http://www.cs.cornell.edu/home/llee/papers/sentiment.pdf

Once you've got reviews, you can aggregate ratings to calculate an average rating for a business. One problem is that many sources don't include ratings with their reviews. So how can you accurately calculate an average rating? We wanted to do this for our data, as well as aggregate the overall positive sentiment from all the reviews for a business, independent of any average rating. With that in mind, I figured I could
create a sentiment classifier,4 using rated reviews as a training corpus. A classifier works by taking a feature set and determining a label. For sentiment analysis, a feature set is a piece of text, like a review, and the possible labels can be pos for positive text and neg for negative text. Such a sentiment classifier could be run over a business's reviews in order to calculate an overall sentiment, and to make up for any missing rating information.

Sentiment Classification

NLTK,5 Python's Natural Language ToolKit, is a very useful programming library for doing natural language processing and text classification.6 It also comes with many corpora that you can use for training and testing. One of these is the movie_reviews corpus,7 and if you're just learning how to do sentiment classification, this is a good corpus to start with. It is organized into two directories, pos and neg. In each directory is a set of files containing movie reviews, with every review separated by a blank line. This corpus was created by Pang and Lee,8 and they used ratings that came with each review to decide whether that review belonged in pos or neg. So in a 5-star rating system, reviews of 3.5 stars and higher went into the pos directory, while reviews of 2.5 stars and lower went into the neg directory. The assumption behind this is that highly rated reviews will have positive language, and low rated reviews will have more negative language. Polarized language is ideal for text classification, because the classifier can learn much more precisely those words that indicate pos and those words that indicate neg.

Because I needed sentiment analysis for local businesses, not movies, I used a similar method to create my own sentiment training corpus for local business reviews. From a selection of businesses, I produced a corpus where the pos text came from 5 star reviews, and the neg text came from 1 star reviews. I actually started by using both 4 and 5 star reviews for pos, and 1 and 2 star reviews for neg, but after a number of training experiments, it was clear that the 2 and 4 star reviews had less polarizing language, and therefore introduced too much noise, decreasing the accuracy of the classifier.

9. http://bit.ly/QibGfE
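The rating-to-category rule described above (5 star reviews become pos training text, 1 star reviews become neg, and middling ratings are discarded as insufficiently polarized) might be sketched like this; the reviews are invented placeholders, not corpus data.

```python
# Hypothetical (stars, text) review pairs.
reviews = [
    (5, "Amazing food, friendly staff, will return!"),
    (1, "Terrible service and cold food."),
    (4, "Pretty good overall."),   # dropped: not polarized enough
    (2, "Not great."),             # dropped
    (5, "Best coffee in town."),
    (1, "Rude staff, dirty tables."),
]

# Keep only the extremes, mirroring the pos/neg directory split.
corpus = {"pos": [], "neg": []}
for stars, text in reviews:
    if stars == 5:
        corpus["pos"].append(text)
    elif stars == 1:
        corpus["neg"].append(text)

print(len(corpus["pos"]), len(corpus["neg"]))  # 2 2
```

Dropping the 2 and 4 star reviews trades corpus size for label purity, which is the experiment-driven choice the text describes.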
So my initial assumption was correct, though the implementation of it was not ideal. But because I created the training data, I had the power to change it in order to yield a more effective classifier. All training-based machine learning methods work on the general principle of "garbage in, garbage out," so if your training data is no good, do whatever you can to make it better before you start trying to get fancy with algorithms.

Polarized Language

To illustrate the power of polarized language, what follows is a table showing some of the most polarized words used in the movie_reviews corpus, along with the occurrence count in each category, and the Chi-Squared information gain, calculated using NLTK's BigramAssocMeasures.chi_sq() function in the nltk.metrics.association module.9 This Chi-Square metric is a modification of the Phi-Square measure of the association between two variables, in this case a word and a category. Given the number of times a word appears in the pos category, the total number of times it occurs in all categories, the number of words in the pos category, and the total number of words in all categories, we get an association score between a word and the pos category. Because we only have two categories, the pos and neg scores for a given word will be the same, because the word is equally significant either way, but the interpretation of that significance depends on the word's relative frequency in each category.

Word     Pos    Neg   Chi-Sq
bad      361   1034      399
worst     49    259      166
stupid    45    208      123
life    1057    529      126
boring    52    218      120
truman   152     11      108
why      317    567       99
great    751    397       76
war      275     94       71
awful     21    111       71

Some of the above words and numbers should be mostly obvious, but others not so much.

10. http://en.wikipedia.org/wiki/Naive_Bayes_classifier
11. http://en.wikipedia.org/wiki/Maximum_entropy_classifier
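The chi-squared association score just described can be computed directly from the 2x2 contingency counts, mirroring the inputs NLTK's BigramAssocMeasures.chi_sq() works from. To keep the sketch dependency-free, the formula is written out by hand, and the corpus totals below are invented, so the score will not match the table above.

```python
def chi_sq(n_ii, n_ix, n_xi, n_xx):
    """Chi-squared association between a word and a category from a
    2x2 contingency table:
      n_ii: word count inside the category
      n_ix: word count across all categories
      n_xi: total words inside the category
      n_xx: total words across all categories
    """
    n_io = n_ix - n_ii                      # word occurrences outside the category
    n_oi = n_xi - n_ii                      # other words inside the category
    n_oo = n_xx - n_ii - n_io - n_oi        # other words outside the category
    num = n_xx * (n_ii * n_oo - n_io * n_oi) ** 2
    den = n_ix * (n_xx - n_ix) * n_xi * (n_xx - n_xi)
    return num / den

# Hypothetical counts: "bad" appears 361 times in pos out of 1,395 total,
# with 700,000 words in pos out of 1,500,000 overall (corpus sizes made up).
score = chi_sq(361, 1395, 700_000, 1_500_000)
print(score)  # large: "bad" is strongly associated with one category
```

A word distributed in proportion to the category sizes would score near zero; the further its category split departs from that proportion, the larger the score, which is why polarized words dominate the table.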
For example, many people might think "war" is bad, but clearly that doesn't apply to movies. And people tend to ask "why" more often in negative reviews compared to positive reviews. Negative adjectives are also more common, or at least provide more information for classification than the positive adjectives. But there can still be category overlap with these adjectives, such as "not bad" in a positive review, or "not great" in a negative review. Let's compare this to some of the more common and less polarizing words:

Word      Pos   Neg   Chi-Sq
good     1248  1163   0.6361
crime     115    93   0.6180
romantic  137   117   0.1913
movies    635   571   0.0036
hair       57    52   0.0033
produced   67    61   0.0026
bob        96    86   0.0024
where     785   707   0.0013
face      188   169   0.0013
take      476   429   0.0003

You can see from this that "good" is one of the most overused adjectives and confers very little information. And you shouldn't name a movie character "Bob" if you want a clear and strong audience reaction. When training a text classifier, these low-information words are harmful noise, and as such should either be discarded or weighted down, depending on the classification algorithm you use. Naive Bayes10 in particular does not do well with noisy data, while a Logistic Regression classifier (also known as Maximum Entropy)11 can weigh noisy features down to the point of insignificance.

Corpus Creation

Here's a little more detail on the corpus creation: we counted each review as a single instance. The simplest way to do this is to replace all newlines in a review with a single space, thereby ensuring that each review looks like one paragraph. Then separate each review/paragraph by a blank line, so that it is easy to identify when one review ends and the next begins. Depending on the number of reviews per category, you may want to have multiple files per category, each containing some reasonable number of reviews separated by blank lines.
With multiple files, you should either have separate directories for pos and neg, like the movie_reviews corpus, or you could use easily identified filename patterns. The simplest way to do it is to copy something that already exists, so you can reuse any code that recognizes that organizational pattern. The number of reviews per file is up to you; what really matters is the number of reviews per category. Ideally you want at least 1,000 reviews in each category; I try to aim for at least 10,000, if possible. You want enough reviews to reduce the bias of any individual reviewer or item being reviewed, and to ensure that you get a good number of significant words in each category, so the classifier can learn effectively. The other thing you need to be concerned about is category balance. After producing a corpus of 1- and 5-star reviews, I had to limit the number of pos reviews significantly in order to balance the pos and neg categories, because it turns out that there are far more 5-star reviews than there are 1-star reviews. It seems that online, most businesses are above average, as you can see in this chart showing the percentage of each rating.

    Rating  Percent
    5       32%
    4       35%
    3       17%
    2       9%
    1       7%

People are clearly biased towards higher-rated reviews; there are nearly five times as many 5-star reviews as 1-star reviews. So it might make sense that a sentiment classifier should be biased the same way, and all else being equal, favor pos classifications over neg. But there’s a design problem here: if a sentiment classifier is more biased towards the pos class, it will produce more false positives. And if you plan on surfacing these positive reviews, showing them to normal people that have no insight into how a sentiment classifier works, you really don’t want to show a false positive review.

[12] https://github.com/japerk/nltk-trainer
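The balancing mentioned above (limiting pos down to the size of neg) can be sketched as a random downsample; the review lists here are hypothetical:

```python
import random

random.seed(42)  # reproducible pruning for the example

pos_reviews = ["pos review %d" % i for i in range(500)]
neg_reviews = ["neg review %d" % i for i in range(110)]

# Prune the larger pos category down to the size of neg, sampling
# randomly so no single reviewer or business dominates what survives.
pos_balanced = random.sample(pos_reviews, len(neg_reviews))
print(len(pos_balanced), len(neg_reviews))
```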
There’s a lot of cognitive dissonance when you claim that a business is highly rated and most people like it, while at the same time showing a negative review. One of the worst things you can do when designing a user interface is to show conflicting messages at the same time. So by balancing the pos and neg categories, I was able to reduce that bias and decrease false positives. This was accomplished by simply pruning the number of pos reviews until it was equal to the number of neg reviews.

Training a Classifier

Now that I had a polarized and balanced training corpus, it was trivial to train a classifier using a classifier training script from nltk-trainer.[12] nltk-trainer is an open source library of scripts I created for training and analyzing NLTK models. For text classification, the appropriate script is train_classifier.py. Just a few hours of experimentation led to a highly accurate classifier. Below is an example of how to use train_classifier.py, and the kind of stats I saw:

    nltk-trainer$ ./train_classifier.py review_sentiment --no-pickle \
        --classifier MEGAM --ngrams 1 --ngrams 2 --instances paras \
        --fraction 0.75
    loading review_sentiment
    2 labels: ['neg', 'pos']
    22500 training feats, 7500 testing feats
    [Found megam: /usr/local/bin/megam]
    training MEGAM classifier
    accuracy: 0.913465
    neg precision: 0.891415
    neg recall: 0.931725
    pos precision: 0.947058
    pos recall: 0.910265

With these arguments, I’m using the MEGAM algorithm for training a MaxentClassifier, using each review paragraph as a single instance, and looking at both single words (unigrams) and pairs of words (bigrams). The MaxentClassifier (or Logistic Regression) uses an iterative algorithm to determine weights for every feature. These weights can be positive or negative for a category, meaning that the presence of a word can imply that a feature set belongs to a category and/or that a feature set does not belong to a different category. So referring to the previous word tables, we can expect that “worst” will have a positive weight for the neg category, and a negative weight for the pos category. The MEGAM algorithm is just one of many available training algorithms, and I prefer it for its speed, memory efficiency, and slight accuracy advantage over the other available algorithms. The other options used above are --no-pickle, which means to not save the trained classifier to disk, and --fraction, which specifies how much of the corpus is used for training, with the remaining fraction used for testing. train_classifier.py has many other options, which you can see by using the --help option. These include various algorithm-specific training options, what constitutes an instance, which ngrams to use, and many more. If you’re familiar with classification algorithms, you may be wondering why I didn’t use Naive Bayes. This is because my tests showed that Naive Bayes was much less accurate than Maxent, and that even combining the two algorithms did not beat Maxent by itself. Naive Bayes does not weight its features, and therefore tends to be susceptible to noisy data, which I believe is the reason it did not perform too well in this case. But your data is probably different, and you may find opposite results when you conduct your experiments. I actually wrote the original code behind train_classifier.py for this project so that I could design and modify classifier training experiments very quickly. Instead of copy-and-paste coding and endless script modifications, I was able to simply tweak command line arguments to try out a different set of training parameters.

[13] http://en.wikipedia.org/wiki/Part-of-speech_tagging
[14] http://en.wikipedia.org/wiki/Chunking_(computational_linguistics)
I encourage you to do the same, and to perform many training experiments in order to arrive at the best possible set of options. After I’d created this script for text classification, I added training scripts for part-of-speech tagging [13] and chunking, [14] leading to the creation of the whole nltk-trainer project and its suite of training and analysis scripts. I highly recommend trying these out before attempting to create a custom NLTK-based classifier, or any NLTK model, unless you really want to know how the code works, and/or have custom feature extraction methods you want to use.

Validating the Classifier

But back to the sentiment classifier: no matter what the statistics say, over the years I’ve learned to not fully trust trained models and their input data. Unless every training instance has been hand-verified by three professional reviewers, you can assume there’s some noise and/or inaccuracy in your training data. So once I had trained what appeared to be a highly accurate sentiment classifier, I ran it over my training corpus in order to see if I could find reviews that were misclassified by the classifier. My goal was to figure out where the classifier went wrong, and perhaps get some insight into how to tweak the training parameters for better results. To my surprise, I found reviews like this in the pos/5-star category, which the classifier was classifying as neg:

    It was loud and the wine by the glass is soo expensive. Thats the only
    negative because it was good.

And in the neg/1-star category, there were pos reviews like this:

    One of the best places in New York for a romantic evening. Great food and
    service at fair prices. We loved it! The waiters were great and the food
    came quickly and it was delicious. 5 star for us!

The classifier actually turned out to be more accurate at detecting sentiment than the ratings used to create the training corpus!
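The validation loop itself is simple: run the classifier back over its own training instances and keep the disagreements. A toy sketch follows, with a stand-in keyword classifier (the real project used the trained MaxentClassifier; everything here is hypothetical illustration):

```python
import re

# Stand-in classifier: any object or function mapping text to
# 'pos'/'neg' works in this loop; the keyword list is invented.
def classify(text):
    negative_words = {"bad", "worst", "awful", "boring", "expensive"}
    words = set(re.findall(r"[a-z']+", text.lower()))
    return "neg" if words & negative_words else "pos"

# (label_derived_from_star_rating, review_text) pairs
training = [
    ("pos", "Great food and service at fair prices. We loved it!"),
    ("pos", "It was loud and the wine by the glass is soo expensive."),
    ("neg", "The worst steak I have ever had."),
]

# Keep every instance where the model disagrees with the star rating;
# these are the candidates for mislabeled (lying or confused) reviews.
suspects = [(label, text) for label, text in training
            if classify(text) != label]
print(suspects)
```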
Apparently, one of the many bizarre things people do online is write reviews that completely contradict their rating. While trying to create a sentiment classifier, I had accidentally created a way to identify both liars and the confused. Here’s some 1-star reviews by blatant liars:

    This is the best BBQ ever! I’m not just saying that to keep you fools from
    congesting my favorite places.

    Quit coming to my favorite Karaoke spot. I found it first.

While these at least have some logic behind them, I completely disagree with the motivation. By giving 1 star with their review, these reviewers are actively harming the reputation of their favorite businesses for their own selfish short-term gain. On the other side of the sentiment divide, here’s a mixed sentiment comment from a negative 5-star review:

    This place sucks, do not come here, dirty, unfriendly staff and bad workout
    equipment. MY club, do you hear me, MY, MY, MY club. STAY AWAY! One of the
    best clubs in the bay area. All jokes aside, this place is da bomb.

This kind of negative 5-star review could also be harming the business’s reputation. The first few sentences may be a joke, but those are also the sentences people are more likely to read, and this review is saying some pretty negative things, albeit jokingly. And then there’s people that really shouldn’t be writing reviews:

    I like to give A’s. I dont want to hurt anyones feelings. A- is the lowest
    I like to give. A- is the new F.

The whole point of reviews and ratings is to express your opinion, and yet this reviewer seems afraid to do just that. And here’s an actual negative opinion from a 5-star review:

    My steak was way over-cooked. The menu is very limited. Too few choices

If the above review came with a 1- or 2-star rating, that’d make sense. But a 5-star rating for a limited menu and overcooked steak? I’m not the only one who’s confused.
Finally, just to show that my sentiment classifier isn’t perfect, here’s a 5-star review that’s actually positive, but the reviewer uses a double negative, which causes the classifier to give it a negative sentiment:

    Never had a disappointing meal there.

Double negatives, negations such as “not good,” sarcasm, and other language idioms can often confuse sentiment analysis systems and are an area of ongoing research. Because of this, I believe that it’s best to exclude such reviews from any metrics about a business. If you want a clear signal, you often have to ignore small bits of contradictory information.

Designing with Data

We use the sentiment classifier in another way, too. As I mentioned earlier, we show reviews of places to provide our users with additional context and confirmation. And because we try to show only the best places, the reviews we show should reflect that. This means that every review we show needs to have a strong positive signal. And if there’s a rating included with the review, it needs to be high too, because we don’t want to show any confused high-rated reviews or duplicitous low-rated reviews. Otherwise, we’d just confuse our own users. Before adding the sentiment classifier as a critical component of our review selection method, we were simply choosing reviews based on rating. And when we didn’t have a rating, we were choosing the most recent reviews. Neither of these methods was satisfactory. As I’ve shown above, you cannot always trust ratings to accurately reflect the sentiment of a review. And for reviews without a rating, anyone could say anything, and we had no signal to use for filtering out the negative reviews. Now some might think we should be showing negative reviews to provide a balanced view of a business.

[15] http://www.slideshare.net/japerk/corpus-bootstrapping-with-nltk
But our goal in this case is not to create a research tool; there are plenty of other sites and apps that are already great for that. Our goal is to show you the best, most relevant places for your occasion. If every other signal is mostly positive, then showing negative reviews is a disservice to our users and results in a poor experience. By choosing to show only positive reviews, the data, design, and user experience are all congruent, helping our users choose from the best options available based on their own preferences, without having to do any mental filtering of negative opinions.

Lessons Learned

One important lesson for machine learning and statistical natural language processing enthusiasts: it’s very important to train your own models on your own data. If I had used classifiers trained on the standard movie_reviews corpus, I would never have gotten these results. Movie reviews are simply different than local business reviews. In fact, it might be the case that you’d get even better results by segmenting businesses by type, and creating classifiers for each type of business. I haven’t run this experiment yet, but it might lead to interesting research. The point is, your models should be trained on the same kind of data they need to analyze if you want high-accuracy results. And when it comes to text classification and sentiment analysis in particular, the domain really matters. That requires creating a custom corpus [15] and spending at least a few hours on experiments and research to really learn about your data in order to produce good models. You must then take a critical look at your training data, and validate your trained models against it. This is the only way to know what your model is actually learning, and if your training data is any good.
If I hadn’t done any model validation, I would never have discovered these bad reviews, nor realized that my sentiment classifier could detect inconsistent opinions and outright lying. In a sense, these bad reviews are a form of noise that has been maliciously injected into the data. So ask yourself: what forms of bad data might be lurking in your data stream?

Summary

The process I went through can be summarized as:

1. Get relevant data.
2. Create a custom training corpus.
3. Train a model.
4. Validate that model against the training corpus.
5. Discover something interesting.

At steps 3-5, you may find that your training corpus is not good enough. It could mean you need to get more relevant data, or that the data you have is too noisy. In my case, I found that 2- and 4-star reviews were not polarizing enough, and that there was an imbalance between the number of 5-star reviews and the number of 1-star reviews. It’s also possible that your expectations for machine learning are too high, and you need to simplify the problem. Natural language processing and machine learning are imperfect methods that rely on statistical pattern matching. You cannot expect 100% accuracy, and the noisier the data is, the more likely you are to have lower accuracy. This is why you should always aim for more distinct categories, polarizing language, and simple classification decisions.

Resources

All of my examples have used NLTK, Python’s Natural Language Toolkit, which you can find at http://nltk.org/. I also train all my models using the scripts I created in nltk-trainer at https://github.com/japerk/nltk-trainer. To learn how to do text classification and sentiment analysis with NLTK yourself, I wrote a series of posts on my blog, starting with http://bit.ly/X9sqWR.
And for those who want to go beyond basic text classification, take a look at scikit-learn, which implements all the latest and greatest machine learning algorithms in Python: http://scikit-learn.org/stable/. For Java people, there is Apache’s OpenNLP project at http://opennlp.apache.org/, and a commercial library called LingPipe, available at http://alias-i.com/lingpipe/.

Would You Like to Read More?

Visit our website to purchase the full version of Bad Data Handbook.

PART III
Mining the Social Web

Twitter: The Tweet, the Whole Tweet, and Nothing but the Tweet

    Tweet and RT were sitting on a fence. Tweet fell off. Who was left?

In this chapter, we’ll largely use CouchDB’s map/reduce capabilities to exploit the entities in tweets (@mentions, #hashtags, etc.) to try to answer the question, “What’s everyone talking about?” With overall throughput now far exceeding 50 million tweets per day and occasional peak velocities in excess of 3,000 tweets per second, there’s vast potential in mining tweet content, and this is the chapter where we’ll finally dig in. Whereas the previous chapter primarily focused on the social graph linkages that exist among friends and followers, this chapter focuses on learning as much as possible about Twitterers by inspecting the entities that appear in their tweets. You’ll also see ties back to Redis for accessing user data you have harvested previously, and to NetworkX for graph analytics. So many tweets, so little time to mine them. Let’s get started!

It is highly recommended that you read Chapters 3 and 4 before reading this chapter. Much of its discussion builds upon the foundation those chapters established, including Redis and CouchDB, which are again used in this chapter.

Pen : Sword :: Tweet : Machine Gun (?!?)

If the pen is mightier than the sword, what does that say about the tweet?
There are a number of interesting incidents in which Twitter has saved lives, one of the most notorious being James Karl Buck’s famous “Arrested” tweet that led to his speedy release when he was detained by Egyptian authorities. It doesn’t take too much work to find evidence of similar incidents, as well as countless uses of Twitter for noble fundraising efforts and other benevolent causes. Having an outlet really can make a huge difference sometimes. More often than not, though, your home time line (tweet stream) and the public time line are filled with information that’s not quite so dramatic or intriguing. At times like these, cutting out some of the cruft can help you glimpse the big picture. Given that as many as 50 percent of all tweets contain at least one entity that has been intentionally crafted by the tweet author, they make a very logical starting point for tweet analysis. In fact, Twitter has recognized their value and begun to directly expose them in the time line API calls; starting in early 2010 and as the year unfolded, they increasingly became standard throughout the entire Twitter API. Consider the tweet in Example 4-1, retrieved from a time line API call with the opt-in include_entities=true parameter specified in the query.

Example 4-1. A sample tweet from a search API that illustrates tweet entities

    {
        "created_at" : "Thu Jun 24 14:21:11 +0000 2010",
        "entities" : {
            "hashtags" : [
                { "indices" : [ 97, 103 ], "text" : "gov20" },
                { "indices" : [ 104, 112 ], "text" : "opengov" }
            ],
            "urls" : [
                { "expanded_url" : null, "indices" : [ 76, 96 ],
                  "url" : "http://bit.ly/9o4uoG" }
            ],
            "user_mentions" : [
                { "id" : 28165790, "indices" : [ 16, 28 ],
                  "name" : "crowdFlower", "screen_name" : "crowdFlower" }
            ]
        },
        "id" : 16932571217,
        "text" : "Great idea from @crowdflower: Crowdsourcing the Goldman ... #opengov",
        "user" : {
            "description" : "Founder and CEO, O'Reilly Media. Watching the alpha ...",
            "id" : 2384071,
            "location" : "Sebastopol, CA",
            "name" : "Tim O'Reilly",
            "screen_name" : "timoreilly",
            "url" : "http://radar.oreilly.com",
        }
    }

72 | Twitter: The Tweet, the Whole Tweet, and Nothing but the Tweet

[1] The twitter-text-py module is a port of the twitter-text-rb module (both available via GitHub), which Twitter uses in production.

By default, a tweet specifies a lot of useful information about its author via the user field in the status object, but the tweet entities provide insight into the content of the tweet itself. By briefly inspecting this one sample tweet, we can safely infer that @timoreilly is probably interested in the transformational topics of open government and Government 2.0, as indicated by the hashtags included in the tweet. It’s probably also safe to infer that @crowdflower has some relation to Government 2.0 and that the URL may point to such related content. Thus, if you wanted to discover some additional information about the author of this tweet in an automated fashion, you could consider pivoting from @timoreilly over to @crowdflower and exploring that user’s tweets or profile information, spawning a search on the hashtags included in the tweet to see what kind of other information pops up, or following the link and doing some page scraping to learn more about the underlying context of the tweet. Given that there’s so much value to be gained from analyzing tweet entities, you’ll sorely miss them in some APIs or from historical archives of Twitter data that are becoming more and more common to mine. Instead of manually parsing them out of the text yourself (not such an easy thing to do when tweets contain arbitrary Unicode characters), however, just easy_install twitter-text-py [1] so that you can focus your efforts on far more interesting problems. The script in Example 4-2 illustrates some basic usage of its Extractor class, which produces a structure similar to the one exposed by the time line APIs.
You have everything to gain and nothing to lose by automatically embedding entities in this manner until tweet entities become the default. As of December 2010, tweet entities were becoming more and more common through the APIs, but were not quite officially “blessed” and the norm. This chapter was written with the assumption that you’d want to know how to parse them out for yourself, but you should realize that keeping up with the latest happenings with the Twitter API might save you some work. Manual extraction of tweet entities might also be very helpful for situations in which you’re mining historical archives from organizations such as Infochimps or GNIP.

Example 4-2. Extracting tweet entities with a little help from the twitter_text package (the_tweet__extract_tweet_entities.py)

    # -*- coding: utf-8 -*-

    import sys
    import json
    import twitter_text
    import twitter
    from twitter__login import login

    # Get a tweet id by clicking on a status right off of twitter.com.
    # For example, http://twitter.com/#!/timoreilly/status/17386521699024896
    TWEET_ID = sys.argv[1]

    # You may need to set up your OAuth settings in twitter__login.py
    t = login()

    def getEntities(tweet):

        # Now extract various entities from it and build up a familiar structure
        extractor = twitter_text.Extractor(tweet['text'])

        # Note that the production Twitter API contains a few additional fields in
        # the entities hash that would require additional API calls to resolve
        entities = {}

        entities['user_mentions'] = []
        for um in extractor.extract_mentioned_screen_names_with_indices():
            entities['user_mentions'].append(um)

        entities['hashtags'] = []
        for ht in extractor.extract_hashtags_with_indices():

            # Massage field name to match production twitter api
            ht['text'] = ht['hashtag']
            del ht['hashtag']
            entities['hashtags'].append(ht)

        entities['urls'] = []
        for url in extractor.extract_urls_with_indices():
            entities['urls'].append(url)

        return entities

    # Fetch a tweet using an API method of your choice and mix in the entities
    tweet = t.statuses.show(id=TWEET_ID)
    tweet['entities'] = getEntities(tweet)
    print json.dumps(tweet, indent=4)

Now, equipped with an overview of tweet entities and some of the interesting possibilities, let’s get to work harvesting and analyzing some tweets.

Analyzing Tweets (One Entity at a Time)

CouchDB makes a great storage medium for collecting tweets because, just like the email messages we looked at previously, they are conveniently represented as JSON-based documents and lend themselves to map/reduce analysis with very little effort. Our next example script harvests tweets from time lines, is relatively robust, and should be easy to understand because all of the modules and much of the code has already been introduced in earlier chapters. One subtle consideration in reviewing it is that it uses a simple map/reduce job to compute the maximum ID value for a tweet and passes this in as a query constraint so as to avoid pulling duplicate data from Twitter’s API. See the information associated with the since_id parameter of the time line APIs for more details. It may also be informative to note that the maximum number of most recent tweets available from the user time line is around 3,200, while the home time line [2] returns around 800 statuses; thus, it’s not very expensive (in terms of counting toward your rate limit) to pull all of the data that’s available.

[2] The Twitter API documentation states that the friend time line is similar to the home time line, except that it does not contain retweets, for backward-compatibility purposes.
[3] See the May 2010 cover of Inc. magazine.
Perhaps not so intuitive when first interacting with the time line APIs is the fact that requests for data on the public time line only return 20 tweets, and those tweets are updated only every 60 seconds. To collect larger amounts of data you need to use the streaming API. For example, if you wanted to learn a little more about Tim O’Reilly, “Silicon Valley’s favorite smart guy,” [3] you’d make sure that CouchDB is running and then invoke the script shown in Example 4-3, as follows:

    $ python the_tweet__harvest_timeline.py user 16 timoreilly

It’ll only take a few moments while approximately 3,200 tweets’ worth of interesting tidbits collect for your analytic pleasure.

Example 4-3. Harvesting tweets from a user or public time line (the_tweet__harvest_timeline.py)

    # -*- coding: utf-8 -*-

    import sys
    import time
    import twitter
    import couchdb
    from couchdb.design import ViewDefinition
    from twitter__login import login
    from twitter__util import makeTwitterRequest

    def usage():
        print 'Usage: $ %s timeline_name [max_pages] [user]' % (sys.argv[0], )
        print
        print '\ttimeline_name in [public, home, user]'
        print '\t0 < max_pages <= 16 for timeline_name in [home, user]'
        print '\tmax_pages == 1 for timeline_name == public'
        print 'Notes:'
        print '\t* ~800 statuses are available from the home timeline.'
        print '\t* ~3200 statuses are available from the user timeline.'
        print '\t* The public timeline updates once every 60 secs and returns 20 statuses.'
        print '\t* See the streaming/search API for additional options to harvest tweets.'
        exit()

    if len(sys.argv) < 2 or sys.argv[1] not in ('public', 'home', 'user'):
        usage()
    if len(sys.argv) > 2 and not sys.argv[2].isdigit():
        usage()
    if len(sys.argv) > 3 and sys.argv[1] != 'user':
        usage()

    TIMELINE_NAME = sys.argv[1]
    MAX_PAGES = int(sys.argv[2])
    USER = None

    KW = {  # For the Twitter API call
        'count': 200,
        'skip_users': 'true',
        'include_entities': 'true',
        'since_id': 1,
    }

    if TIMELINE_NAME == 'user':
        USER = sys.argv[3]
        KW['id'] = USER  # id or screen name
    if TIMELINE_NAME == 'home' and MAX_PAGES > 4:
        MAX_PAGES = 4
    if TIMELINE_NAME == 'user' and MAX_PAGES > 16:
        MAX_PAGES = 16
    if TIMELINE_NAME == 'public':
        MAX_PAGES = 1

    t = login()

    # Establish a connection to a CouchDB database
    server = couchdb.Server('http://localhost:5984')
    DB = 'tweets-%s-timeline' % (TIMELINE_NAME, )

    if USER:
        DB = '%s-%s' % (DB, USER)

    try:
        db = server.create(DB)
    except couchdb.http.PreconditionFailed, e:

        # Already exists, so append to it, keeping in mind that duplicates could occur
        db = server[DB]

        # Try to avoid appending duplicate data into the system by only retrieving tweets
        # newer than the ones already in the system. A trivial mapper/reducer combination
        # allows us to pull out the max tweet id which guards against duplicates for the
        # home and user timelines. It has no effect for the public timeline

        def idMapper(doc):
            yield (None, doc['id'])

        def maxFindingReducer(keys, values, rereduce):
            return max(values)

        view = ViewDefinition('index', 'max_tweet_id', idMapper, maxFindingReducer,
                              language='python')
        view.sync(db)

        KW['since_id'] = int([_id for _id in db.view('index/max_tweet_id')][0].value)

    # Harvest tweets for the given timeline.
    # For friend and home timelines, the unofficial limitation is about 800 statuses although
    # other documentation may state otherwise. The public timeline only returns 20 statuses
    # and gets updated every 60 seconds.
    # See http://groups.google.com/group/twitter-development-talk/browse_thread/
    # thread/4678df70c301be43
    # Note that the count and since_id params have no effect for the public timeline

    page_num = 1
    while page_num <= MAX_PAGES:
        KW['page'] = page_num
        api_call = getattr(t.statuses, TIMELINE_NAME + '_timeline')
        tweets = makeTwitterRequest(t, api_call, **KW)
        db.update(tweets, all_or_nothing=True)
        print 'Fetched %i tweets' % len(tweets)
        page_num += 1

Given some basic infrastructure for collecting tweets, let’s start hacking to see what useful information we can discover.

Tapping (Tim’s) Tweets

This section investigates a few of the most common questions that come to mind from a simple rack and stack of entities mined out of Tim O’Reilly’s user time line. Although Tim has graciously agreed to being put under the microscope for educational purposes, you could easily repurpose these scripts for your own tweets or apply them to any other fascinating Twitterer. Or, you could begin with a reasonably large amount of public time line data that you’ve collected as your initial basis of exploration. A few interesting questions we’ll consider include:

• How many of the user entities that appear most frequently in Tim’s tweets are also his friends?
• What are the most frequently occurring entities that appear in Tim’s tweets?
• Who does Tim retweet the most often?
• How many of Tim’s tweets get retweeted?
• How many of Tim’s tweets contain at least one entity?

Like so many other situations involving a relatively unknown data source, one of the first things you can do to learn more about it is to count things in it. In the case of tweet data, counting the user mentions, hashtags, and URLs are great places to start. The next section gets the ball rolling by leveraging some basic map/reduce functionality to count tweet entities.
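Outside CouchDB, the same rack-and-stack counting can be sketched in a few lines with an in-memory Counter; the tweets below are hypothetical stand-ins shaped like include_entities=true API responses:

```python
from collections import Counter

# Hypothetical harvested tweets already carrying an 'entities' field.
tweets = [
    {"entities": {"user_mentions": [{"screen_name": "crowdFlower"}],
                  "hashtags": [{"text": "gov20"}, {"text": "opengov"}],
                  "urls": [{"url": "http://bit.ly/9o4uoG"}]}},
    {"entities": {"user_mentions": [],
                  "hashtags": [{"text": "gov20"}],
                  "urls": []}},
]

# Count @mentions, #hashtags, and URLs across all tweets.
counts = Counter()
for tweet in tweets:
    e = tweet["entities"]
    counts.update("@" + um["screen_name"].lower() for um in e["user_mentions"])
    counts.update("#" + ht["text"] for ht in e["hashtags"])
    counts.update(url["url"] for url in e["urls"])

print(counts.most_common(3))
```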
Although tweet entities already exist in timeline data, some of the code examples in the rest of this chapter assume that you might have gotten the data from elsewhere (search APIs, streaming APIs, etc.). Thus, we parse out the tweet entities to maintain good and consistent form. Relying on Twitter to extract your tweet entities for you is just a couple of code tweaks away.

What entities are in Tim's tweets?

You can use CouchDB's Futon to skim Tim's user timeline data one tweet at a time by browsing http://localhost:5984/_utils, but you'll quickly find yourself unsatisfied, and you'll want to get an overall summary of what Tim has been tweeting about, rather than a huge list of mentions of lots of specific things. Some preliminary answers are just a map/reduce job's worth of effort away and can be easily computed with the script shown in Example 4-4.

Example 4-4. Extracting entities from tweets and performing simple frequency analysis (the_tweet__count_entities_in_tweets.py)

# -*- coding: utf-8 -*-

import sys
import couchdb
from couchdb.design import ViewDefinition
from prettytable import PrettyTable

DB = sys.argv[1]

server = couchdb.Server('http://localhost:5984')
db = server[DB]

if len(sys.argv) > 2 and sys.argv[2].isdigit():
    FREQ_THRESHOLD = int(sys.argv[2])
else:
    FREQ_THRESHOLD = 3

# Map entities in tweets to the docs that they appear in
def entityCountMapper(doc):
    if not doc.get('entities'):
        import twitter_text

        def getEntities(tweet):

            # Now extract various entities from it and build up a familiar structure
            extractor = twitter_text.Extractor(tweet['text'])

            # Note that the production Twitter API contains a few additional fields in
            # the entities hash that would require additional API calls to resolve
            entities = {}

            entities['user_mentions'] = []
            for um in extractor.extract_mentioned_screen_names_with_indices():
                entities['user_mentions'].append(um)

            entities['hashtags'] = []
            for ht in extractor.extract_hashtags_with_indices():

                # Massage field name to match production twitter api
                ht['text'] = ht['hashtag']
                del ht['hashtag']
                entities['hashtags'].append(ht)

            entities['urls'] = []
            for url in extractor.extract_urls_with_indices():
                entities['urls'].append(url)

            return entities

        doc['entities'] = getEntities(doc)

    if doc['entities'].get('user_mentions'):
        for user_mention in doc['entities']['user_mentions']:
            yield ('@' + user_mention['screen_name'].lower(), [doc['_id'], doc['id']])
    if doc['entities'].get('hashtags'):
        for hashtag in doc['entities']['hashtags']:
            yield ('#' + hashtag['text'], [doc['_id'], doc['id']])
    if doc['entities'].get('urls'):
        for url in doc['entities']['urls']:
            yield (url['url'], [doc['_id'], doc['id']])

def summingReducer(keys, values, rereduce):
    if rereduce:
        return sum(values)
    else:
        return len(values)

view = ViewDefinition('index', 'entity_count_by_doc', entityCountMapper,
                      reduce_fun=summingReducer, language='python')
view.sync(db)

# Print out a nicely formatted table. Sorting by value in the client is cheap and easy
# if you're dealing with hundreds or low thousands of tweets
entities_freqs = sorted([(row.key, row.value) for row in
                        db.view('index/entity_count_by_doc', group=True)],
                        key=lambda x: x[1], reverse=True)

fields = ['Entity', 'Count']
pt = PrettyTable(fields=fields)
[pt.set_field_align(f, 'l') for f in fields]

for (entity, freq) in entities_freqs:
    if freq > FREQ_THRESHOLD:
        pt.add_row([entity, freq])

pt.printt()

Note that while it could have been possible to build a less useful index to compute frequency counts without using the rereduce parameter, constructing a more useful index affords the opportunity to use rereduce, an important consideration for any nontrivial map/reduce job.
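The two-phase behavior of summingReducer can be simulated outside of CouchDB entirely. The following standalone sketch re-declares the function and walks one key through an initial reduction and a rereduction:

```python
def summingReducer(keys, values, rereduce):
    # First pass: count grouped mapper output; rereduce pass: sum partial counts
    if rereduce:
        return sum(values)
    else:
        return len(values)

# First pass: values for a key arrive grouped; the reducer just counts them.
first = [
    summingReducer(['@foo', '@foo'], [['x1', 'x2'], ['x3', 'x4']], False),
    summingReducer(['@foo'], [['x7', 'x8']], False),
]

# Rereduce pass: the partial counts for @foo are summed into a final tally.
total_foo = summingReducer(None, first, True)
print(total_foo)
```

The sidebar that follows walks through exactly this flow with CouchDB's B-tree semantics in mind.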
A Note on rereduce

Example 4-4 is the first listing we've looked at that explicitly makes use of the rereduce parameter, so it may be useful to explain exactly what's going on there. Keep in mind that in addition to reducing the output for some number of mappers (which has necessarily been grouped by key), it's also quite possible that the reducer may be passed the output of some number of reducers to rereduce. In common functions such as counting things and computing sums, it may not make a difference where the input has come from, so long as a commutative operation such as addition keeps getting repeatedly applied. In some cases, however, the difference between an initial reduction and a rereduction does matter. This is where the rereduce parameter becomes useful.

For the summingReducer function shown in Example 4-4, consider the following sample outputs from the mapper (the actual contents of the value portion of each tuple are of no importance):

[
    ["@foo", [x1, x2]],
    ["@foo", [x3, x4]],
    ["@bar", [x5, x6]],
    ["@foo", [x7, x8]],
    ["@bar", [x9, x10]]
]

For simplicity, let's suppose that each node in the underlying B-tree that stores these tuples is only capable of housing two items at a time (the actual value as implemented for CouchDB is well into the thousands). The reducer would conceptually operate on the following input during the first pass, when rereduce is false.
Again, recall that the values are grouped by key:

# len( [ [x1, x2], [x3, x4] ] ) == 2
summingReducer(["@foo", "@foo"], [ [x1, x2], [x3, x4] ], False)

# len( [ [x7, x8] ] ) == 1
summingReducer(["@foo"], [ [x7, x8] ], False)

# len( [ [x5, x6], [x9, x10] ] ) == 2
summingReducer(["@bar", "@bar"], [ [x5, x6], [x9, x10] ], False)

During the next pass through the reducer, when rereduce is True, because the reducer is operating on already reduced output that has necessarily already been grouped by key, no comparison related to key values is necessary. Given that the previous len operation effectively did nothing more than count the number of occurrences of each key (a tweet entity), additional passes through the reducer now need to sum these values to compute a final tally, as is illustrated by the rereduce phase:

# values from previously reducing @foo keys: sum([2, 1]) == 3
summingReducer(None, [ 2, 1 ], True)

# values from previously reducing @bar keys: sum([2]) == 2
summingReducer(None, [ 2 ], True)

The big picture is that len was first used to count the number of times an entity appears, as an intermediate step; then, in a final rereduction step, the sum function tallied those intermediate results. In short, the script uses a mapper to emit a tuple of the form (entity, [couchdb_id, tweet_id]) for each document and then uses a reducer to count the number of times each distinct entity appears. Given that you're probably working with a relatively small collection of items and that you've been introduced to some other mechanisms for sorting previously, you then simply sort the data on the client side and apply a frequency threshold. Example output with a threshold of 15 is shown in Table 4-1 but also displayed as a chart in Figure 4-1 so that you have a feel for the underlying distribution. Table 4-1.
Entities sorted by frequency from harvested tweets by @timoreilly

Entity           Frequency
#gov20           140
@OReillyMedia    124
#Ebook           89
@timoreilly      77
#ebooks          55
@slashdot        45
@jamesoreilly    41
#w2e             40
@gnat            38
@n2vip           37
@monkchips       33
#w2s             31
@pahlkadot       30
@dalepd          28
#g2e             27
#ebook           25
@ahier           24
#where20         22
@digiphile       21
@fredwilson      20
@brady           19
@mikeloukides    19
#pdf10           19
@nytimes         18
#fooeast         18
@andrewsavikas   17
@CodeforAmerica  16
@make            16
@pkedrosky       16
@carlmalamud     15
#make            15
#opengov         15

Figure 4-1. The frequency of entities that have been retweeted by @timoreilly for a sample of recent tweets

So, what's on Tim's mind these days? It's no surprise that a few of his favorite topics appear in the list—for example, #gov20 with a whopping 140 mentions, dwarfing everything else—but what may be even more intriguing is to consider some of the less obvious user mentions, which likely indicate close relationships. In fact, it may be fair to assume that Tim finds the users he mentions often interesting. You might even go as far as to infer that he is influenced by or even trusts these other users. (Many an interesting algorithm could be devised to try to determine these types of relationships, with variable degrees of certainty.) Although computing raw counts as we've done in this section is interesting, applying time-based filtering is another tempting possibility. For example, we might gain useful insight by applying a time-based filter to the map/reduce code in Example 4-4 so that we can calculate what Tim has been talking about in the past N days as opposed to over his past ~3,200 tweets, which span a nontrivial period of time. (Some Twitterers haven't even come close to tweeting 3,200 times over the course of several years.)
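The time-based filtering idea can be sketched with plain Python: parse each tweet's created_at field and skip anything older than a cutoff before counting. The timestamp format below follows the style Twitter's v1-era API used, and the sample data is made up:

```python
import datetime

def parse_created_at(s):
    # e.g. "Mon Dec 06 14:10:01 +0000 2010" -- drop the UTC offset before parsing
    return datetime.datetime.strptime(s.replace('+0000 ', ''),
                                      '%a %b %d %H:%M:%S %Y')

# Hypothetical sample tweets
tweets = [
    {'text': 'older tweet', 'created_at': 'Mon Dec 06 14:10:01 +0000 2010'},
    {'text': 'newer tweet', 'created_at': 'Wed Dec 29 09:00:00 +0000 2010'},
]

# Keep only tweets at or after the cutoff date
cutoff = datetime.datetime(2010, 12, 15)
recent = [t['text'] for t in tweets
          if parse_created_at(t['created_at']) >= cutoff]
print(recent)
```

The same guard clause could be added at the top of the entityCountMapper in Example 4-4 so that the view only indexes sufficiently recent tweets.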
It also wouldn't be very difficult to plot out tweets on the SIMILE Timeline that was introduced previously and browse them to get a quick gist of what Tim has been tweeting about most recently.

Do frequently appearing user entities imply friendship?

The previous chapter provided a listing that demonstrated how to harvest a Twitterer's friends and followers, and used Redis to store the results. Assuming you've already fetched Tim's friends and followers with that listing, the results are readily available to you in Redis, and it's a fairly simple matter to compute how many of the N most frequently tweeted user entities are also Tim's friends. Example 4-5 illustrates the process of using information we already have available in Redis to resolve screen names from user IDs; Redis's in-memory set operations are used to compute which of the most frequently appearing user entities in Tim's user timeline are also his friends.

Example 4-5. Finding @mention tweet entities that are also friends (the_tweet__how_many_user_entities_are_friends.py)

# -*- coding: utf-8 -*-

import json
import redis
import couchdb
import sys
from twitter__util import getRedisIdByScreenName
from twitter__util import getRedisIdByUserId

SCREEN_NAME = sys.argv[1]
THRESHOLD = int(sys.argv[2])

# Connect using default settings for localhost
r = redis.Redis()

# Compute screen_names for friends
friend_ids = r.smembers(getRedisIdByScreenName(SCREEN_NAME, 'friend_ids'))
friend_screen_names = []
for friend_id in friend_ids:
    try:
        friend_screen_names.append(json.loads(r.get(getRedisIdByUserId(friend_id,
            'info.json')))['screen_name'].lower())
    except TypeError, e:
        continue  # not locally available in Redis - look it up or skip it

# Pull the list of (entity, frequency) tuples from CouchDB
server = couchdb.Server('http://localhost:5984')
db = server['tweets-user-timeline-' + SCREEN_NAME]
entities_freqs = sorted([(row.key, row.value) for row in
                        db.view('index/entity_count_by_doc', group=True)],
                        key=lambda x: x[1])

# Keep only user entities with sufficient frequencies
user_entities = [(ef[0])[1:] for ef in entities_freqs
                 if ef[0][0] == '@' and ef[1] >= THRESHOLD]

# Do a set comparison
entities_who_are_friends = \
    set(user_entities).intersection(set(friend_screen_names))
entities_who_are_not_friends = \
    set(user_entities).difference(entities_who_are_friends)

print 'Number of user entities in tweets: %s' % (len(user_entities), )
print 'Number of user entities in tweets who are friends: %s' \
    % (len(entities_who_are_friends), )
for e in entities_who_are_friends:
    print '\t' + e
print 'Number of user entities in tweets who are not friends: %s' \
    % (len(entities_who_are_not_friends), )
for e in entities_who_are_not_friends:
    print '\t' + e

The output with a frequency threshold of 15 (shown in Example 4-6) is predictable, yet it brings to light a couple of observations.

Example 4-6. Sample output from Example 4-5 displaying @mention tweet entities that are also friends of @timoreilly

Number of user entities in tweets: 20
Number of user entities in tweets who are friends: 18
    ahier
    pkedrosky
    CodeforAmerica
    nytimes
    brady
    carlmalamud
    pahlkadot
    make
    jamesoreilly
    andrewsavikas
    gnat
    slashdot
    OReillyMedia
    dalepd
    mikeloukides
    monkchips
    fredwilson
    digiphile
Number of user entities in tweets who are not friends: 2
    n2vip
    timoreilly

All in all, there were 20 user entities who exceeded a frequency threshold of 15, and 18 of those turned out to be friends. Given that most of the people who appear in his tweets are also his friends, it's probably safe to say that there's a strong trust relationship of some kind between Tim and these individuals. Take a moment to compare this list to the results from our exercises previously.
What might be just as interesting, however, is noting that Tim himself appears as one of his most frequently tweeted-about entities, as does one other individual, @n2vip. Looking more closely at the context of the tweets involving @n2vip could be useful. Any theories on how so many user mentions could be in someone's tweet stream without that user being a friend? Let's find out.

From the work you've done previously, you already know how quick and easy it can be to apply Lucene's full-text indexing capabilities to a CouchDB database, and a minimal adaptation and extension is all that it takes to quickly home in on tweets mentioning @n2vip. Example 4-7 demonstrates how it's done.

Example 4-7. Using couchdb-lucene to query tweet data (the_tweet__couchdb_lucene.py)

# -*- coding: utf-8 -*-

import sys
import httplib
from urllib import quote
import json
import couchdb

DB = sys.argv[1]
QUERY = sys.argv[2]

# The body of a JavaScript-based design document we'll create
dd = \
    {'fulltext': {'by_text': {'index': '''function(doc) {
        var ret=new Document();
        ret.add(doc.text);
        return ret
    }'''}}}

try:
    server = couchdb.Server('http://localhost:5984')
    db = server[DB]
except couchdb.http.ResourceNotFound, e:
    print """CouchDB database '%s' not found. Please check that the
database exists and try again.""" % DB
    sys.exit(1)

try:
    conn = httplib.HTTPConnection('localhost', 5984)
    conn.request('GET', '/%s/_design/lucene' % (DB, ))
    response = conn.getresponse()
finally:
    conn.close()

# If the design document did not exist, create one that'll be
# identified as "_design/lucene". The equivalent of the following
# in a terminal:
# $ curl -X PUT http://localhost:5984/DB/_design/lucene -d @dd.json
if response.status == 404:
    try:
        conn = httplib.HTTPConnection('localhost', 5984)
        conn.request('PUT', '/%s/_design/lucene' % (DB, ), json.dumps(dd))
        response = conn.getresponse()
        if response.status != 201:
            print 'Unable to create design document: %s %s' % (response.status,
                    response.reason)
            sys.exit(1)
    finally:
        conn.close()

# Querying the design document is nearly the same as usual, except that you reference
# couchdb-lucene's _fti HTTP handler
# $ curl http://localhost:5984/DB/_fti/_design/lucene/by_text?q=QUERY
try:
    conn.request('GET', '/%s/_fti/_design/lucene/by_text?q=%s' % (DB,
                 quote(QUERY)))
    response = conn.getresponse()
    if response.status == 200:
        response_body = json.loads(response.read())
    else:
        print 'An error occurred fetching the response: %s %s' \
            % (response.status, response.reason)
        print 'Make sure your couchdb-lucene server is running.'
        sys.exit(1)
finally:
    conn.close()

doc_ids = [row['id'] for row in response_body['rows']]

# pull the tweets from CouchDB and extract the text for display
tweets = [db.get(doc_id)['text'] for doc_id in doc_ids]
for tweet in tweets:
    print tweet
    print

Abbreviated output from the script, shown in Example 4-8, reveals that @n2vip appears in so many of @timoreilly's tweets because the two were engaging in Twitter conversations.

Example 4-8. Sample output from Example 4-7

@n2vip Thanks. Great stuff. Passing on to the ebook team.
@n2vip I suggested it myself the other day before reading this note.
RT @n2vip Check this out if you really want to get your '#Churchill on', a ...
@n2vip Remember a revolution that began with a handful of farmers and tradesmen ...
@n2vip Good suggestion re having free sample chapters as ebooks, not just pdfs...
@n2vip I got those statistics by picking my name off the influencer list in ...
RT @n2vip An informative, non-partisan FAQ regarding Health Care Reform at ...
@n2vip Don't know anyone who is advocating that. FWIW, it was Rs who turned ...
@n2vip No, I don't. But a lot of the people arguing against renewables seem ...
@n2vip You've obviously never read an ebook on the iPhone. It's a great reading ...
@n2vip I wasn't suggesting that insurance was the strange world, just that you ...
@n2vip In high tech, there is competition from immigrant workers. Yet these two ...
@n2vip How right you are. We really don't do a good job teaching people ...
@n2vip The climategate stuff is indeed disturbing. But I still hold by what ...
@n2vip FWIW, I usually do follow links, so do include them if appropriate. Thanks.
@n2vip I don't mind substantive disagreement - e.g. with pointers to real info ...
@n2vip Totally agree that ownership can help. But you need to understand why ...
@n2vip Maybe not completely extinct, but certainly economically extinct. E.g. ...
@n2vip I wasn't aware that it was part of a partisan agenda. Too bad, because ...
RT @n2vip if only interesed in his 'Finest Hour' speech, try this - a ...
@n2vip They matter a lot. I was also struck by that story this morning. Oil ...
@n2vip I understand that. I guess "don't rob MYsocialized medicine to fund ...
RT @n2vip Electronic medical record efforts pushing private practice docs to ...
@n2vip I think cubesail can be deployed in 2 ways: to force quicker re-entry ...
RT @ggreenwald Wolf Blitzer has major epiphany on public opinion and HCR that ...

Splicing in the other half of the conversation

It appears that @n2vip's many mentions are related to the fact that he engaged Tim in at least a couple of discussions, and from a quick skim of the comments, the discussions appear somewhat political in nature (which isn't terribly surprising given the high frequency of the #gov20 hashtag, noted in an exercise earlier in this chapter).
An interesting follow-on exercise is to augment Example 4-7 to extract the in_reply_to_status_id fields of the tweets and reassemble a more complete and readable version of the conversations. One technique for reconstructing the thread that minimizes the number of API calls required is to simply fetch all of @n2vip's tweets that have tweet IDs greater than the minimum tweet ID in the thread of interest, as opposed to collecting individual status IDs one by one. Example 4-9 shows how this could be done.

Example 4-9. Reconstructing tweet discussion threads (the_tweet__reassemble_discussion_thread.py)

# -*- coding: utf-8 -*-

import sys
import httplib
from urllib import quote
import json
import couchdb
from twitter__login import login
from twitter__util import makeTwitterRequest

DB = sys.argv[1]
USER = sys.argv[2]

try:
    server = couchdb.Server('http://localhost:5984')
    db = server[DB]
except couchdb.http.ResourceNotFound, e:
    print >> sys.stderr, """CouchDB database '%s' not found. Please check
that the database exists and try again.""" % DB
    sys.exit(1)

# query by term
try:
    conn = httplib.HTTPConnection('localhost', 5984)
    conn.request('GET', '/%s/_fti/_design/lucene/by_text?q=%s' % (DB,
                 quote(USER)))
    response = conn.getresponse()
    if response.status == 200:
        response_body = json.loads(response.read())
    else:
        print >> sys.stderr, 'An error occurred fetching the response: %s %s' \
            % (response.status, response.reason)
        sys.exit(1)
finally:
    conn.close()

doc_ids = [row['id'] for row in response_body['rows']]

# pull the tweets from CouchDB
tweets = [db.get(doc_id) for doc_id in doc_ids]

# mine out the in_reply_to_status_id fields and fetch those tweets as a batch request
conversation = sorted([(tweet['_id'], int(tweet['in_reply_to_status_id']))
                      for tweet in tweets
                      if tweet['in_reply_to_status_id'] is not None],
                      key=lambda x: x[1])

min_conversation_id = min([int(i[1]) for i in conversation if i[1] is not None])
max_conversation_id = max([int(i[1]) for i in conversation if i[1] is not None])

# Pull tweets from other user using user timeline API to minimize API expenses...

t = login()

reply_tweets = []
results = []
page = 1
while True:
    results = makeTwitterRequest(t, t.statuses.user_timeline,
                                 count=200,
                                 # Per the API docs, some caveats apply with the
                                 # oldest id you can fetch using "since_id"
                                 since_id=min_conversation_id,
                                 max_id=max_conversation_id,
                                 skip_users='true',
                                 screen_name=USER,
                                 page=page)
    reply_tweets += results
    page += 1
    if len(results) == 0:
        break

# During testing, it was observed that some tweets may not resolve or possibly
# even come back with null id values -- possibly a temporary fluke. Workaround.
missing_tweets = []
for (doc_id, in_reply_to_id) in conversation:
    try:
        print [rt for rt in reply_tweets if rt['id'] == in_reply_to_id][0]['text']
    except Exception, e:
        print >> sys.stderr, 'Refetching %s' % (in_reply_to_id, )
        results = makeTwitterRequest(t, t.statuses.show, id=in_reply_to_id)
        print results['text']

    # These tweets are already on hand
    print db.get(doc_id)['text']
    print

A lot of this code should look familiar by now. With the ability to authenticate into Twitter's API, store and retrieve data locally, and fetch remote data from Twitter, a lot can be done with a minimal amount of "business logic." Abbreviated sample output from this script, presented in Example 4-10, follows and shows the flow of discussion between @timoreilly and @n2vip.

Example 4-10. Sample output from Example 4-9

Question: If all Ins. Co. suddenly became non-profit and approved ALL Dr. ...
@n2vip Don't know anyone who is advocating that. FWIW, it was Rs who turned ...

@timoreilly RT @ggreenwald Wolf Blitzer has major epiphany on public opinion ...
RT @ggreenwald Wolf Blitzer has major epiphany on public opinion and HCR that ...

@timoreilly RE: Cubesail - I don't get it, does the sail collect loose trash ...
@n2vip I think cubesail can be deployed in 2 ways: to force quicker re-entry ...

@timoreilly How are you finding % of your RT have links? What service did you ...
@n2vip I got those statistics by picking my name off the influencer list in ...

@timoreilly a more fleshed-out e-book 'teaser' chapter idea here: http://bit.ly/aML6eH
@n2vip Thanks. Great stuff. Passing on to the ebook team.

@timoreilly Tim, #HCR law cuts Medicare payments to fund, in part, broader ...
@n2vip I understand that. I guess "don't rob MYsocialized medicine to fund ...

@timoreilly RE: Auto Immune - a "revolution" that is measured by a hand-full ...
@n2vip Remember a revolution that began with a handful of farmers and tradesmen ...
Do oil spills in Africa not matter? http://bit.ly/bKqv01 @jaketapper @yunjid ...
@n2vip They matter a lot. I was also struck by that story this morning. Oil ...

Who Does Tim Retweet Most Often?

Given the (not so) old adage that a retweet is the highest form of a compliment, another way of posing the question, "Who does Tim retweet most often?" is to ask, "Who does Tim compliment most often?" Or, because it wouldn't really make sense for him to retweet content he did not find interesting, we might ask, "Who does Tim think is talking about stuff that matters?" A reasonable hypothesis is that many of the user entities that appear in Tim's public timeline may be references to the authors of tweets that Tim is retweeting. Let's explore this idea further and calculate how many of Tim's tweets are retweets and, of those retweets, which tweet author gets retweeted the most often.

There are a number of tweet resource API methods available for collecting retweets, but given that we already have more than a few thousand tweets on hand from an earlier exercise, let's instead analyze those as a means of getting the maximum value out of our API calls. There are a couple of basic patterns for a retweet:

• RT @user Mary had a little lamb
• Mary had a little lamb (via @user)

In either case, the meaning is simple: someone is giving @user credit for a particular bit of information. Example 4-11 provides a sample program that extracts the number of times Tim has retweeted other Twitterers by applying a regular expression matcher as part of a simple map/reduce routine. Be advised that Twitter's API has been evolving quickly over the course of time this book was being written. Example 4-11 counts retweets by extracting clues such as "RT" from the tweet text itself.
By the time you're reading this, there may very well be a more efficient way to compute the number of times one user has retweeted another user for some duration.

Example 4-11. Counting the number of times Twitterers have been retweeted by someone (the_tweet__count_retweets_of_other_users.py)

# -*- coding: utf-8 -*-

import sys
import couchdb
from couchdb.design import ViewDefinition
from prettytable import PrettyTable

DB = sys.argv[1]

try:
    server = couchdb.Server('http://localhost:5984')
    db = server[DB]
except couchdb.http.ResourceNotFound, e:
    print """CouchDB database '%s' not found. Please check that the
database exists and try again.""" % DB
    sys.exit(1)

if len(sys.argv) > 2 and sys.argv[2].isdigit():
    FREQ_THRESHOLD = int(sys.argv[2])
else:
    FREQ_THRESHOLD = 3

# Map entities in tweets to the docs that they appear in
def entityCountMapper(doc):
    if doc.get('text'):
        import re
        m = re.search(r"(RT|via)((?:\b\W*@\w+)+)", doc['text'])
        if m:
            entities = m.groups()[1].split()
            for entity in entities:
                yield (entity.lower(), [doc['_id'], doc['id']])
        else:
            yield ('@', [doc['_id'], doc['id']])

def summingReducer(keys, values, rereduce):
    if rereduce:
        return sum(values)
    else:
        return len(values)

view = ViewDefinition('index', 'retweet_entity_count_by_doc', entityCountMapper,
                      reduce_fun=summingReducer, language='python')
view.sync(db)

# Sorting by value in the client is cheap and easy
# if you're dealing with hundreds or low thousands of tweets
entities_freqs = sorted([(row.key, row.value) for row in
                        db.view('index/retweet_entity_count_by_doc', group=True)],
                        key=lambda x: x[1], reverse=True)

fields = ['Entity', 'Count']
pt = PrettyTable(fields=fields)
[pt.set_field_align(f, 'l') for f in fields]

for (entity, freq) in entities_freqs:
    if freq > FREQ_THRESHOLD and entity != '@':
        pt.add_row([entity, freq])

pt.printt()

If you think the results will look almost identical to the raw entity counts computed earlier in Table 4-1, you'll be somewhat surprised. The listing below also uses a threshold of 15, to juxtapose the difference.

Most frequent entities appearing in retweets by @timoreilly; additional columns illustrate normalization of retweet counts by @timoreilly

Entity          Number of times @user retweeted by @timoreilly   Total tweets ever by @user   Normalized retweet score
@monkchips      30   33215   0.000903206
@ahier          18   14849   0.001212203
@slashdot       41   22081   0.0018568
@gnat           28   11322   0.002473061
@mikeloukides   15   2926    0.005126452
@pahlkadot      16   3109    0.005146349
@oreillymedia   97   6623    0.014645931
@jamesoreilly   34   4439    0.007659383
@dalepd         16   1589

So, who does Tim retweet/compliment the most? Well, not so surprisingly, his company and folks closely associated with his company compose the bulk of the list, with @oreillymedia coming in at the top of the pack. A visual inspection of retweet tallies and raw entity tallies shows an obvious correlation. Keep in mind, however, that the results shown in the "Number of times @user retweeted by @timoreilly" column in the table are not normalized. For example, @dalepd was retweeted far fewer times than @oreillymedia, but a brief inspection of the statuses_count field that's embedded in the user objects in tweets (among other places in the API results) reveals that @dalepd tweets far less frequently than @oreillymedia. In fact, if you normalize the data by dividing the number of retweets by the total number of tweets for each user, @dalepd just barely ranks second to @oreillymedia, meaning that Tim retweets more of his tweets than those of any other user in the table except for @oreillymedia, even though it doesn't necessarily appear that way if you sort by raw retweet frequency. Figure 4-2 illustrates this analysis as a bubble chart.
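The normalization just described is a single division per user: the retweet count over the user's statuses_count. A quick sketch using three of the (retweets, total tweets) pairs from the table above:

```python
# (times retweeted by @timoreilly, total tweets ever) per user, from the table
data = {
    '@oreillymedia': (97, 6623),
    '@dalepd': (16, 1589),
    '@monkchips': (30, 33215),
}

# Normalize each user's retweet count by total tweet volume, highest score first
normalized = sorted(((rt / float(total), user)
                    for (user, (rt, total)) in data.items()), reverse=True)

for score, user in normalized:
    print('%s\t%0.6f' % (user, score))
```

Running this confirms the observation in the text: @dalepd ranks second only to @oreillymedia once volume is accounted for, despite a much smaller raw retweet count.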
Additional tinkering to determine why some folks aren't retweeted as frequently as they are mentioned would also be interesting.

Figure 4-2. The y-axis of this bubble chart depicts the raw number of times a user has been retweeted; the area of the bubble represents a normalized score for how many times the user has been retweeted compared to the total number of times he has ever tweeted

Given that it's not too difficult to determine who Tim retweets most often, what about asking the question the other way: who retweets Tim most often? Answering this question in aggregate without a focused target population would be a bit difficult, if not impossible given the Twitter API limitations, as Tim has ~1,500,000 followers. However, it's a bit more tractable if you narrow it down to a reasonably small target population. The next section investigates some options.

What's Tim's Influence?

Asking how many of your tweets get retweeted is really just another way of measuring your own influence. If you tweet a lot and nobody retweets you, it's safe to say that your influence is pretty weak—at least as a Twitterer. In fact, it would be somewhat of a paradox to find yourself having the good fortune of many followers but not many retweets, because you generally get followers and retweets for the very same reason: you're interesting and influential! One base metric that's quite simple and inexpensive to calculate is the ratio of tweets to retweets. A ratio of 1 would mean that every single tweet you've authored was retweeted and indicate that your influence is strong—potentially having second- and third-level effects that reach millions of unique users (literally)—while values closer to 0 show weaker influence.
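This base metric can be computed directly from a list of per-tweet retweet counts. The sample counts below are made up purely for illustration:

```python
# Hypothetical retweet_count values, one per authored tweet
counts = [0, 3, 0, 1, 7, 0, 2]

# A tweet "counts" toward the ratio if it was retweeted at least once
retweeted = sum(1 for c in counts if c > 0)
ratio = retweeted / float(len(counts))

print('%d of %d tweets retweeted (ratio %.2f)' % (retweeted, len(counts), ratio))
```

Example 4-12 below computes the same ratio at scale from the retweet_count field stored with each tweet in CouchDB.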
Of course, given the nature of Twitter, it's highly unlikely that any human user would have a tweet-to-retweet ratio of 1, if for no other reason than the mechanics of conversation (@replies) would drive the ratio downward. Twitter exposes the statuses/retweets_of_me resource, which provides the authenticating user with insight on which of her tweets have been retweeted. However, we don't have access to that API to analyze Tim's retweets, so we need to look for another outlet.

Example 4-12 takes advantage of the retweet_count field in a tweet to compute the number of tweets that have been retweeted a given number of times as part of what should now seem like a trivial map/reduce combination. Sample output formatted into a chart follows the example.

Example 4-12. Finding the tweets that have been retweeted most often (the_tweet__count_retweets_by_others.py)

# -*- coding: utf-8 -*-

import sys
import couchdb
from couchdb.design import ViewDefinition
from prettytable import PrettyTable
from twitter__util import pp

DB = sys.argv[1]

try:
    server = couchdb.Server('http://localhost:5984')
    db = server[DB]
except couchdb.http.ResourceNotFound, e:
    print """CouchDB database '%s' not found. Please check that the
database exists and try again.""" % DB
    sys.exit(1)

# Map entities in tweets to the docs that they appear in
def retweetCountMapper(doc):
    if doc.get('id') and doc.get('text'):
        yield (doc['retweet_count'], 1)

def summingReducer(keys, values, rereduce):
    return sum(values)

view = ViewDefinition('index', 'retweets_by_id', retweetCountMapper,
                      reduce_fun=summingReducer, language='python')
view.sync(db)

fields = ['Num Tweets', 'Retweet Count']
pt = PrettyTable(fields=fields)
[pt.set_field_align(f, 'l') for f in fields]

retweet_total, num_tweets, num_zero_retweets = 0, 0, 0
for (k, v) in sorted([(row.key, row.value) for row in
                     db.view('index/retweets_by_id', group=True)
                     if row.key is not None],
                     key=lambda x: x[0], reverse=True):
    pt.add_row([k, v])

    if k == "100+":
        retweet_total += 100 * v
    elif k == 0:
        num_zero_retweets += v
    else:
        retweet_total += k * v

    num_tweets += v

pt.printt()

print '\n%s of %s authored tweets were retweeted at least once' % \
    (pp(num_tweets - num_zero_retweets), pp(num_tweets),)
print '\t(%s tweet/retweet ratio)\n' % \
    (1.0 * (num_tweets - num_zero_retweets) / num_tweets,)

print 'Those %s authored tweets generated %s retweets' % (pp(num_tweets),
        pp(retweet_total),)

Note that as of late December 2010, the retweet_count field maxes out at 100. For this particular batch of data, there were 2 tweets that had been retweeted exactly 100 times, and 59 tweets that were retweeted "100+" times.

Figure 4-3 displays sample results from Example 4-12 that are formatted into a more compact chart that uses a logarithmic scale to squash the y-axis. Values along the x-axis correspond to the number of tweets with a given retweet value denoted by the y-axis. The total "area under the curve" is just over 3,000—the total number of tweets being analyzed. For example, just over 533 of Tim's tweets weren't retweeted at all, as denoted by the far left column; 50 of his tweets were retweeted 50 times; and over 60 of his tweets were retweeted over 100 times, as denoted by the far right column.

Figure 4-3. Sample results from Example 4-12

The distribution isn't too surprising in that it generally trends according to the power law and that there are a fairly high number of tweets that went viral and were retweeted what could have been many hundreds of times. The high-level takeaways are that of over 3,000 total tweets, 2,536 of them were retweeted at least one time (a ratio of about 0.80) and generated over 50,000 retweets in all (a factor of about 16).
To say the very least, the numbers confirm Tim's status as an influential information maven.

How Many of Tim's Tweets Contain Hashtags?

It seems a reasonable theory that tweets containing hashtag entities are inherently more valuable than ones that don't, because someone has deliberately gone to the trouble of embedding aggregatable information into those tweets, which essentially transforms them into semistructured information. For example, it seems reasonable to assume that someone who averages 2+ hashtags per tweet is very interested in bridging knowledge and aware of the power of information, whereas someone who averages 0.1 hashtags per tweet probably is less so.

What's a Folksonomy?

A fundamental aspect of human intelligence is the desire to classify things and derive a hierarchy in which each element "belongs to" or is a "child" of a parent element one level higher in the hierarchy. Leaving aside philosophical debates about the difference between a taxonomy and an ontology, a taxonomy is essentially a hierarchical structure that classifies elements into parent/child bins. The term folksonomy was coined around 2004 to describe the universe of collaborative tagging and social indexing efforts that emerge in various ecosystems of the Web, and it's a play on words in the sense that it blends "folk" and "taxonomy." So, in essence, a folksonomy is just a fancy way of describing the decentralized universe of tags that emerges as a mechanism of collective intelligence when you allow people to classify content with labels.

Computing the average number of hashtags per tweet should be a cakewalk for you by now. We'll recycle some code and compute the total number of hashtags in one map/reduce phase, compute the total number of tweets in another map/reduce phase, and then divide the two numbers, as illustrated in Example 4-13.

Example 4-13.
Counting hashtag entities in tweets (the_tweet__avg_hashtags_per_tweet.py)

# -*- coding: utf-8 -*-

import sys
import couchdb
from couchdb.design import ViewDefinition

DB = sys.argv[1]

try:
    server = couchdb.Server('http://localhost:5984')
    db = server[DB]
except couchdb.http.ResourceNotFound, e:
    print """CouchDB database '%s' not found.
Please check that the database exists and try again.""" % DB
    sys.exit(1)

# Emit the number of hashtags in a document
def entityCountMapper(doc):
    if not doc.get('entities'):
        import twitter_text

        def getEntities(tweet):
            # Now extract various entities from it and build up a familiar structure
            extractor = twitter_text.Extractor(tweet['text'])

            # Note that the production Twitter API contains a few additional
            # fields in the entities hash that would require additional API
            # calls to resolve
            entities = {}

            entities['user_mentions'] = []
            for um in extractor.extract_mentioned_screen_names_with_indices():
                entities['user_mentions'].append(um)

            entities['hashtags'] = []
            for ht in extractor.extract_hashtags_with_indices():
                # Massage the field name to match the production Twitter API
                ht['text'] = ht['hashtag']
                del ht['hashtag']
                entities['hashtags'].append(ht)

            entities['urls'] = []
            for url in extractor.extract_urls_with_indices():
                entities['urls'].append(url)

            return entities

        doc['entities'] = getEntities(doc)

    if doc['entities'].get('hashtags'):
        yield (None, len(doc['entities']['hashtags']))

def summingReducer(keys, values, rereduce):
    return sum(values)

view = ViewDefinition('index', 'count_hashtags', entityCountMapper,
                      reduce_fun=summingReducer, language='python')
view.sync(db)

num_hashtags = [row for row in db.view('index/count_hashtags')][0].value

# Now, count the total number of tweets that aren't direct replies
def entityCountMapper(doc):
    if doc.get('text')[0] == '@':
        yield (None, 0)
    else:
        yield (None, 1)

view = ViewDefinition('index', 'num_docs', entityCountMapper,
                      reduce_fun=summingReducer, language='python')
view.sync(db)

num_docs = [row for row in db.view('index/num_docs')][0].value

# Finally, compute the average
print 'Avg number of hashtags per tweet for %s: %s' % \
    (DB.split('-')[-1], 1.0 * num_hashtags / num_docs,)

For a recent batch we fetched earlier, running this script reveals that Tim averages about 0.5 hashtags per tweet that is not a direct reply to someone. In other words, he includes a hashtag in about half of his tweets. For anyone who regularly tweets, including a hashtag that much of the time provides a substantial contribution to the overall Twitter search index and the ever-evolving folksonomy. As a follow-up exercise, it could be interesting to compute the average number of hyperlink entities per tweet, or even go so far as to follow the links and try to discover new information about Tim's interests by inspecting the titles or content of the linked web pages.

Juxtaposing Latent Social Networks (or #JustinBieber Versus #TeaParty)

One of the most fascinating aspects of data mining is that it affords you the ability to discover new knowledge from existing information. There really is something to be said for the old adage that "knowledge is power," and it's especially true in an age where the amount of information available is steadily growing with no indication of decline. As an interesting exercise, let's see what we can discover about some of the latent social networks that exist in the sea of Twitter data. The basic approach we'll take is to collect some focused data on two or more topics in a specific way by searching on a particular hashtag, and then apply some of the same metrics we coded up in the previous section (where we analyzed Tim's tweets) to get a feel for the similarity between the networks. Since there's no such thing as a "stupid question," let's move forward in the spirit of famed economist Steven D. Levitt and ask the question, "What do #TeaParty and #JustinBieber have in common?"

5. Steven D. Levitt is the co-author of Freakonomics: A Rogue Economist Explores the Hidden Side of Everything (Harper), a book that systematically uses data to answer seemingly radical questions such as, "What do school teachers and sumo wrestlers have in common?"

6. This question was partly inspired by the interesting Radar post, "Data science democratized", which mentions a presentation that investigated the same question.

Example 4-14 provides a simple mechanism for collecting approximately the most recent 1,500 tweets (the maximum currently returned by the search API) on a particular topic and storing them away in CouchDB. Like other listings you've seen earlier in this chapter, it includes simple map/reduce logic to incrementally update the tweets in the event that you'd like to run it over a longer period of time to collect a larger batch of data than the search API can give you in a short duration. You might want to investigate the streaming API for this type of task.

Example 4-14.
Harvesting tweets for a given query (the_tweet__search.py)

# -*- coding: utf-8 -*-

import sys
import twitter
import couchdb
from couchdb.design import ViewDefinition
from twitter__util import makeTwitterRequest

SEARCH_TERM = sys.argv[1]
MAX_PAGES = 15

KW = {
    'domain': 'search.twitter.com',
    'count': 200,
    'rpp': 100,
    'q': SEARCH_TERM,
}

server = couchdb.Server('http://localhost:5984')
DB = 'search-%s' % (SEARCH_TERM.lower().replace('#', '').replace('@', ''), )

try:
    db = server.create(DB)
except couchdb.http.PreconditionFailed, e:
    # Already exists, so append to it, and be mindful of duplicates
    db = server[DB]

t = twitter.Twitter(domain='search.twitter.com')

for page in range(1, MAX_PAGES + 1):
    KW['page'] = page
    tweets = makeTwitterRequest(t, t.search, **KW)
    db.update(tweets['results'], all_or_nothing=True)
    if len(tweets['results']) == 0:
        break
    print 'Fetched %i tweets' % len(tweets['results'])

The following sections are based on approximately 3,000 tweets per topic and assume that you've run the script to collect data on #TeaParty and #JustinBieber (or any other topics that interest you).

Depending on your terminal preferences, you may need to escape certain characters (such as the hash symbol) because of the way they might be interpreted by your shell. For example, in Bash, you'd need to escape a hashtag query for #TeaParty as \#TeaParty to ensure that the shell interprets the hash symbol as part of the query term, instead of as the beginning of a comment.

Which Entities Co-Occur Most Often with #JustinBieber and #TeaParty Tweets?

One of the simplest yet probably most effective ways to characterize two different crowds is to examine the entities that appear in an aggregate pool of tweets. In addition to giving you a good idea of the other topics that each crowd is talking about, you can compare the entities that do co-occur to arrive at a very rudimentary similarity metric.
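Building that aggregate entity pool is, at bottom, a frequency count. A minimal in-memory sketch, with hypothetical tweet documents standing in for the CouchDB rows and collections.Counter standing in for the map/reduce views used elsewhere in this chapter:

```python
from collections import Counter

# Hypothetical tweet documents; in practice these would come from the
# search-teaparty or search-justinbieber CouchDB databases
tweets = [
    {'entities': {'hashtags': [{'text': 'teaparty'}, {'text': 'tcot'}]}},
    {'entities': {'hashtags': [{'text': 'teaparty'}]}},
    {'entities': {'hashtags': [{'text': 'tcot'}, {'text': 'p2'}]}},
]

# Tally every hashtag entity across the aggregate pool of tweets
freqs = Counter('#' + ht['text']
                for t in tweets
                for ht in t['entities'].get('hashtags', []))

print(freqs.most_common(3))
```

Two such frequency distributions, one per crowd, are exactly the input that the comparisons in the following sections work from.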
Example 4-4 already provides the logic we need to perform a first pass at entity analysis. Assuming you've run search queries for #JustinBieber and #TeaParty, you should have two CouchDB databases called "search-justinbieber" and "search-teaparty" that you can pass in to produce your own results. Sample results for each hashtag with an entity frequency greater than 20 follow in Tables 4-2 and 4-3; Figure 4-4 displays a chart conveying the underlying frequency distributions for these tables. Because the y-axis contains such extremes, it is adjusted to a logarithmic scale, which makes the y values easier to read.

Table 4-2. Most frequent entities appearing in tweets containing #TeaParty

Entity                         Frequency
#teaparty                      2834
#tcot                          2426
#p2                            911
#tlot                          781
#gop                           739
#ocra                          649
#sgp                           567
#twisters                      269
#dnc                           175
#tpp                           170
#GOP                           150
#iamthemob                     123
#ucot                          120
#libertarian                   112
#obama                         112
#vote2010                      109
#TeaParty                      106
#hhrs                          104
#politics                      100
#immigration                   97
#cspj                          96
#acon                          91
#dems                          82
#palin                         79
#topprog                       78
#Obama                         74
#tweetcongress                 72
#jcot                          71
#Teaparty                      62
#rs                            60
#oilspill                      59
#news                          58
#glennbeck                     54
#FF                            47
#liberty                       47
@welshman007                   45
#spwbt                         44
#TCOT                          43
http://tinyurl.com/24h36zq     43
#rnc                           42
#military                      40
#palin12                       40
@Drudge_Report                 39
@ALIPAC                        35
#majority                      35
#NoAmnesty                     35
#patriottweets                 35
@ResistTyranny                 34
#tsot                          34
http://tinyurl.com/386k5hh     31
#conservative                  30
#AZ                            29
#TopProg                       29
@JIDF                          28
@STOPOBAMA2012                 28
@TheFlaCracker                 28
#palin2012                     28
@thenewdeal                    27
#AFIRE                         27
#Dems                          27
#asamom                        26
#GOPDeficit                    25
#wethepeople                   25
@andilinks                     24
@RonPaulNews                   24
#ampats                        24
#cnn                           24
#jews                          24
@First_Patriots                23
#patriot                       23
#pjtv                          23
@Liliaep                       22
#nvsen                         22
@BrnEyeSuss                    21
@crispix49                     21
@koopersmith                   21
@Kriskxx                       21
#Kagan                         21
@blogging_tories               20
#cdnpoli                       20
#fail                          20
#nra                           20
#roft                          20

Table 4-3. Most frequent entities appearing in tweets containing #JustinBieber

Entity                         Frequency
#justinbieber                  1613
#JustinBieber                  1379
@lojadoaltivo                  354
@ProSieben                     258
#JUSTINBIEBER                  191
#Proform                       191
http://migre.me/TJwj           191
#Justinbieber                  107
#nowplaying                    104
@justinbieber                  99
#music                         88
#tickets                       80
@_Yassi_                       78
#musicmonday                   78
#video                         78
#Dschungel                     74
#Celebrity                     42
#beliebers                     38
#BieberFact                    38
@JustBieberFact                32
@TinselTownDirt                32
@rheinzeitung                  28
#WTF                           28
http://tinyurl.com/343kax4     28
#Telezwerge                    26
#Escutando                     22
#justinBieber                  22
#Restart                       22
#TT                            22
http://bit.ly/aARD4t           21
http://bit.ly/b2Kc1L           21
#bieberblast                   20
#Eclipse                       20
#somebodytolove                20

Figure 4-4. Distribution of entities co-occurring with #JustinBieber and #TeaParty

What's immediately obvious is that the #TeaParty tweets seem to have a lot more area "under the curve" and a much longer tail (if you can even call it a tail) than the #JustinBieber tweets. Thus, at a glance, it would seem that the average number of hashtags for #TeaParty tweets would be higher than for #JustinBieber tweets. The next section investigates this assumption, but before we move on, let's make a few more observations about these results. A cursory qualitative assessment of the results seems to indicate that the information encoded into the entities themselves is richer for #TeaParty. For example, in the #TeaParty entities, we see topics such as #oilspill, #Obama, #palin, #libertarian, and @Drudge_Report, among others.

7. A "long tail" or "heavy tail" refers to a feature of statistical distributions in which a significant portion (usually 50 percent or more) of the area under the curve exists within its tail.
In contrast, many of the most frequently occurring #JustinBieber entities are simply variations of #JustinBieber, with the rest of the hashtags being somewhat scattered and unfocused. Keep in mind, however, that this isn't all that unexpected, given that #TeaParty is a very political topic whereas #JustinBieber is associated with pop culture and entertainment.

Some other observations are that a couple of user entities (@lojadoaltivo and @ProSieben) appear in the top few results—higher than the "official" @justinbieber account itself—and that many of the entities that co-occur most often with #JustinBieber are non-English words or user entities, often associated with the entertainment industry.

Having briefly scratched the surface of a qualitative assessment, let's now return to the question of whether there are definitively more hashtags per tweet for #TeaParty than #JustinBieber.

On Average, Do #JustinBieber or #TeaParty Tweets Have More Hashtags?

Example 4-13 provides a working implementation for counting the average number of hashtags per tweet and can be readily applied to the search-justinbieber and search-teaparty databases without any additional work required.

Tallying the results for the two databases reveals that #JustinBieber tweets average around 1.95 hashtags per tweet, while #TeaParty tweets have around 5.15 hashtags per tweet. That's approximately 2.5 times more hashtags for #TeaParty tweets than for #JustinBieber tweets. Although this isn't necessarily the most surprising find in the world, having firm data points on which to base further explorations or to back up conjectures is helpful: they are quantifiable results that can be tracked over time, or shared and reassessed by others.

Although the difference in this case is striking, keep in mind that the data collected is whatever Twitter handed us back as the most recent ~3,000 tweets for each topic via the search API.
It isn't necessarily statistically significant, even though it is probably a very good indicator and very well may be so. Whether they realize it or not, #TeaParty Twitterers are big believers in folksonomies: they clearly have a vested interest in ensuring that their content is easily accessible and cross-referenced via search APIs and data hackers such as ourselves.

Which Gets Retweeted More Often: #JustinBieber or #TeaParty?

Earlier in this chapter, we made the reasonable conjecture that tweets that are retweeted with high frequency are likely to be more influential and more informative or editorial in nature than ones that are not. Tweets such as "Eating a pretzel" and "Aliens have just landed on the White House front lawn; we are all going to die! #fail #apocalypse" are extreme examples of content that is fairly unlikely and fairly likely to be retweeted, respectively. How does #TeaParty compare to #JustinBieber for retweets? Analyzing @mentions from the working set of search results again produces interesting results. Truncated results showing which users have retweeted #TeaParty and #JustinBieber most often, using a frequency threshold of 10, appear in Tables 4-4 and 4-5.

Table 4-4. Most frequent retweeters of #TeaParty

Entity                Frequency
@teapartyleader       10
@dhrxsol1234          11
@HCReminder           11
@ObamaBallBuster      11
@spitfiremurphy       11
@GregWHoward          12
@BrnEyeSuss           13
@Calroofer            13
@grammy620            13
@Herfarm              14
@andilinks            16
@c4Liberty            16
@FloridaPundit        16
@tlw3                 16
@Kriskxx              18
@crispix49            19
@JIDF                 19
@libertyideals        19
@blogging_tories      20
@Liliaep              21
@STOPOBAMA2012        22
@First_Patriots       23
@RonPaulNews          23
@TheFlaCracker        24
@thenewdeal           25
@ResistTyranny        29
@ALIPAC               32
@Drudge_Report        38
@welshman007          39

Table 4-5.
Most frequent retweeters of #JustinBieber

Entity                Frequency
@justinbieber         14
@JesusBeebs           16
@LeePhilipEvans       16
@JustBieberFact       32
@TinselTownDirt       32
@ProSieben            122
@lojadoaltivo         189

If you do some back-of-the-envelope analysis by running Example 4-4 on the ~3,000 tweets for each topic, you'll discover that about 1,335 of the #TeaParty tweets are retweets, while only about 763 of the #JustinBieber tweets are retweets. That's practically twice as many retweets for #TeaParty as for #JustinBieber. You'll also observe that #TeaParty has a much longer tail, checking in with over 400 distinct retweeters against #JustinBieber's 131. Regardless of statistical rigor, intuitively, those are probably pretty relevant indicators that size up the different interest groups in meaningful ways. It would seem that #TeaParty folks more consistently retweet content than #JustinBieber folks; however, of the #JustinBieber folks who do retweet content, there are clearly a few outliers who retweet much more frequently than others. Figure 4-5 displays a simple chart of the values from Tables 4-4 and 4-5. As with Figure 4-4, the y-axis is a log scale, which makes the chart a little more readable by squashing the frequency values so that they require less vertical space.

Figure 4-5. Distribution of users who have retweeted #JustinBieber and #TeaParty

How Much Overlap Exists Between the Entities of #TeaParty and #JustinBieber Tweets?

A final looming question that might be keeping you up at night is how much overlap exists between the entities parsed out of the #TeaParty and #JustinBieber tweets. Borrowing from some of the ideas discussed previously, we're essentially asking for the logical intersection of the two sets of entities.
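In pure Python terms, that question is just set intersection, and dividing the intersection by the union yields a Jaccard similarity score if you want a single number instead of a list. A minimal sketch, with hypothetical entity sets standing in for the real output of Example 4-4:

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: |A & B| / |A | B|, a value in [0, 1]."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / float(len(a | b))

# Hypothetical entity sets standing in for the real Example 4-4 output
teaparty = {'#teaparty', '#tcot', '#glennbeck', '#oilspill', '#fail'}
bieber = {'#justinbieber', '#nowplaying', '#music', '#oilspill', '#fail'}

print(sorted(teaparty & bieber))   # the shared entities
print(jaccard(teaparty, bieber))   # 2 shared out of 8 distinct terms -> 0.25
```

Weighting the score by how frequently each shared entity appears in each search is one natural refinement of this metric.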
Although we could certainly compute this by taking the time to adapt existing Python code, it might be even easier to just capture the results of the scripts we already have on hand into two files and pass those filenames as parameters into a disposable script that provides a general-purpose facility for computing the intersection of any line-delimited files. In addition to getting the job done, this approach also leaves you with artifacts that you can casually peruse and readily share with others. Assuming you are working in a *nix shell with the script the_tweet__count_entities_in_tweets.py, one approach for capturing the entities from the #TeaParty and #JustinBieber output of Example 4-4 and storing them in sorted order follows:

#!/bin/bash

mkdir -p out

for db in teaparty justinbieber; do
    python the_tweet__count_entities_in_tweets.py search-$db 0 | \
        tail -n +3 | awk '{print $2}' | sort > out/$db.entities
done

After you've run this script, you can pass the two filenames into the general-purpose Python program to compute the output, as shown in Example 4-15.

Example 4-15. Computing the set intersection of lines in files (the_tweet__compute_intersection_of_lines_in_files.py)

# -*- coding: utf-8 -*-

"""
Read in 2 or more files and compute the logical
intersection of the lines in them
"""

import sys

data = {}
for i in range(1, len(sys.argv)):
    data[sys.argv[i]] = set(open(sys.argv[i]).readlines())

# Start from the first file's lines and intersect with each of the others
keys = list(data.keys())
intersection = data[keys[0]]
for k in keys[1:]:
    intersection = intersection.intersection(data[k])

msg = 'Common items shared amongst %s:' % ', '.join(keys).strip()
print msg
print '-' * len(msg)
for i in intersection:
    print i.strip()

The entities shared between #JustinBieber and #TeaParty are somewhat predictable, yet interesting. Example 4-16 lists the results from our sample.

Example 4-16.
Sample results from Example 4-15

Common items shared amongst teaparty.entities, justinbieber.entities:
---------------------------------------------------------------------
#lol
#jesus
#worldcup
#teaparty
#AZ
#milk
#ff
#guns
#WorldCup
#bp
#News
#dancing
#music
#glennbeck
http://www.linkati.com/q/index
@addthis
#nowplaying
#news
#WTF
#fail
#toomanypeople
#oilspill
#catholic

It shouldn't be surprising that #WorldCup, #worldcup, and #oilspill are in the results, given that they're pretty popular topics; however, having #teaparty, #glennbeck, #jesus, and #catholic show up on the list of shared hashtags might be somewhat of a surprise if you're not that familiar with the Tea Party movement. Further analysis could very easily determine exactly how strong the correlations are between the two searches by accounting for how frequently certain hashtags appear in each search. One thing that's immediately clear from these results is that none of these common entities appears in the top 20 most frequent entities associated with #JustinBieber, so that's already an indicator that they're out in the tail of the frequency distribution for #JustinBieber mentions. (And yes, having #WTF and #fail show up on the list at all, especially as a common thread between two diverse groups, is sort of funny. Experiencing frustration is, unfortunately, a common thread of humanity.) If you want to dig deeper, as a further exercise you might reuse Example 4-7 to enable full-text indexing on the tweets in order to search by keyword.

Visualizing Tons of Tweets

There are more interesting ways to visualize Twitter data than we could possibly cover in this short chapter, but that won't stop us from working through a couple of exercises with some of the more obvious approaches that provide a good foundation.
In particular, we'll look at loading tweet entities into tag clouds and visualizing "connections" among users with graphs.

Visualizing Tweets with Tricked-Out Tag Clouds

Tag clouds are among the most obvious choices for visualizing the extracted entities from tweets. There are a number of interesting tag cloud widgets that you can find on the Web to do all of the hard work, and they all take the same input—essentially, a frequency distribution like the ones we've been computing throughout this chapter. But why visualize data with an ordinary tag cloud when you could use a highly customizable Flash-based rotating tag cloud? There just so happens to be a quite popular open source rotating tag cloud called WP-Cumulus that puts on a nice show. All that's needed to put it to work is to produce the simple input format that it expects and feed that input format to a template containing the standard HTML boilerplate.

Example 4-17 is a trivial adaptation of Example 4-4 that illustrates a routine emitting a simple JSON structure (a list of [term, URL, frequency] tuples) that can be fed into an HTML template for WP-Cumulus. We'll pass in empty strings for the URL portion of those tuples, but you could use your imagination and hyperlink to a simple web service that displays a list of tweets containing the entities. (Recall that Example 4-7 provides just about everything you'd need to wire this up by using couchdb-lucene to perform a full-text search on tweets stored in CouchDB.) Another option might be to write a web service and link to a URL that provides any tweet containing the specified entity.

Example 4-17.
Generating the data for an interactive tag cloud using WP-Cumulus (the_tweet__tweet_tagcloud_code.py)

# -*- coding: utf-8 -*-

import os
import sys
import webbrowser
import json
from cgi import escape
from math import log
import couchdb
from couchdb.design import ViewDefinition

DB = sys.argv[1]
MIN_FREQUENCY = int(sys.argv[2])

HTML_TEMPLATE = '../web_code/wp_cumulus/tagcloud_template.html'
MIN_FONT_SIZE = 3
MAX_FONT_SIZE = 20

server = couchdb.Server('http://localhost:5984')
db = server[DB]

# Map entities in tweets to the docs that they appear in
def entityCountMapper(doc):
    if not doc.get('entities'):
        import twitter_text

        def getEntities(tweet):
            # Now extract various entities from it and build up a familiar structure
            extractor = twitter_text.Extractor(tweet['text'])

            # Note that the production Twitter API contains a few additional
            # fields in the entities hash that would require additional API
            # calls to resolve
            entities = {}

            entities['user_mentions'] = []
            for um in extractor.extract_mentioned_screen_names_with_indices():
                entities['user_mentions'].append(um)

            entities['hashtags'] = []
            for ht in extractor.extract_hashtags_with_indices():
                # Massage the field name to match the production Twitter API
                ht['text'] = ht['hashtag']
                del ht['hashtag']
                entities['hashtags'].append(ht)

            entities['urls'] = []
            for url in extractor.extract_urls_with_indices():
                entities['urls'].append(url)

            return entities

        doc['entities'] = getEntities(doc)

    if doc['entities'].get('user_mentions'):
        for user_mention in doc['entities']['user_mentions']:
            yield ('@' + user_mention['screen_name'].lower(),
                   [doc['_id'], doc['id']])

    if doc['entities'].get('hashtags'):
        for hashtag in doc['entities']['hashtags']:
            yield ('#' + hashtag['text'], [doc['_id'], doc['id']])

def summingReducer(keys, values, rereduce):
    if rereduce:
        return sum(values)
    else:
        return len(values)

view = ViewDefinition('index', 'entity_count_by_doc', entityCountMapper,
                      reduce_fun=summingReducer, language='python')
view.sync(db)

entities_freqs = [(row.key, row.value)
                  for row in db.view('index/entity_count_by_doc', group=True)]

# Create output for the WP-Cumulus tag cloud and sort terms by freq along the way
raw_output = sorted([[escape(term), '', freq]
                     for (term, freq) in entities_freqs
                     if freq > MIN_FREQUENCY],
                    key=lambda x: x[2])

# Implementation adapted from
# http://help.com/post/383276-anyone-knows-the-formula-for-font-s

min_freq = raw_output[0][2]
max_freq = raw_output[-1][2]

def weightTermByFreq(f):
    return (f - min_freq) * (MAX_FONT_SIZE - MIN_FONT_SIZE) / \
        (max_freq - min_freq) + MIN_FONT_SIZE

weighted_output = [[i[0], i[1], weightTermByFreq(i[2])] for i in raw_output]

# Substitute the JSON data structure into the template
html_page = open(HTML_TEMPLATE).read() % (json.dumps(weighted_output),)

if not os.path.isdir('out'):
    os.mkdir('out')

f = open(os.path.join('out', os.path.basename(HTML_TEMPLATE)), 'w')
f.write(html_page)
f.close()

print 'Tagcloud stored in: %s' % f.name

# Open up the web page in your browser
webbrowser.open("file://" + os.path.join(os.getcwd(), 'out',
                os.path.basename(HTML_TEMPLATE)))

The most notable portion of the listing is the formula in weightTermByFreq that weights the tags in the cloud:

    fontSize = (freq - min_freq) * (MAX_FONT_SIZE - MIN_FONT_SIZE)
                   / (max_freq - min_freq) + MIN_FONT_SIZE

This formula weights term frequencies such that they are linearly squashed between MIN_FONT_SIZE and MAX_FONT_SIZE by taking into account the frequency for the term in question along with the maximum and minimum frequency values for the data. There are many variations that could be applied to this formula, and the incorporation of logarithms isn't all that uncommon. Kevin Hoffman's paper, "In Search of the Perfect Tag Cloud", provides a nice overview of various design decisions involved in crafting tag clouds and is a useful starting point if you're interested in taking a deeper dive into tag cloud design.
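To make the logarithmic variation concrete, here is a version-agnostic sketch contrasting the linear squash from Example 4-17 with a log-scaled one that damps the effect of a few very frequent terms. The MIN_FONT_SIZE and MAX_FONT_SIZE values match the listing; the frequencies are made up for illustration:

```python
from math import log

MIN_FONT_SIZE, MAX_FONT_SIZE = 3, 20

def weight_linear(f, min_freq, max_freq):
    """Linearly squash a frequency into [MIN_FONT_SIZE, MAX_FONT_SIZE]."""
    return ((f - min_freq) * (MAX_FONT_SIZE - MIN_FONT_SIZE)
            / float(max_freq - min_freq)) + MIN_FONT_SIZE

def weight_log(f, min_freq, max_freq):
    """The same squash on a log scale; outliers no longer dominate the sizing."""
    lo, hi = log(min_freq), log(max_freq)
    return ((log(f) - lo) * (MAX_FONT_SIZE - MIN_FONT_SIZE) / (hi - lo)
            + MIN_FONT_SIZE)

# The extremes map to the extremes either way...
print(weight_linear(1, 1, 1000), weight_log(1, 1, 1000))
# ...but a mid-tail term of frequency 100 renders at a much larger
# font under the log weighting (about 14.3 versus about 4.7)
print(weight_linear(100, 1, 1000), weight_log(100, 1, 1000))
```

With a power-law-shaped frequency distribution like the ones in this chapter, the linear squash renders almost every term at the minimum size, which is exactly the situation the logarithmic variant addresses.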
The tagcloud_template.html file that's referenced in Example 4-17 is fairly uninteresting and is available with this book's source code on GitHub; it is nothing more than a simple adaptation from the stock example that comes with the tag cloud's source code. Some script tags in the head of the page take care of all of the heavy lifting, and all you have to do is feed some data into a makeshift template, which simply uses string substitution to replace the %s placeholder.

Figures 4-6 and 4-7 display tag clouds for tweet entities with a frequency greater than 30 that co-occur with #JustinBieber and #TeaParty. The most obvious difference between them is how crowded the #TeaParty tag cloud is compared to the #JustinBieber cloud. The gist of the other topics associated with the query terms is also readily apparent. We already knew this from Figure 4-5, but a tag cloud conveys a similar gist and provides useful interactive capabilities to boot. Of course, there's also nothing stopping you from creating interactive Ajax charts with tools such as Google Chart Tools.

Figure 4-6. An interactive 3D tag cloud for tweet entities co-occurring with #JustinBieber

Visualizing Community Structures in Twitter Search Results

We briefly compared #JustinBieber and #TeaParty queries earlier in this chapter, and this section takes that analysis a step further by introducing a couple of visualizations from a slightly different angle. Let's take a stab at visualizing the community structures of #TeaParty and #JustinBieber Twitterers by taking the search results we've previously collected, computing friendships among the tweet authors and other entities (such as @mentions and #hashtags) appearing in those tweets, and visualizing those connections. In addition to yielding extra insights into our example, these techniques also provide useful starting points for other situations in which you have an interesting juxtaposition in mind.
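The connectedness comparisons in this section ultimately rest on two simple measures: node degree and graph density, where the density of an undirected graph is 2E / (N(N-1)), the fraction of possible edges that actually exist (this is also the formula networkx uses for undirected graphs). A dependency-free sketch over a handful of hypothetical friendship pairs:

```python
def graph_stats(edges):
    """Node count, edge count, per-node degree, and density (2E / (N(N-1)))
    for an undirected graph given as a collection of friendship pairs."""
    degree = {}
    for a, b in edges:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    n, e = len(degree), len(edges)
    density = 2.0 * e / (n * (n - 1)) if n > 1 else 0.0
    return n, e, degree, density

# Hypothetical friendships among four users
edges = [('a', 'b'), ('a', 'c'), ('b', 'c'), ('c', 'd')]
n, e, degree, density = graph_stats(edges)
print(n, e, sorted(degree.values()), round(density, 3))  # 4 4 [1, 2, 2, 3] 0.667
```

Plugging in a graph of 2,262 nodes and 129,812 edges, for instance, gives a density of about 0.0508, so even without drawing anything you can quantify how close a community is to being fully connected.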
The code listing for this workflow won't be shown here because it's easily created by recycling code you've already seen earlier in this chapter and in previous chapters. The high-level steps involved include:

• Computing the set of screen names for the unique tweet authors and user mentions associated with the search-teaparty and search-justinbieber CouchDB databases
• Harvesting the friend IDs for the screen names with Twitter's /friends/ids resource
• Resolving screen names for the friend IDs with Twitter's /users/lookup resource (recall that there's not a direct way to look up screen names for friends; ID values must be collected and then resolved)
• Constructing a networkx.Graph by walking over the friendships and creating an edge between two nodes where a friendship exists in either direction
• Analyzing and visualizing the graphs

Figure 4-7. An interactive 3D tag cloud for tweet entities co-occurring with #TeaParty

The result of the script is a pickled graph file that you can open up in the interpreter and poke around at, as illustrated in the interpreter session in Example 4-18. Because the script is essentially just bits and pieces of recycled logic from earlier examples, it's not included here in the text, but is available online. The analysis is generated from running the script on approximately 3,000 tweets for each of the #JustinBieber and #TeaParty searches. The output from calls to nx.degree, which returns the degree of each node in the graph, is omitted and rendered visually as a simple column chart in Figure 4-8. (The degree of a node in an undirected graph is its total number of edges.)

Example 4-18.
Using the interpreter to perform ad-hoc graph analysis

>>> import networkx as nx
>>> teaparty = nx.read_gpickle("out/search-teaparty.gpickle")
>>> justinbieber = nx.read_gpickle("out/search-justinbieber.gpickle")
>>> teaparty.number_of_nodes(), teaparty.number_of_edges()
(2262, 129812)
>>> nx.density(teaparty)
0.050763513558431887
>>> sorted(nx.degree(teaparty))
... output omitted ...
>>> justinbieber.number_of_nodes(), justinbieber.number_of_edges()
(1130, 1911)
>>> nx.density(justinbieber)
0.0029958378077553165
>>> sorted(nx.degree(justinbieber))
... output omitted ...

Figure 4-8. A simple way to compare the "connectedness" of graphs is to plot out the sorted degree of each node and overlay the curves

Without any visual representation at all, the results from the interpreter session are very telling. In fact, it's critical to be able to reason about these kinds of numbers without seeing a visualization because, more often than not, you simply won't be able to easily visualize complex graphs without a large time investment. The short story is that the density and the ratio of edges to nodes (friendships per user) in the #TeaParty graph vastly outstrip those for the #JustinBieber graph. However, that observation in and of itself may not be the most surprising thing. The surprising thing is the incredibly high connectedness of the nodes in the #TeaParty graph. Granted, #TeaParty is clearly an intense political movement with a relatively small following, whereas #JustinBieber is a much less concentrated entertainment topic with international appeal. Still, the overall distribution and shape of the curve for the #TeaParty results provide an intuitive and tangible view of just how well connected the #TeaParty folks really are.
A 2D visual representation of the graph doesn't provide too much additional information, but the suspense is probably killing you by now, so without further ado: Figure 4-9.

Recall that *nix users can write out Graphviz output with nx.drawing.write_dot, but Windows users may need to manually generate the DOT language output. For large undirected graphs, you'll find Graphviz's SFDP (scalable force-directed placement) engine invaluable. Sample usage for how you might invoke it from a command line to produce a useful layout for a dense graph follows:

$ sfdp -Tpng -Oteaparty -Nfixedsize=true -Nlabel='' -Nstyle=filled \
    -Nfillcolor=red -Nheight=0.1 -Nwidth=0.1 -Nshape=circle \
    -Goutputorder=edgesfirst -Gratio=fill -Gsize='10!' -Goverlap=prism teaparty.dot

Figure 4-9. A 2D representation showcasing the connectedness of #JustinBieber (left) and #TeaParty (right) Twitterers; each edge represents a friendship that exists in at least one direction

Closing Remarks

There are a ridiculous number of obvious things you can do with tweet data, and if you get the least bit creative, the possibilities become beyond ridiculous. We've barely scratched the surface in this chapter. An entire book could quite literally be devoted to systematically working through more of the possibilities, and many small businesses focused around tweet analytics could potentially turn a fairly healthy profit by selling answers to certain classes of ad-hoc queries that customers could make. Here are interesting ideas that you could pursue:

• Define a similarity metric and use it to compare or further analyze two Twitter users or groups of users. You'll essentially be developing a profile and then measuring how well a particular user fits that profile. For example, do certain hashtags, keywords (e.g., LOL, OMG, etc.), or similar metrics appear in #JustinBieber tweets more than in more intellectual tweets like ones associated with #gov20?
This question might seem obvious, but what interesting things might you learn along the way while calculating a quantifiable answer?
• Which celebrity Twitterers have a very low concentration of tweet entities yet a high tweet volume? Does this make them ramblers—you know, those folks who tweet 100 times a day about nothing in particular? Conversely, which celebrity Twitterers have a high concentration of hashtags in their tweets? What obvious comparisons can you make between the two groups?
• If you have a Twitter account and a reasonable number of tweets, consider analyzing yourself and comparing your tweet stream to those of other users. How similar are you to various Twitterers along a particular spectrum?
• We didn't broach the subject of Twitter's lists API. Can you profile various members of a list and try to discover latent social networks? For example, can you find a metric that allows you to approximate political boundaries between followers of the @timoreilly/healthcare and @timoreilly/gov20 lists?
• It might be interesting to analyze the volume or content of tweets by time of day. This is a pretty simple exercise and could incorporate the SIMILE Timeline—discussed previously—for easy visualization.
• Near real-time communication is increasingly becoming the norm, and Twitter provides a streaming API that offers convenient access to much of its "fire hose." See the online streaming documentation and consider checking out tweepy as a terrific streaming client.

Would You Like to Read More?

Visit our website to purchase the full version of Mining the Social Web.

PART IV
Planning for Big Data

Introduction

In February 2011, over 1,300 people came together for the inaugural O'Reilly Strata Conference in Santa Clara, California.
Though representing diverse fields, from insurance to media and high-tech to healthcare, attendees buzzed with a new-found common identity: they were data scientists. Entrepreneurial and resourceful, combining programming skills with math, data scientists have emerged as a new profession leading the march toward data-driven business.

This new profession rides on the wave of big data. Our businesses are creating ever more data, and as consumers we are sources of massive streams of information, thanks to social networks and smartphones. In this raw material lies much of value: insight about businesses and markets, and the scope to create new kinds of hyper-personalized products and services.

Five years ago, only big business could afford to profit from big data: Walmart and Google, specialized financial traders. Today, thanks to an open source project called Hadoop, commodity Linux hardware and cloud computing, this power is in reach for everyone. A data revolution is sweeping business, government and science, with consequences as far-reaching and long-lasting as the web itself.

Every revolution has to start somewhere, and the question for many is "how can data science and big data help my organization?" After years of data processing choices being straightforward, there's now a diverse landscape to negotiate. What's more, to become data-driven, you must grapple with changes that are cultural as well as technological.

The aim of this book is to help you understand what big data is, why it matters, and where to get started. If you're already working with big data, hand this book to your colleagues or executives to help them better appreciate the issues and possibilities.

In addition to my own writing, I am grateful to my fellow O'Reilly Radar authors for contributing articles: Alistair Croll, Julie Steele and Mike Loukides.
—Edd Dumbill
Program Chair, O'Reilly Strata Conference, February 2012

The Feedback Economy

Alistair Croll

Military strategist John Boyd spent a lot of time understanding how to win battles. Building on his experience as a fighter pilot, he broke down the process of observing and reacting into something called an Observe, Orient, Decide, and Act (OODA) loop. Combat, he realized, consisted of observing your circumstances, orienting yourself to your enemy's way of thinking and your environment, deciding on a course of action, and then acting on it.

The most important part of this loop isn't included in the OODA acronym, however. It's the fact that it's a loop. The results of earlier actions feed back into later, hopefully wiser, ones. Over time, the fighter "gets inside" their opponent's loop, outsmarting and outmaneuvering them. The system learns.

Boyd's genius was to realize that winning requires two things: being able to collect and analyze information better, and being able to act on that information faster, incorporating what's learned into the next iteration. Today, what Boyd learned in a cockpit applies to nearly everything we do.

Data-Obese, Digital-Fast

In our always-on lives we're flooded with cheap, abundant information. We need to capture and analyze it well, separating digital wheat from digital chaff, identifying meaningful undercurrents while ignoring meaningless social flotsam. Clay Johnson argues that we need to go on an information diet, and makes a good case for conscious consumption. In an era of information obesity, we need to eat better. There's a reason they call it a feed, after all.

It's not just an overabundance of data that makes Boyd's insights vital. In the last 20 years, much of human interaction has shifted from atoms to bits. When interactions become digital, they become instantaneous, interactive, and easily copied.
It's as easy to tell the world as to tell a friend, and a day's shopping is reduced to a few clicks. The move from atoms to bits reduces the coefficient of friction of entire industries to zero. Teenagers shun e-mail as too slow, opting for instant messages. The digitization of our world means that trips around the OODA loop happen faster than ever, and continue to accelerate.

We're drowning in data. Bits are faster than atoms. Our jungle-surplus wetware can't keep up. At least, not without Boyd's help. In a society where every person, tethered to their smartphone, is both a sensor and an end node, we need better ways to observe and orient, whether we're at home or at work, solving the world's problems or planning a play date. And we need to be constantly deciding, acting, and experimenting, feeding what we learn back into future behavior. We're entering a feedback economy.

The Big Data Supply Chain

Consider how a company collects, analyzes, and acts on data. Let's look at these components in order.

Data collection

The first step in a data supply chain is to get the data in the first place. Information comes in from a variety of sources, both public and private. We're a promiscuous society online, and with the advent of low-cost data marketplaces, it's possible to get nearly any nugget of data relatively affordably. From social network sentiment, to weather reports, to economic indicators, public information is grist for the big data mill. Alongside this, we have organization-specific data such as retail traffic, call center volumes, product recalls, or customer loyalty indicators.

The legality of collection is often a bigger constraint than obtaining the data in the first place. Some data is heavily regulated—HIPAA governs healthcare, while PCI restricts financial transactions. In other cases, the act of combining data may be illegal because it generates personally identifiable information (PII).
For example, courts have ruled differently on whether IP addresses are PII, and the California Supreme Court has ruled that zip codes are. Navigating these regulations imposes some serious constraints on what can be collected and how it can be combined.

The era of ubiquitous computing means that everyone is a potential source of data, too. A modern smartphone can sense light, sound, motion, location, nearby networks and devices, and more, making it a perfect data collector. As consumers opt into loyalty programs and install applications, they become sensors that can feed the data supply chain.

In big data, the collection is often challenging because of the sheer volume of information, or the speed with which it arrives, both of which demand new approaches and architectures.

Ingesting and cleaning

Once the data is collected, it must be ingested. In traditional business intelligence (BI) parlance, this is known as Extract, Transform, and Load (ETL): the act of putting the right information into the correct tables of a database schema and manipulating certain fields to make them easier to work with.

One of the distinguishing characteristics of big data, however, is that the data is often unstructured. That means we don't know the inherent schema of the information before we start to analyze it. We may still transform the information—replacing an IP address with the name of a city, for example, or anonymizing certain fields with a one-way hash function—but we may hold onto the original data and only define its structure as we analyze it.

Hardware

The information we've ingested needs to be analyzed by people and machines. That means hardware, in the form of computing, storage, and networks. Big data doesn't change this, but it does change how it's used. Virtualization, for example, allows operators to spin up many machines temporarily, then destroy them once the processing is over. Cloud computing is also a boon to big data.
Paying by consumption destroys the barriers to entry that would prohibit many organizations from playing with large datasets, because there's no up-front investment. In many ways, big data gives clouds something to do.

Platforms

Where big data is new is in the platforms and frameworks we create to crunch large amounts of information quickly. One way to speed up data analysis is to break the data into chunks that can be analyzed in parallel. Another is to build a pipeline of processing steps, each optimized for a particular task.

Big data is often about fast results, rather than simply crunching a large amount of information. That's important for two reasons:

1. Much of the big data work going on today is related to user interfaces and the web. Suggesting what books someone will enjoy, or delivering search results, or finding the best flight, requires an answer in the time it takes a page to load. The only way to accomplish this is to spread out the task, which is one of the reasons why Google has nearly a million servers.

2. We analyze unstructured data iteratively. As we first explore a dataset, we don't know which dimensions matter. What if we segment by age? Filter by country? Sort by purchase price? Split the results by gender? This kind of "what if" analysis is exploratory in nature, and analysts are only as productive as their ability to explore freely. Big data may be big. But if it's not fast, it's unintelligible.

Much of the hype around big data companies today is a result of the retooling of enterprise BI. For decades, companies have relied on structured relational databases and data warehouses—many of them can't handle the exploration, lack of structure, speed, and massive sizes of big data applications.

Machine learning

One way to think about big data is that it's "more data than you can go through by hand." For much of the data we want to analyze today, we need a machine's help.
Part of that help happens at ingestion. For example, natural language processing tries to read unstructured text and deduce what it means: Was this Twitter user happy or sad? Is this call center recording good, or was the customer angry?

Machine learning is important elsewhere in the data supply chain. When we analyze information, we're trying to find signal within the noise, to discern patterns. Humans can't find signal well by themselves. Just as astronomers use algorithms to scan the night sky for signals, then verify any promising anomalies themselves, so too can data analysts use machines to find interesting dimensions, groupings, or patterns within the data. Machines can work at a lower signal-to-noise ratio than people.

Human exploration

While machine learning is an important tool to the data analyst, there's no substitute for human eyes and ears. Displaying the data in human-readable form is hard work, stretching the limits of multi-dimensional visualization. While most analysts work with spreadsheets or simple query languages today, that's changing.

Creve Maples, an early advocate of better computer interaction, designs systems that take dozens of independent data sources and display them in navigable 3D environments, complete with sound and other cues. Maples' studies show that when we feed an analyst data in this way, they can often find answers in minutes instead of months. This kind of interactivity requires the speed and parallelism explained above, as well as new interfaces and multi-sensory environments that allow an analyst to work alongside the machine, immersed in the data.

Storage

Big data takes a lot of storage. In addition to the actual information in its raw form, there's the transformed information; the virtual machines used to crunch it; the schemas and tables resulting from analysis; and the many formats that legacy tools require so they can work alongside new technology.
Often, storage is a combination of cloud and on-premise storage, using traditional flat-file and relational databases alongside more recent, post-SQL storage systems. During and after analysis, the big data supply chain needs a warehouse. Comparing year-on-year progress or changes over time means we have to keep copies of everything, along with the algorithms and queries with which we analyzed it.

Sharing and acting

All of this analysis isn't much good if we can't act on it. As with collection, this isn't simply a technical matter—it involves legislation, organizational politics, and a willingness to experiment. The data might be shared openly with the world, or closely guarded.

The best companies tie big data results into everything from hiring and firing decisions, to strategic planning, to market positioning. While it's easy to buy into big data technology, it's far harder to shift an organization's culture. In many ways, big data adoption isn't a hardware retirement issue, it's an employee retirement one.

We've seen similar resistance to change each time there's a big change in information technology. Mainframes, client-server computing, packet-based networks, and the web all had their detractors. A NASA study into the failure of Ada, an early object-oriented language, concluded that proponents had over-promised, and there was a lack of a supporting ecosystem to help the new language flourish. Big data, and its close cousin, cloud computing, are likely to encounter similar obstacles.

A big data mindset is one of experimentation, of taking measured risks and assessing their impact quickly. It's similar to the Lean Startup movement, which advocates fast, iterative learning and tight links to customers. But while a small startup can be lean because it's nascent and close to its market, a big organization needs big data and an OODA loop to react well and iterate fast. The big data supply chain is the organizational OODA loop.
It's the big business answer to the lean startup.

Measuring and collecting feedback

Just as John Boyd's OODA loop is mostly about the loop, so big data is mostly about feedback. Simply analyzing information isn't particularly useful. To work, the organization has to choose a course of action from the results, then observe what happens and use that information to collect new data or analyze things in a different way. It's a process of continuous optimization that affects every facet of a business.

Replacing Everything with Data

Software is eating the world. Verticals like publishing, music, real estate and banking once had strong barriers to entry. Now they've been entirely disrupted by the elimination of middlemen. The last film projector rolled off the line in 2011: movies are now digital from camera to projector. The Post Office stumbles because nobody writes letters, even as Federal Express becomes the planet's supply chain.

Companies that get themselves on a feedback footing will dominate their industries, building better things faster for less money. Those that don't are already the walking dead, and will soon be little more than case studies and colorful anecdotes. Big data, new interfaces, and ubiquitous computing are tectonic shifts in the way we live and work.

A Feedback Economy

Big data, continuous optimization, and replacing everything with data pave the way for something far larger, and far more important, than simple business efficiency. They usher in a new era for humanity, with all its warts and glory. They herald the arrival of the feedback economy.

The efficiencies and optimizations that come from constant, iterative feedback will soon become the norm for businesses and governments. We're moving beyond an information economy. Information on its own isn't an advantage, anyway. Instead, this is the era of the feedback economy, and Boyd is, in many ways, the first feedback economist.
Alistair Croll is the founder of Bitcurrent, a research firm focused on emerging technologies. He's founded a variety of startups and technology accelerators, including Year One Labs, CloudOps, Rednod, Coradiant (acquired by BMC in 2011) and Networkshop. He's a frequent speaker and writer on subjects such as entrepreneurship, cloud computing, Big Data, Internet performance and web technology, and has helped launch a number of major conferences on these topics.

What Is Big Data?

Edd Dumbill

Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn't fit the strictures of your database architectures. To gain value from this data, you must choose an alternative way to process it.

The hot IT buzzword of 2012, big data has become viable as cost-effective approaches have emerged to tame the volume, velocity and variability of massive data. Within this data lie valuable patterns and information, previously hidden because of the amount of work required to extract them. To leading corporations, such as Walmart or Google, this power has been in reach for some time, but at fantastic cost. Today's commodity hardware, cloud architectures and open source software bring big data processing within the reach of the less well-resourced. Big data processing is eminently feasible even for small garage startups, which can cheaply rent server time in the cloud.

The value of big data to an organization falls into two categories: analytical use, and enabling new products. Big data analytics can reveal insights previously hidden by data too costly to process, such as peer influence among customers, revealed by analyzing shoppers' transactions and social and geographical data.
Being able to process every item of data in reasonable time removes the troublesome need for sampling and promotes an investigative approach to data, in contrast to the somewhat static nature of running predetermined reports.

The past decade's successful web startups are prime examples of big data used as an enabler of new products and services. For example, by combining a large number of signals from a user's actions and those of their friends, Facebook has been able to craft a highly personalized user experience and create a new kind of advertising business. It's no coincidence that the lion's share of ideas and tools underpinning big data have emerged from Google, Yahoo, Amazon and Facebook.

The emergence of big data into the enterprise brings with it a necessary counterpart: agility. Successfully exploiting the value in big data requires experimentation and exploration. Whether creating new products or looking for ways to gain competitive advantage, the job calls for curiosity and an entrepreneurial outlook.

What Does Big Data Look Like?

As a catch-all term, "big data" can be pretty nebulous, in the same way that the term "cloud" covers diverse technologies. Input data to big data systems could be chatter from social networks, web server logs, traffic flow sensors, satellite imagery, broadcast audio streams, banking transactions, MP3s of rock music, the content of web pages, scans of government documents, GPS trails, telemetry from automobiles, financial market data, the list goes on. Are these all really the same thing?

To clarify matters, the three Vs of volume, velocity and variety are commonly used to characterize different aspects of big data. They're a helpful lens through which to view and understand the nature of the data and the software platforms available to exploit it. Most probably you will contend with each of the Vs to one degree or another.
Volume

The benefit gained from the ability to process large amounts of information is the main attraction of big data analytics. Having more data beats out having better models: simple bits of math can be unreasonably effective given large amounts of data. If you could run that forecast taking into account 300 factors rather than 6, could you predict demand better?

This volume presents the most immediate challenge to conventional IT structures. It calls for scalable storage, and a distributed approach to querying. Many companies already have large amounts of archived data, perhaps in the form of logs, but not the capacity to process it.

Assuming that the volumes of data are larger than those conventional relational database infrastructures can cope with, processing options break down broadly into a choice between massively parallel processing architectures—data warehouses or databases such as Greenplum—and Apache Hadoop-based solutions. This choice is often informed by the degree to which one of the other "Vs"—variety—comes into play. Typically, data warehousing approaches involve predetermined schemas, suiting a regular and slowly evolving dataset. Apache Hadoop, on the other hand, places no conditions on the structure of the data it can process.

At its core, Hadoop is a platform for distributing computing problems across a number of servers. First developed and released as open source by Yahoo, it implements the MapReduce approach pioneered by Google in compiling its search indexes. Hadoop's MapReduce involves distributing a dataset among multiple servers and operating on the data: the "map" stage. The partial results are then recombined: the "reduce" stage.

To store data, Hadoop utilizes its own distributed filesystem, HDFS, which makes data available to multiple computing nodes. A typical Hadoop usage pattern involves three stages:

• loading data into HDFS,
• MapReduce operations, and
• retrieving results from HDFS.
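The shape of the map and reduce stages can be imitated in a single process. The sketch below counts words over invented input chunks; a real Hadoop job would distribute the map calls across servers and shuffle the intermediate pairs between the two stages:

```python
# Sketch of MapReduce-style word counting, simulated in-process.
from collections import defaultdict

def map_stage(document):
    # "map": emit an intermediate (word, 1) pair for each word in one chunk
    return [(word, 1) for word in document.split()]

def reduce_stage(pairs):
    # "reduce": recombine the partial results by key
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

documents = ["big data is big", "data moves fast"]  # invented input chunks
intermediate = [pair for doc in documents for pair in map_stage(doc)]
counts = reduce_stage(intermediate)
print(counts)
```

Because each map call touches only its own chunk, the map work parallelizes trivially, which is the property the Hadoop runtime exploits.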
This process is by nature a batch operation, suited for analytical or non-interactive computing tasks. Because of this, Hadoop is not itself a database or data warehouse solution, but can act as an analytical adjunct to one.

One of the most well-known Hadoop users is Facebook, whose model follows this pattern. A MySQL database stores the core data. This is then reflected into Hadoop, where computations occur, such as creating recommendations for you based on your friends' interests. Facebook then transfers the results back into MySQL, for use in pages served to users.

Velocity

The importance of data's velocity—the increasing rate at which data flows into an organization—has followed a similar pattern to that of volume. Problems previously restricted to segments of industry are now presenting themselves in a much broader setting. Specialized companies such as financial traders have long turned systems that cope with fast-moving data to their advantage. Now it's our turn.

Why is that so? The Internet and mobile era means that the way we deliver and consume products and services is increasingly instrumented, generating a data flow back to the provider. Online retailers are able to compile large histories of customers' every click and interaction: not just the final sales. Those who are able to quickly utilize that information, by recommending additional purchases, for instance, gain competitive advantage. The smartphone era increases again the rate of data inflow, as consumers carry with them a streaming source of geolocated imagery and audio data.

It's not just the velocity of the incoming data that's the issue: it's possible to stream fast-moving data into bulk storage for later batch processing, for example. The importance lies in the speed of the feedback loop, taking data from input through to decision.
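Keeping that loop tight often means analyzing readings as they arrive rather than waiting to batch them. A minimal sketch of one streaming-friendly trick is a running mean, which keeps a constant-size summary (a count and the current mean) instead of storing every reading; the readings below are invented:

```python
# Sketch: a running mean updated one reading at a time, so an unbounded
# stream needs only constant storage rather than the full history.
class RunningMean:
    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def update(self, x):
        self.n += 1
        self.mean += (x - self.mean) / self.n  # incremental mean update

stream = [3.0, 5.0, 4.0, 8.0]  # stand-in for a live sensor feed
stats = RunningMean()
for reading in stream:
    stats.update(reading)
print(stats.n, stats.mean)  # 4 5.0
```

Real streaming frameworks generalize this idea to windowed counts, joins, and aggregations over keyed streams.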
A commercial from IBM makes the point that you wouldn't cross the road if all you had was a five-minute-old snapshot of traffic location. There are times when you simply won't be able to wait for a report to run or a Hadoop job to complete.

Industry terminology for such fast-moving data tends to be either "streaming data" or "complex event processing." The latter term was more established in product categories before streaming data processing gained more widespread relevance, and seems likely to diminish in favor of streaming.

There are two main reasons to consider streaming processing. The first is when the input data are too fast to store in their entirety: in order to keep storage requirements practical, some level of analysis must occur as the data streams in. At the extreme end of the scale, the Large Hadron Collider at CERN generates so much data that scientists must discard the overwhelming majority of it—hoping hard they've not thrown away anything useful. The second reason to consider streaming is where the application mandates immediate response to the data. Thanks to the rise of mobile applications and online gaming, this is an increasingly common situation.

Product categories for handling streaming data divide into established proprietary products such as IBM's InfoSphere Streams, and the less-polished and still emergent open source frameworks originating in the web industry: Twitter's Storm and Yahoo's S4.

As mentioned above, it's not just about input data. The velocity of a system's outputs can matter too. The tighter the feedback loop, the greater the competitive advantage. The results might go directly into a product, such as Facebook's recommendations, or into dashboards used to drive decision-making. It's this need for speed, particularly on the web, that has driven the development of key-value stores and columnar databases, optimized for the fast retrieval of precomputed information.
These databases form part of an umbrella category known as NoSQL, used when relational models aren't the right fit.

Variety

Rarely does data present itself in a form perfectly ordered and ready for processing. A common theme in big data systems is that the source data is diverse, and doesn't fall into neat relational structures. It could be text from social networks, image data, or a raw feed directly from a sensor source. None of these things come ready for integration into an application.

Even on the web, where computer-to-computer communication ought to bring some guarantees, the reality of data is messy. Different browsers send different data, users withhold information, and they may be using differing software versions or vendors to communicate with you. And you can bet that if part of the process involves a human, there will be error and inconsistency.

A common use of big data processing is to take unstructured data and extract ordered meaning, for consumption either by humans or as a structured input to an application. One such example is entity resolution, the process of determining exactly what a name refers to. Is this city London, England, or London, Texas? By the time your business logic gets to it, you don't want to be guessing.

The process of moving from source data to processed application data involves the loss of information. When you tidy up, you end up throwing stuff away. This underlines a principle of big data: when you can, keep everything. There may well be useful signals in the bits you throw away. If you lose the source data, there's no going back.

Despite the popularity and well-understood nature of relational databases, it is not the case that they should always be the destination for data, even when tidied up. Certain data types suit certain classes of database better. For instance, documents encoded as XML are most versatile when stored in a dedicated XML store such as MarkLogic.
Social network relations are graphs by nature, and graph databases such as Neo4J make operations on them simpler and more efficient.

Even where there's not a radical data type mismatch, a disadvantage of the relational database is the static nature of its schemas. In an agile, exploratory environment, the results of computations will evolve with the detection and extraction of more signals. Semi-structured NoSQL databases meet this need for flexibility: they provide enough structure to organize data, but do not require the exact schema of the data before storing it.

In Practice

We have explored the nature of big data, and surveyed the landscape of big data from a high level. As usual, when it comes to deployment there are dimensions to consider over and above tool selection.

Cloud or in-house?

The majority of big data solutions are now provided in three forms: software-only, as an appliance, or cloud-based. Decisions about which route to take will depend, among other things, on issues of data locality, privacy and regulation, human resources and project requirements. Many organizations opt for a hybrid solution: using on-demand cloud resources to supplement in-house deployments.

Big data is big

It is a fundamental fact that data that is too big to process conventionally is also too big to transport anywhere. IT is undergoing an inversion of priorities: it's the program that needs to move, not the data. If you want to analyze data from the U.S. Census, it's a lot easier to run your code on Amazon's web services platform, which hosts such data locally, and won't cost you time or money to transfer it.

Even if the data isn't too big to move, locality can still be an issue, especially with rapidly updating data. Financial trading systems crowd into data centers to get the fastest connection to source data, because that millisecond difference in processing time equates to competitive advantage.
Big data is messy

It's not all about infrastructure. Big data practitioners consistently report that 80% of the effort involved in dealing with data is cleaning it up in the first place, as Pete Warden observes in his Big Data Glossary: "I probably spend more time turning messy source data into something usable than I do on the rest of the data analysis process combined."

Because of the high cost of data acquisition and cleaning, it's worth considering what you actually need to source yourself. Data marketplaces are a means of obtaining common data, and you are often able to contribute improvements back. Quality can of course be variable, but will increasingly be a benchmark on which data marketplaces compete.

Culture

The phenomenon of big data is closely tied to the emergence of data science, a discipline that combines math, programming, and scientific instinct. Benefiting from big data means investing in teams with this skillset, and surrounding them with an organizational willingness to understand and use data for advantage.

In his report, "Building Data Science Teams," D.J. Patil characterizes data scientists as having the following qualities:

• Technical expertise: the best data scientists typically have deep expertise in some scientific discipline.

• Curiosity: a desire to go beneath the surface and discover and distill a problem down into a very clear set of hypotheses that can be tested.

• Storytelling: the ability to use data to tell a story and to be able to communicate it effectively.

• Cleverness: the ability to look at a problem in different, creative ways.

The far-reaching nature of big data analytics projects can have uncomfortable aspects: data must be broken out of silos in order to be mined, and the organization must learn how to communicate and interpret the results of analysis.
Those skills of storytelling and cleverness are the gateway factors that ultimately dictate whether the benefits of analytical labors are absorbed by an organization. The art and practice of visualizing data is becoming ever more important in bridging the human-computer gap to mediate analytical insight in a meaningful way.

Know where you want to go

Finally, remember that big data is no panacea. You can find patterns and clues in your data, but then what? Christer Johnson, IBM's leader for advanced analytics in North America, gives this advice to businesses starting out with big data: first, decide what problem you want to solve. If you pick a real business problem, such as how you can change your advertising strategy to increase spend per customer, it will guide your implementation. While big data work benefits from an enterprising spirit, it also benefits strongly from a concrete goal.

Edd Dumbill is a technologist, writer and programmer based in California. He is the program chair for the O'Reilly Strata and Open Source Convention conferences.

Apache Hadoop

Edd Dumbill

Apache Hadoop has been the driving force behind the growth of the big data industry. You'll hear it mentioned often, along with associated technologies such as Hive and Pig. But what does it do, and why do you need all its strangely named friends, such as Oozie, Zookeeper and Flume?

Hadoop brings the ability to cheaply process large amounts of data, regardless of its structure. By large, we mean from 10-100 gigabytes and above. How is this different from what went before?

Existing enterprise data warehouses and relational databases excel at processing structured data, and can store massive amounts of data, though at cost. However, this requirement for structure restricts the kinds of data that can be processed, and it imposes an inertia that makes data warehouses unsuited for agile exploration of massive heterogeneous data.
The amount of effort required to warehouse data often means that valuable data sources in organizations are never mined. This is where Hadoop can make a big difference. This article examines the components of the Hadoop ecosystem and explains the functions of each.

The Core of Hadoop: MapReduce

Created at Google in response to the problem of building web search indexes, the MapReduce framework is the powerhouse behind most of today's big data processing. In addition to Hadoop, you'll find MapReduce inside MPP and NoSQL databases such as Vertica or MongoDB.

The important innovation of MapReduce is the ability to take a query over a dataset, divide it, and run it in parallel over multiple nodes. Distributing the computation solves the issue of data too large to fit onto a single machine. Combine this technique with commodity Linux servers and you have a cost-effective alternative to massive computing arrays.

At its core, Hadoop is an open source MapReduce implementation. Funded by Yahoo, it emerged in 2006 and, according to its creator Doug Cutting, reached "web scale" capability in early 2008.

As the Hadoop project matured, it acquired further components to enhance its usability and functionality. The name "Hadoop" has come to represent this entire ecosystem. There are parallels with the emergence of Linux: the name refers strictly to the Linux kernel, but it has gained acceptance as referring to a complete operating system.

Hadoop's Lower Levels: HDFS and MapReduce

We discussed above the ability of MapReduce to distribute computation over multiple servers. For that computation to take place, each server must have access to the data. This is the role of HDFS, the Hadoop Distributed File System.

HDFS and MapReduce are robust. Servers in a Hadoop cluster can fail without aborting the computation process. HDFS ensures data is replicated with redundancy across the cluster. On completion of a calculation, a node will write its results back into HDFS.
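The divide-and-run-in-parallel pattern can be sketched in miniature. The snippet below simulates the classic word-count MapReduce job in a single Python process to show the data flow through the map, shuffle, and reduce phases; it is an illustration of the pattern, not Hadoop code, and the function names are invented.

```python
from collections import defaultdict

# Single-process sketch of the MapReduce pattern: the map step emits
# (key, value) pairs, the shuffle groups them by key, and the reduce
# step combines each group. Hadoop runs the same three phases spread
# across many machines.

def map_phase(document):
    """Emit (word, 1) for every word in the input."""
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine the values for one key: for word count, just sum them."""
    return (key, sum(values))

docs = ["big data is big", "data moves fast"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'big': 2, 'data': 2, 'is': 1, 'moves': 1, 'fast': 1}
```

Because each document is mapped independently and each key is reduced independently, both phases parallelize naturally across machines, which is exactly what Hadoop exploits.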
There are no restrictions on the data that HDFS stores. Data may be unstructured and schemaless. By contrast, relational databases require that data be structured and schemas be defined before storing the data. With HDFS, making sense of the data is the responsibility of the developer's code.

Programming Hadoop at the MapReduce level is a case of working with the Java APIs, and manually loading data files into HDFS.

Improving Programmability: Pig and Hive

Working directly with Java APIs can be tedious and error prone. It also restricts usage of Hadoop to Java programmers. Hadoop offers two solutions for making Hadoop programming easier.

• Pig is a programming language that simplifies the common tasks of working with Hadoop: loading data, expressing transformations on the data, and storing the final results. Pig's built-in operations can make sense of semi-structured data, such as log files, and the language is extensible using Java to add support for custom data types and transformations.

• Hive enables Hadoop to operate as a data warehouse. It superimposes structure on data in HDFS, and then permits queries over the data using a familiar SQL-like syntax. As with Pig, Hive's core capabilities are extensible.

Choosing between Hive and Pig can be confusing. Hive is more suitable for data warehousing tasks, with predominantly static structure and the need for frequent analysis. Hive's closeness to SQL makes it an ideal point of integration between Hadoop and other business intelligence tools.

Pig gives the developer more agility for the exploration of large datasets, allowing the development of succinct scripts for transforming data flows for incorporation into larger applications. Pig is a thinner layer over Hadoop than Hive, and its main advantage is to drastically cut the amount of code needed compared to direct use of Hadoop's Java APIs. As such, Pig's intended audience remains primarily the software developer.
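Java is not the only way in: Hadoop also ships a Streaming utility that lets any executable reading stdin and writing stdout act as a mapper or reducer, which is how Python and other scripting languages are commonly used with Hadoop. Below is a minimal word-count pair in that style; it is a sketch only, using the Streaming convention of tab-separated key/value lines with reducer input sorted by key.

```python
import sys

# Word count in Hadoop Streaming style: the framework pipes raw lines
# to the mapper's stdin, sorts the mapper's "key<TAB>value" output by
# key, and pipes the sorted lines to the reducer. This file plays
# either role depending on its first argument.

def mapper(lines):
    """Emit one 'word<TAB>1' line per word."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

def reducer(lines):
    """Sum counts per word; input is assumed sorted by key."""
    current, total = None, 0
    for line in lines:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield f"{current}\t{total}"

if __name__ == "__main__" and len(sys.argv) > 1:
    phase = mapper if sys.argv[1] == "map" else reducer
    for out in phase(sys.stdin):
        print(out)
```

On a cluster, the pair is wired together with Hadoop's streaming jar; the exact invocation varies by version. Pig and Hive generate this kind of plumbing for you, which is precisely their appeal.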
Improving Data Access: HBase, Sqoop, and Flume

At its heart, Hadoop is a batch-oriented system. Data are loaded into HDFS, processed, and then retrieved. This is somewhat of a computing throwback, and often interactive and random access to data is required.

Enter HBase, a column-oriented database that runs on top of HDFS. Modeled after Google's BigTable, the project's goal is to host billions of rows of data for rapid access. MapReduce can use HBase as both a source and a destination for its computations, and Hive and Pig can be used in combination with HBase.

In order to grant random access to the data, HBase does impose a few restrictions: performance with Hive is 4-5 times slower than plain HDFS, and the maximum amount of data you can store is approximately a petabyte, versus HDFS' limit of over 30PB.

HBase is ill-suited to ad-hoc analytics, and more appropriate for integrating big data as part of a larger application. Use cases include logging, counting, and storing time-series data.

The Hadoop Bestiary

• Ambari: Deployment, configuration and monitoring
• Flume: Collection and import of log and event data
• HBase: Column-oriented database scaling to billions of rows
• HCatalog: Schema and data type sharing over Pig, Hive and MapReduce
• HDFS: Distributed redundant filesystem for Hadoop
• Hive: Data warehouse with SQL-like access
• Mahout: Library of machine learning and data mining algorithms
• MapReduce: Parallel computation on server clusters
• Pig: High-level programming language for Hadoop computations
• Oozie: Orchestration and workflow management
• Sqoop: Imports data from relational databases
• Whirr: Cloud-agnostic deployment of clusters
• Zookeeper: Configuration management and coordination

Getting data in and out

Improved interoperability with the rest of the data world is provided by Sqoop and Flume. Sqoop is a tool designed to import data from relational databases into Hadoop: either directly into HDFS, or into Hive.
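HBase's data model can be pictured as a sorted map of maps: row key, then column family, then column qualifier, down to a value. The toy class below imitates that shape and the range-scan access pattern in plain Python. It is a sketch of the model only; the class and method names are invented and bear no relation to the real HBase client API, which also adds timestamps, regions, and persistence.

```python
# Toy sketch of HBase's data model: a sorted map from row key to
# column-family -> qualifier -> value. Names here are invented for
# illustration; this is not the HBase API.
class ToyColumnStore:
    def __init__(self):
        self.rows = {}

    def put(self, row_key, family, qualifier, value):
        """Store a value under row key, column family, and qualifier."""
        cols = self.rows.setdefault(row_key, {}).setdefault(family, {})
        cols[qualifier] = value

    def get(self, row_key, family, qualifier):
        """Random access to a single cell, or None if absent."""
        return self.rows.get(row_key, {}).get(family, {}).get(qualifier)

    def scan(self, start, stop):
        """Range scan over sorted row keys, HBase's core access pattern."""
        for key in sorted(self.rows):
            if start <= key < stop:
                yield key, self.rows[key]

store = ToyColumnStore()
store.put("log#2012-01-01", "event", "type", "login")
store.put("log#2012-01-02", "event", "type", "purchase")
print(store.get("log#2012-01-01", "event", "type"))  # login
```

Prefixed, sortable row keys like `log#<date>` are what make range scans efficient for the logging and time-series use cases mentioned above.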
Flume is designed to import streaming flows of log data directly into HDFS.

Hive's SQL friendliness means that it can be used as a point of integration with the vast universe of database tools capable of making connections via JDBC or ODBC database drivers.

Coordination and Workflow: Zookeeper and Oozie

With a growing family of services running as part of a Hadoop cluster, there's a need for coordination and naming services. As computing nodes can come and go, members of the cluster need to synchronize with each other, know where to access services, and know how they should be configured. This is the purpose of Zookeeper.

Production systems utilizing Hadoop can often contain complex pipelines of transformations, each with dependencies on the others. For example, the arrival of a new batch of data will trigger an import, which must then trigger recalculations in dependent datasets. The Oozie component provides features to manage the workflow and dependencies, removing the need for developers to code custom solutions.

Management and Deployment: Ambari and Whirr

One of the features commonly added to Hadoop by distributors such as IBM and Microsoft is monitoring and administration. Though in an early stage, Ambari aims to add these features to the core Hadoop project. Ambari is intended to help system administrators deploy and configure Hadoop, upgrade clusters, and monitor services. Through an API, it may be integrated with other system management tools.

Though not strictly part of Hadoop, Whirr is a highly complementary component. It offers a way of running services, including Hadoop, on cloud platforms. Whirr is cloud-neutral, and currently supports the Amazon EC2 and Rackspace services.

Machine Learning: Mahout

Every organization's data are diverse and particular to their needs. However, there is much less diversity in the kinds of analyses performed on that data.
The Mahout project is a library of Hadoop implementations of common analytical computations. Use cases include user collaborative filtering, user recommendations, clustering, and classification.

Using Hadoop

Normally, you will use Hadoop in the form of a distribution. Much as with Linux before it, vendors integrate and test the components of the Apache Hadoop ecosystem, and add in tools and administrative features of their own.

Though not per se a distribution, a managed cloud installation of Hadoop's MapReduce is also available through Amazon's Elastic MapReduce service.

Big Data Market Survey

Edd Dumbill

The big data ecosystem can be confusing. The popularity of "big data" as an industry buzzword has created a broad category. As Hadoop steamrolls through the industry, solutions from the business intelligence and data warehousing fields are also attracting the big data label. To confuse matters, Hadoop-based solutions such as Hive are at the same time evolving toward being a competitive data warehousing solution.

Understanding the nature of your big data problem is a helpful first step in evaluating potential solutions. Let's remind ourselves of the definition of big data:

"Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn't fit the strictures of your database architectures. To gain value from this data, you must choose an alternative way to process it."

Big data problems vary in how heavily they weigh in on the axes of volume, velocity and variability. Predominantly structured yet large data, for example, may be most suited to an analytical database approach.

This survey makes the assumption that a data warehousing solution alone is not the answer to your problems, and concentrates on analyzing the commercial Hadoop ecosystem.
We'll focus on the solutions that incorporate storage and data processing, excluding those products which only sit above those layers, such as visualization or analytical workbench software.

Getting started with Hadoop doesn't require a large investment, as the software is open source and is also available instantly through the Amazon Web Services cloud. But for production environments, support, professional services, and training are often required.

Just Hadoop?

Apache Hadoop is unquestionably the center of the latest iteration of big data solutions. At its heart, Hadoop is a system for distributing computation among commodity servers. It is often used with the Hadoop Hive project, which layers data warehouse technology on top of Hadoop, enabling ad-hoc analytical queries.

Big data platforms divide along the lines of their approach to Hadoop. The big data offerings from familiar enterprise vendors incorporate a Hadoop distribution, while other platforms offer Hadoop connectors to their existing analytical database systems. This latter category tends to comprise massively parallel processing (MPP) databases that made their name in big data before Hadoop matured: Vertica and Aster Data. Hadoop's strength in these cases is in processing unstructured data in tandem with the analytical capabilities of the existing database on structured or semi-structured data.

Practical big data implementations don't in general fall neatly into either structured or unstructured data categories. You will invariably find Hadoop working as part of a system with a relational or MPP database.

Much as with Linux before it, no Hadoop solution incorporates the raw Apache Hadoop code. Instead, it's packaged into distributions. At a minimum, these distributions have been through a testing process, and often include additional components such as management and monitoring tools. The most well-used distributions now come from Cloudera, Hortonworks and MapR.
Not every distribution will be commercial, however: the BigTop project aims to create a Hadoop distribution under the Apache umbrella.

Integrated Hadoop Systems

The leading Hadoop enterprise software vendors have aligned their Hadoop products with the rest of their database and analytical offerings. These vendors don't require you to source Hadoop from another party, and offer it as a core part of their big data solutions. Their offerings integrate Hadoop into a broader enterprise setting, augmented by analytical and workflow tools.

EMC Greenplum

Database
• Greenplum Database
Deployment options
• Appliance (Modular Data Computing Appliance), Software (Enterprise Linux)
Hadoop
• Bundled distribution (Greenplum HD); Hive, Pig, Zookeeper, HBase
NoSQL component
• HBase
Links
• Home page, case study

Acquired by EMC, and rapidly taken to the heart of the company's strategy, Greenplum is a relative newcomer to the enterprise compared to other companies in this section. They have turned that to their advantage in creating an analytic platform, positioned as taking analytics "beyond BI" with agile data science teams.

Greenplum's Unified Analytics Platform (UAP) comprises three elements: the Greenplum MPP database, for structured data; a Hadoop distribution, Greenplum HD; and Chorus, a productivity and groupware layer for data science teams.

The HD Hadoop layer builds on MapR's Hadoop-compatible distribution, which replaces the file system with a faster implementation and provides other features for robustness. Interoperability between HD and Greenplum Database means that a single query can access both database and Hadoop data.

Chorus is a unique feature, and is indicative of Greenplum's commitment to the idea of data science and the importance of the agile team element to effectively exploiting big data.
It supports organizational roles from analysts, data scientists and DBAs through to executive business stakeholders.

As befits EMC's role in the data center market, Greenplum's UAP is available in a modular appliance configuration.

IBM

Database
• DB2
Deployment options
• Software (Enterprise Linux), Cloud
Hadoop
• Bundled distribution (InfoSphere BigInsights); Hive, Oozie, Pig, Zookeeper, Avro, Flume, HBase, Lucene
NoSQL component
• HBase
Links
• Home page, case study

IBM's InfoSphere BigInsights is their Hadoop distribution, and part of a suite of products offered under the "InfoSphere" information management brand. Everything big data at IBM is helpfully labeled Big, appropriately enough for a company affectionately known as "Big Blue."

BigInsights augments Hadoop with a variety of features, including management and administration tools. It also offers textual analysis tools that aid with entity resolution—identifying people, addresses, phone numbers, and so on.

IBM's Jaql query language provides a point of integration between Hadoop and other IBM products, such as relational databases or Netezza data warehouses.

InfoSphere BigInsights is interoperable with IBM's other database and warehouse products, including DB2, Netezza and its InfoSphere warehouse and analytics lines. To aid analytical exploration, BigInsights ships with BigSheets, a spreadsheet interface onto big data. IBM addresses streaming big data separately through its InfoSphere Streams product.

BigInsights is not currently offered in an appliance form, but can be used in the cloud via RightScale, Amazon, Rackspace, and IBM Smart Enterprise Cloud.
Microsoft

Database
• SQL Server
Deployment options
• Software (Windows Server), Cloud (Windows Azure)
Hadoop
• Bundled distribution (Big Data Solution); Hive, Pig
Links
• Home page, case study

Microsoft have adopted Hadoop as the center of their big data offering, and are pursuing an integrated approach aimed at making big data available through their analytical tool suite, including the familiar tools of Excel and PowerPivot.

Microsoft's Big Data Solution brings Hadoop to the Windows Server platform, and in elastic form to their cloud platform, Windows Azure.

Microsoft have packaged their own distribution of Hadoop, integrated with Windows Systems Center and Active Directory. They intend to contribute their changes back to Apache Hadoop to ensure that an open source version of Hadoop will run on Windows.

On the server side, Microsoft offer integrations to their SQL Server database and their data warehouse product. Using their warehouse solutions isn't mandated, however. The Hadoop Hive data warehouse is part of the Big Data Solution, including connectors from Hive to ODBC and Excel.

Microsoft's focus on the developer is evident in their creation of a JavaScript API for Hadoop. Using JavaScript, developers can create Hadoop jobs for MapReduce, Pig or Hive, even from a browser-based environment. Visual Studio and .NET integration with Hadoop is also provided.

Deployment is possible either on the server or in the cloud, or as a hybrid combination. Jobs written against the Apache Hadoop distribution should migrate with minimal changes to Microsoft's environment.
Oracle

Deployment options
• Appliance (Oracle Big Data Appliance)
Hadoop
• Bundled distribution (Cloudera's Distribution including Apache Hadoop); Hive, Oozie, Pig, Zookeeper, Avro, Flume, HBase, Sqoop, Mahout, Whirr
NoSQL component
• Oracle NoSQL Database
Links
• Home page

Announcing their entry into the big data market at the end of 2011, Oracle is taking an appliance-based approach. Their Big Data Appliance integrates Hadoop, R for analytics, a new Oracle NoSQL database, and connectors to Oracle's database and Exadata data warehousing product line.

Oracle's approach caters to the high-end enterprise market, and particularly leans to the rapid-deployment, high-performance end of the spectrum. It is the only vendor to include the popular R analytical language integrated with Hadoop, and to ship a NoSQL database of their own design as opposed to Hadoop's HBase.

Rather than developing their own Hadoop distribution, Oracle have partnered with Cloudera for Hadoop support, which brings them a mature and established Hadoop solution. Database connectors again promote the integration of structured Oracle data with the unstructured data stored in Hadoop HDFS.

Oracle's NoSQL Database is a scalable key-value database, built on the Berkeley DB technology. In that, Oracle owes double gratitude to Cloudera CEO Mike Olson, as he was previously the CEO of Sleepycat, the creators of Berkeley DB. Oracle are positioning their NoSQL database as a means of acquiring big data prior to analysis.

The Oracle R Enterprise product offers direct integration into the Oracle database, as well as Hadoop, enabling R scripts to run on data without having to round-trip it out of the data stores.

Availability

While IBM and Greenplum's offerings are available at the time of writing, the Microsoft and Oracle solutions are expected to be fully available early in 2012.
Analytical Databases with Hadoop Connectivity

MPP (massively parallel processing) databases are specialized for processing structured big data, as distinct from the unstructured data that is Hadoop's specialty. Along with Greenplum, Aster Data and Vertica were early pioneers of big data products before the mainstream emergence of Hadoop. These MPP solutions are databases specialized for analytical workloads and data integration, and provide connectors to Hadoop and data warehouses. A recent spate of acquisitions has seen these products become the analytical play of data warehouse and storage vendors: Teradata acquired Aster Data, EMC acquired Greenplum, and HP acquired Vertica.

Quick facts

Aster Data
• Database: MPP analytical database
• Deployment options: Appliance (Aster MapReduce Appliance), Software (Enterprise Linux), Cloud (Amazon EC2, Terremark and Dell Clouds)
• Hadoop: Hadoop connector available
• Links: Home page

ParAccel
• Database: MPP analytical database
• Deployment options: Software (Enterprise Linux), Cloud (Cloud Edition)
• Hadoop: Hadoop integration available
• Links: Home page, case study

Vertica
• Database: MPP analytical database
• Deployment options: Appliance (HP Vertica Appliance), Software (Enterprise Linux), Cloud (Cloud and Virtualized)
• Hadoop: Hadoop and Pig connectors available
• Links: Home page, case study

Hadoop-Centered Companies

Directly employing Hadoop is another route to creating a big data solution, especially where your infrastructure doesn't fall neatly into the product line of major vendors. Practically every database now features Hadoop connectivity, and there are multiple Hadoop distributions to choose from.

Reflecting the developer-driven ethos of the big data world, Hadoop distributions are frequently offered in a community edition. Such editions lack enterprise management features, but contain all the functionality needed for evaluation and development.
The first iterations of Hadoop distributions, from Cloudera and IBM, focused on usability and administration. We are now seeing the addition of performance-oriented improvements to Hadoop, such as those from MapR and Platform Computing. While maintaining API compatibility, these vendors replace slow or fragile parts of the Apache distribution with better performing or more robust components.

Cloudera

The longest-established provider of Hadoop distributions, Cloudera provides an enterprise Hadoop solution, alongside services, training and support options. Along with Yahoo, Cloudera have made deep open source contributions to Hadoop, and through hosting industry conferences have done much to establish Hadoop in its current position.

Hortonworks

Though a recent entrant to the market, Hortonworks have a long history with Hadoop. Spun off from Yahoo, where Hadoop originated, Hortonworks aims to stick close to and promote the core Apache Hadoop technology. Hortonworks also have a partnership with Microsoft to assist and accelerate their Hadoop integration.

Hortonworks Data Platform is currently in a limited preview phase, with a public preview expected in early 2012. The company also provides support and training.

An overview of Hadoop distributions (part 1)

Cloudera
• Product name: Cloudera's Distribution including Apache Hadoop
• Free edition: CDH, an integrated, tested distribution of Apache Hadoop
• Enterprise edition: Cloudera Enterprise, which adds a management software layer over CDH
• Hadoop components: Hive, Oozie, Pig, Zookeeper, Avro, Flume, HBase, Sqoop, Mahout, Whirr
• Security: Cloudera Manager (Kerberos, role-based administration and audit trails)
• Admin interface: Cloudera Manager (centralized management and alerting)
• Job management: Cloudera Manager (job analytics, monitoring and log search)
• HDFS access: Fuse-DFS (mount HDFS as a traditional filesystem)
• Installation: Cloudera Manager (wizard-based deployment)

EMC Greenplum
• Product name: Greenplum HD
• Free edition: Community Edition, a 100% open source certified and supported version of the Apache Hadoop stack
• Enterprise edition: Enterprise Edition, which integrates MapR's M5 Hadoop-compatible distribution, replaces HDFS with MapR's C++-based file system, and includes MapR management tools
• Hadoop components: Hive, Pig, Zookeeper, HBase
• Admin interface: MapR Heatmap cluster administrative tools
• Job management: high-availability job management (JobTracker HA and Distributed NameNode HA prevent lost jobs, restarts and failover incidents)
• Database connectors: Greenplum Database
• HDFS access: NFS (access HDFS as a conventional network file system)

Hortonworks
• Product name: Hortonworks Data Platform
• Hadoop components: Hive, Pig, Zookeeper, HBase, Ambari
• Admin interface and job management: Apache Ambari (monitoring, administration and lifecycle management for Hadoop clusters)
• HDFS access: WebHDFS (REST API to HDFS)

IBM
• Product name: InfoSphere BigInsights
• Free edition: Basic Edition, an integrated Hadoop distribution
• Enterprise edition: Enterprise Edition, a Hadoop distribution plus BigSheets spreadsheet interface, scheduler, text analytics, indexer, JDBC connector and security support
• Hadoop components: Hive, Oozie, Pig, Zookeeper, Avro, Flume, HBase, Lucene
• Security: LDAP authentication, role-based authorization, reverse proxy
• Admin interface: administrative features including Hadoop HDFS and MapReduce administration, cluster and server management, and viewing HDFS file content
• Job management: job creation, submission, cancellation, status and logging
• Database connectors: DB2, Netezza, InfoSphere Warehouse
• Installation: GUI-driven quick installation tool
• Additional APIs: Jaql, a functional, declarative query language designed to process large data sets

An overview of Hadoop distributions (part 2)

MapR
• Product name: MapR
• Free edition: MapR M3 Edition, a free community edition incorporating MapR's performance increases
• Enterprise edition: MapR M5 Edition, which augments M3 with high availability and data protection features
• Hadoop components: Hive, Pig, Flume, HBase, Sqoop, Mahout, Oozie
• Admin interface: MapR Heatmap cluster administrative tools
• Job management: high-availability job management (JobTracker HA and Distributed NameNode HA prevent lost jobs, restarts and failover incidents)
• HDFS access: NFS (access HDFS as a conventional network file system)
• Additional APIs: REST API
• Volume management: mirroring, snapshots

Microsoft
• Product name: Big Data Solution
• Edition: a Windows Hadoop distribution, integrated with Microsoft's database and analytical products
• Hadoop components: Hive, Pig
• Security: Active Directory integration
• Admin interface: System Center integration
• Database connectors: SQL Server, SQL Server Parallel Data Warehouse
• Interop features: Hive ODBC Driver, Excel Hive Add-in
• Additional APIs: JavaScript API (JavaScript Map/Reduce jobs, Pig-Latin and Hive queries)

Platform Computing
• Product name: Platform MapReduce
• Free edition: Platform MapReduce Developer Edition, an evaluation edition excluding the resource management features of the regular edition
• Enterprise edition: Platform MapReduce, an enhanced runtime for Hadoop MapReduce, API-compatible with Apache Hadoop
• Admin interface: Platform MapReduce Workload Manager
• Additional APIs: includes R, C/C++, C#, Java, Python

Notes

• Pure cloud solutions: Both Amazon Web Services and Google offer cloud-based big data solutions. These will be reviewed separately.

• HPCC: Though dominant, Hadoop is not the only big data solution. LexisNexis' HPCC offers an alternative approach.

• Hadapt: not yet featured in this survey.
Taking a different approach from both Hadoop-centered and MPP solutions, Hadapt integrates unstructured and structured data into one product: wrapping rather than exposing Hadoop. It is currently in "early access" stage.

• NoSQL: Solutions built on databases such as Cassandra, MongoDB and Couchbase are not in the scope of this survey, though these databases do offer Hadoop integration.

• Errors and omissions: given the fast-evolving nature of the market and the variable quality of public information, any feedback about errors and omissions from this survey is most welcome. Please send it to edd+bigdata@oreilly.com.

Microsoft's Plan for Big Data

Edd Dumbill

Microsoft has placed Apache Hadoop at the core of its big data strategy. It's a move that might seem surprising to the casual observer, being a somewhat enthusiastic adoption of a significant open source product.

The reason for this move is that Hadoop, by its sheer popularity, has become the de facto standard for distributed data crunching. By embracing Hadoop, Microsoft allows its customers to access the rapidly growing Hadoop ecosystem and take advantage of a growing talent pool of Hadoop-savvy developers.

Microsoft's goals go beyond integrating Hadoop into Windows. It intends to contribute the adaptations it makes back to the Apache Hadoop project, so that anybody can run a purely open source Hadoop on Windows.

Microsoft's Hadoop Distribution

The Microsoft distribution of Hadoop is currently in "Customer Technology Preview" phase. This means it is undergoing evaluation in the field by groups of customers. The expected release time is toward the middle of 2012, but will be influenced by the results of the technology preview program.

Microsoft's Hadoop distribution is usable either on-premise with Windows Server, or in Microsoft's cloud platform, Windows Azure. The core of the product is in the MapReduce, HDFS, Pig and Hive components of Hadoop.
These are certain to ship in the 1.0 release. As Microsoft's aim is for 100% Hadoop compatibility, it is likely that additional components of the Hadoop ecosystem such as Zookeeper, HBase, HCatalog and Mahout will also be shipped.

Additional components integrate Hadoop with Microsoft's ecosystem of business intelligence and analytical products:

• Connectors for Hadoop, integrating it with SQL Server and SQL Server Parallel Data Warehouse.

• An ODBC driver for Hive, permitting any Windows application to access and run queries against the Hive data warehouse.

• An Excel Hive Add-in, which enables the movement of data directly from Hive into Excel or PowerPivot.

On the back end, Microsoft offers Hadoop performance improvements, integration with Active Directory to facilitate access control, and integration with System Center for administration and management.

Developers, Developers, Developers

One of the most interesting features of Microsoft's work with Hadoop is the addition of a JavaScript API. Working with Hadoop at a programmatic level can be tedious: this is why higher-level languages such as Pig emerged.

Driven by its focus on the software developer as an important customer, Microsoft chose to add a JavaScript layer to the Hadoop ecosystem. Developers can use it to create MapReduce jobs, and even interact with Pig and Hive from a browser environment.

The real advantage of the JavaScript layer should show itself in integrating Hadoop into a business environment, making it easy for developers to create intranet analytical environments accessible by business users. Combined with Microsoft's focus on bringing server-side JavaScript to Windows and Azure through Node.js, this gives an interesting glimpse into Microsoft's view of where developer enthusiasm and talent will lie.
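Whether written in JavaScript, Pig, or Java, these jobs all reduce to the same MapReduce pattern. For readers new to the model, here is a minimal in-process sketch in Python: the mapper and reducer are the real idea, while the few lines of "framework" that group keys between them are purely illustrative, standing in for Hadoop itself.

```python
from collections import defaultdict
from itertools import chain

def mapper(line):
    # Emit an intermediate (key, value) pair for every word seen.
    for word in line.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    # Combine all values collected for one key.
    return (word, sum(counts))

def run_mapreduce(lines):
    # "Shuffle" phase: group intermediate pairs by key, as Hadoop
    # would do between its map and reduce stages.
    groups = defaultdict(list)
    for key, value in chain.from_iterable(mapper(l) for l in lines):
        groups[key].append(value)
    return dict(reducer(k, v) for k, v in sorted(groups.items()))

print(run_mapreduce(["big data big clusters", "data everywhere"]))
# {'big': 2, 'clusters': 1, 'data': 2, 'everywhere': 1}
```

In a real cluster the mapper and reducer run in parallel across many machines, and the shuffle moves data over the network; only the two user-supplied functions change.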
It's also good news for the broader Hadoop community, as Microsoft intends to contribute its JavaScript API to the Apache Hadoop open source project itself.

The other half of Microsoft's software development environment is of course the .NET platform. With Microsoft's Hadoop distribution, it will be possible to create MapReduce jobs from .NET, though by using the Hadoop APIs directly. It is likely that higher-level interfaces will emerge in future releases. The same applies to Visual Studio, which over time will get increasing levels of Hadoop project support.

Streaming Data and NoSQL

Hadoop covers part of the big data problem, but what about streaming data processing or NoSQL databases? The answer comes in two parts, covering existing Microsoft products and future Hadoop-compatible solutions.

Microsoft has some established products: its streaming data solution, StreamInsight, and, for NoSQL, Windows Azure's Azure Tables.

Looking to the future, the commitment to Hadoop compatibility means that streaming data solutions and NoSQL databases designed to be part of the Hadoop ecosystem should work with the Microsoft distribution; HBase itself will ship as a core offering. It seems likely that solutions such as S4 will prove compatible.

Toward an Integrated Environment

Now that Microsoft is on the way to integrating the major components of big data tooling, does it intend to join it all together to provide an integrated data science platform for businesses?

That's certainly the vision, according to Madhu Reddy, senior product planner for Microsoft Big Data: "Hadoop is primarily for developers. We want to enable people to use the tools they like."

The strategy to achieve this involves entry points at multiple levels: for developers, analysts and business users. Rather than choosing one particular analytical platform, Microsoft will focus on interoperability with existing tools.
Excel is an obvious priority, but other tools are also important to the company. According to Reddy, data scientists represent a spectrum of preferences: while Excel is a ubiquitous and popular choice, other customers use Matlab, SAS, or R, for example.

The Data Marketplace

One thing unique to Microsoft as a big data and cloud platform is its data market, Windows Azure Marketplace. Mixing external data, such as geographical or social data, with your own can generate revealing insights. But it's hard to find data, be confident of its quality, and purchase it conveniently. That's where data marketplaces meet a need.

The availability of the Azure marketplace, integrated with Microsoft's tools, gives analysts a ready source of external data with some guarantees of quality. Marketplaces are in their infancy now, but will play a growing role in the future of data-driven business.

Summary

The Microsoft approach to big data has ensured the continuing relevance of its Windows platform for web-era organizations, and makes its cloud services a competitive choice for data-centered businesses. Appropriately enough for a company with a large and diverse software ecosystem of its own, the Microsoft approach is one of interoperability. Rather than laying out a golden path for big data, as suggested by the appliance-oriented approach of others, Microsoft is focusing heavily on integration.

The strength of this approach lies in Microsoft's choice to embrace and work with the Apache Hadoop community, enabling the migration of new tools and talented developers to its platform.

Big Data in the Cloud

Edd Dumbill

Big data and cloud technology go hand in hand. Big data needs clusters of servers for processing, which clouds can readily provide. So goes the marketing message, but what does that look like in reality? Both "cloud" and "big data" have broad definitions, obscured by considerable hype.
This article breaks down the landscape as simply as possible, highlighting what's practical, and what's to come.

IaaS and Private Clouds

What is often called "cloud" amounts to virtualized servers: computing resource that presents itself as a regular server, rentable per consumption. This is generally called infrastructure as a service (IaaS), and is offered by platforms such as Rackspace Cloud or Amazon EC2. You buy time on these services, and install and configure your own software, such as a Hadoop cluster or NoSQL database. Most of the solutions I described in my Big Data Market Survey can be deployed on IaaS services.

Using IaaS clouds doesn't mean you must handle all deployment manually: good news for the clusters of machines big data requires. You can use orchestration frameworks, which handle the management of resources, and automated infrastructure tools, which handle server installation and configuration. RightScale offers a commercial multi-cloud management platform that mitigates some of the problems of managing servers in the cloud.

Frameworks such as OpenStack and Eucalyptus aim to present a uniform interface to both private data centers and the public cloud. Attracting a strong flow of cross-industry support, OpenStack currently addresses computing resource (akin to Amazon's EC2) and storage (paralleling Amazon's S3).

The race is on to make private clouds and IaaS services more usable: over the next two years using clouds should become much more straightforward as vendors adopt the nascent standards. There'll be a uniform interface, whether you're using public or private cloud facilities, or a hybrid of the two.

Particular to big data, several configuration tools already target Hadoop explicitly: among them Dell's Crowbar, which aims to make deploying and configuring clusters simple, and Apache Whirr, which is specialized for running Hadoop services and other clustered data processing systems.
Today, using IaaS gives you a broad choice of cloud supplier, the option of using a private cloud, and complete control: but you'll be responsible for deploying, managing and maintaining your clusters.

Platform solutions

Using IaaS only brings you so far with big data applications: IaaS services handle the creation of computing and storage resources, but don't address anything at a higher level. The setup of Hadoop and Hive or a similar solution is down to you.

Beyond IaaS, several cloud services provide application-layer support for big data work. Sometimes referred to as managed solutions, or platform as a service (PaaS), these services remove the need to configure or scale things such as databases or MapReduce, reducing your workload and maintenance burden. Additionally, PaaS providers can realize great efficiencies by hosting at the application level, and pass those savings on to the customer.

The general PaaS market is burgeoning, with major players including VMware (Cloud Foundry) and Salesforce (Heroku, force.com). As big data and machine learning requirements percolate through the industry, these players are likely to add their own big-data-specific services. For the purposes of this article, though, I will be sticking to the vendors who have already implemented big data solutions.

Today's primary providers of such big data platform services are Amazon, Google and Microsoft. You can see their offerings summarized in the table toward the end of this article. Both Amazon Web Services and Microsoft's Azure blur the lines between infrastructure as a service and platform: you can mix and match. By contrast, Google's philosophy is to skip the notion of a server altogether, and focus only on the concept of the application. Among these, only Amazon can lay claim to extensive experience with their product.

Amazon Web Services

Amazon has significant experience in hosting big data processing.
Use of Amazon EC2 for Hadoop was a popular and natural move for many early adopters of big data, thanks to Amazon's expandable supply of compute power. Building on this, Amazon launched Elastic Map Reduce in 2009, providing a hosted, scalable Hadoop service.

Applications on Amazon's platform can pick from the best of both the IaaS and PaaS worlds. General-purpose EC2 servers host applications that can then access the appropriate special-purpose managed solutions provided by Amazon.

As well as Elastic Map Reduce, Amazon offers several other services relevant to big data, such as the Simple Queue Service for coordinating distributed computing, and a hosted relational database service. At the specialist end of big data, Amazon's High Performance Computing solutions are tuned for the low-latency cluster computing required by scientific and engineering applications.

Elastic Map Reduce

Elastic Map Reduce (EMR) can be programmed in the usual Hadoop ways, through Pig, Hive or another programming language, and uses Amazon's S3 storage service to get data in and out.

Access to Elastic Map Reduce is through Amazon's SDKs and tools, or with GUI analytical and IDE products such as those offered by Karmasphere. In conjunction with these tools, EMR represents a strong option for experimental and analytical work. Amazon's pricing also makes EMR a much more attractive option than configuring EC2 instances yourself to run Hadoop.

When integrating Hadoop with applications generating structured data, using S3 as the main data source can be unwieldy. This is because, similar to Hadoop's HDFS, S3 works at the level of storing blobs of opaque data. Hadoop's answer to this is HBase, a NoSQL database that integrates with the rest of the Hadoop stack. Unfortunately, Amazon does not currently offer HBase with Elastic Map Reduce.

DynamoDB

Instead of HBase, Amazon provides DynamoDB, its own managed, scalable NoSQL database.
As this is a managed solution, it represents a better choice than running your own database on top of EC2, in terms of both performance and economy. DynamoDB data can be exported to and imported from S3, providing interoperability with EMR.

Google

Google's cloud platform stands out as distinct from its competitors. Rather than offering virtualization, it provides an application container with defined APIs and services. Developers do not need to concern themselves with the concept of machines: applications execute in the cloud, getting access to as much processing power as they need, within defined resource usage limits.

To use Google's platform, you must work within the constraints of its APIs. However, if that fits, you can reap the benefits of the security, tuning and performance improvements inherent to the way Google develops all its services.

AppEngine, Google's cloud application hosting service, offers a MapReduce facility for parallel computation over data, but this is more a feature for use as part of complex applications than for analytical purposes. Instead, BigQuery and the Prediction API form the core of Google's big data offering, respectively offering analysis and machine learning facilities. Both these services are available exclusively via REST APIs, consistent with Google's vision for web-based computing.

BigQuery

BigQuery is an analytical database, suitable for interactive analysis over datasets on the order of 1TB. It works best on a small number of tables with a large number of rows. BigQuery offers a familiar SQL interface to its data. In that, it is comparable to Apache Hive, but its typical performance is faster, making BigQuery a good choice for exploratory data analysis.

Getting data into BigQuery is a matter of directly uploading it, or importing it from Google's Cloud Storage system. This is the aspect of BigQuery with the biggest room for improvement.
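To make the "familiar SQL interface" concrete, here is the kind of aggregate query BigQuery is aimed at, sketched with Python's built-in sqlite3 standing in for BigQuery itself; the table name, columns and figures are invented for illustration, and a real BigQuery table would hold billions of rows rather than five.

```python
import sqlite3

# In-memory database as a stand-in for a hosted analytical store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pageviews (country TEXT, views INTEGER)")
conn.executemany(
    "INSERT INTO pageviews VALUES (?, ?)",
    [("US", 120), ("US", 80), ("IS", 30), ("DE", 55), ("DE", 15)],
)

# A typical exploratory aggregate: total views per country.
rows = conn.execute(
    "SELECT country, SUM(views) AS total "
    "FROM pageviews GROUP BY country ORDER BY total DESC"
).fetchall()
print(rows)  # [('US', 200), ('DE', 70), ('IS', 30)]
```

The point of a service like BigQuery is that queries of exactly this shape stay interactive even when the table no longer fits on one machine.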
Whereas Amazon's S3 lets you mail in disks for import, Google doesn't currently have this facility. Streaming data into BigQuery isn't viable either, so regular imports are required for constantly updating data. Finally, as BigQuery only accepts data formatted as comma-separated values (CSV) files, you will need to use external methods to clean up the data beforehand.

Rather than provide end-user interfaces itself, Google wants an ecosystem to grow around BigQuery, with vendors incorporating it into their products, in the same way Elastic Map Reduce has acquired tool integration. Currently in beta test, to which anybody can apply, BigQuery is expected to be publicly available during 2012.

Prediction API

Many uses of machine learning are well defined, such as classification, sentiment analysis, or recommendation generation. To meet these needs, Google offers its Prediction API product.

Applications using the Prediction API work by creating and training a model hosted within Google's system. Once trained, this model can be used to make predictions, such as spam detection. Google is working on allowing these models to be shared, optionally for a fee. This will let you take advantage of previously trained models, which in many cases will save you both time and the expertise needed for training.

Though promising, Google's offerings are in their early days. Further integration between its services is required, as well as time for ecosystem development to make their tools more approachable.

Microsoft

I have written in some detail about Microsoft's big data strategy in Microsoft's plan for Hadoop and big data. By offering its data platforms on Windows Azure in addition to Windows Server, Microsoft's aim is to make either on-premise or cloud-based deployments equally viable with its technology. Azure parallels Amazon's web service offerings in many ways, offering a mix of IaaS services with managed applications such as SQL Server.
Hadoop is the central pillar of Microsoft's big data approach, surrounded by the ecosystem of its own database and business intelligence tools. For organizations already invested in the Microsoft platform, Azure will represent the smoothest route for integrating big data into the operation. Azure itself is pragmatic about language choice, supporting technologies such as Java, PHP and Node.js in addition to Microsoft's own.

As with Google's BigQuery, Microsoft's Hadoop solution is currently in closed beta test, and is expected to be generally available sometime in the middle of 2012.

Big data cloud platforms compared

The following table summarizes the data storage and analysis capabilities of Amazon, Google and Microsoft's cloud platforms. Intentionally excluded are IaaS solutions without dedicated big data offerings.

(Columns: Amazon | Google | Microsoft)

Product(s): Amazon Web Services | Google Cloud Services | Windows Azure

Big data storage: S3 | Cloud Storage | HDFS on Azure

Working storage: Elastic Block Store | AppEngine (Datastore, Blobstore) | Blob, table, queues

NoSQL database: DynamoDB (1) | AppEngine Datastore | Table storage

Relational database: Relational Database Service (MySQL or Oracle) | Cloud SQL (MySQL compatible) | SQL Azure

Application hosting: EC2 | AppEngine | Azure Compute

Map/Reduce service: Elastic MapReduce (Hadoop) | AppEngine (limited capacity) | Hadoop on Azure (2)

Big data analytics: Elastic MapReduce (Hadoop interface (3)) | BigQuery (2) (TB-scale, SQL interface) | Hadoop on Azure (Hadoop interface (3))

Machine learning: Via Hadoop + Mahout on EMR or EC2 | Prediction API | Mahout with Hadoop

Streaming processing: Nothing prepackaged: use custom solution on EC2 | Prospective Search API (4) | StreamInsight (2) ("Project Austin")

Data import: Network, physically ship drives | Network | Network

Data sources: Public Data Sets | A few sample datasets | Windows Azure Marketplace

Availability: Public production | Some services in private beta | Some services in private beta

Conclusion

Cloud-based big
data services offer considerable advantages in removing the overhead of configuring and tuning your own clusters, and in ensuring you pay only for what you use. The biggest issue is always going to be data locality, as it is slow and expensive to ship data. The most effective big data cloud solutions will be the ones where the data is also collected in the cloud. This is an incentive to investigate EC2, Azure or AppEngine as a primary application platform, and an indicator that PaaS competitors such as Cloud Foundry and Heroku will have to address big data as a priority.

It is early days yet for big data in the cloud, with only Amazon offering battle-tested solutions at this point. Cloud services themselves are at an early stage, and we will see both increasing standardization and innovation over the next two years.

However, the twin advantages of not having to worry about infrastructure and economies of scale mean it is well worth investigating cloud services for your big data needs, especially for an experimental or green-field project. Looking to the future, there's no doubt that big data analytical capability will form an essential component of utility computing solutions.

Notes

(1) In public beta.
(2) In controlled beta test.
(3) Hive and Pig compatible.
(4) Experimental status.

Data Marketplaces

Edd Dumbill

The sale of data is a venerable business, and has existed since the middle of the 19th century, when Paul Reuter began providing telegraphed stock exchange prices between Paris and London, and New York newspapers founded the Associated Press.

The web has facilitated a blossoming of information providers. As the ability to discover and exchange data improves, the need to rely on aggregators such as Bloomberg or Thomson Reuters is declining. This is a good thing: the business models of large aggregators do not readily scale to web startups, or casual use of data in analytics.
Instead, data is increasingly offered through online marketplaces: platforms that host data from publishers and offer it to consumers. This article provides an overview of the most mature data markets, and contrasts their different approaches and facilities.

What Do Marketplaces Do?

Most of the consumers of data from today's marketplaces are developers. By adding another dataset to your own business data, you can create insight. To take an example from web analytics: by mixing an IP address database with the logs from your website, you can understand where your customers are coming from; if you then add demographic data to the mix, you have some idea of their socio-economic bracket and spending ability.

Such insight isn't limited to analytic use; you can also use it to provide value back to a customer, for instance by recommending restaurants local to the vicinity of a lunchtime appointment in their calendar. While many datasets are useful, few are as potent as location data in the way they provide context to activity.

Marketplaces are useful in three major ways. First, they provide a point of discoverability and comparison for data, along with indicators of quality and scope. Second, they handle the cleaning and formatting of the data, so it is ready for use (often 80% of the work in any data integration). Finally, marketplaces provide an economic model for broad access to data that would otherwise prove difficult to either publish or consume.

In general, one of the important barriers to the development of the data marketplace economy is the ability of enterprises to store and make use of the data. A principle of big data is that it's often easier to move your computation to the data, rather than the reverse. Because of this, we're seeing increasing integration between cloud computing facilities and data markets: Microsoft's data market is tied to its Azure cloud, and Infochimps offers hosted compute facilities.
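The web analytics example above (joining an IP intelligence database against your own logs) can be sketched in a few lines of Python. Both datasets here are tiny invented samples, standing in for what a marketplace would supply; real IP intelligence data maps address ranges, not single addresses.

```python
# Purchased dataset: IP address -> country (invented sample).
ip_geo = {
    "203.0.113.7": "Iceland",
    "198.51.100.4": "Germany",
}

# Your own business data: web server hits (invented sample).
web_log = [
    {"ip": "203.0.113.7", "path": "/pricing"},
    {"ip": "198.51.100.4", "path": "/docs"},
    {"ip": "192.0.2.9", "path": "/"},
]

def enrich(log, geo):
    # Attach a country to each hit; fall back to "unknown" when the
    # external dataset has no entry for that address.
    return [dict(hit, country=geo.get(hit["ip"], "unknown")) for hit in log]

for hit in enrich(web_log, ip_geo):
    print(hit["country"], hit["path"])
```

The join itself is trivial; the value, as the article argues, lies in someone else having already collected, cleaned, and kept current the `ip_geo` side of it.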
In the short term, it's probably easier to export data from your business systems to a cloud platform than to try to expand internal systems to integrate external sources.

While cloud solutions offer a route forward, some marketplaces also make the effort to target end users. Microsoft's data marketplace can be accessed directly through Excel, and DataMarket provides online visualization and exploration tools.

The four most established data marketplaces are Infochimps, Factual, Microsoft Windows Azure Data Marketplace, and DataMarket. A table comparing these providers is presented at the end of this article, and a brief discussion of each marketplace follows.

Infochimps

According to founder Flip Kromer, Infochimps was created to give data life in the same way that code hosting projects such as SourceForge or GitHub give life to code. You can improve code and share it: Kromer wanted the same for data.

The driving goal behind Infochimps is to connect every public and commercially available database in the world to a common platform. Infochimps realized that there's an important network effect of "data with the data": the best way to build a data commons and a data marketplace is to put them together in the same place. The proximity of other data makes all the data more valuable, because of the ease with which it can be found and combined.

The biggest challenge in the two years Infochimps has been operating is that of bootstrapping: a data market needs both supply and demand. Infochimps' approach is to go for a broad horizontal range of data, rather than specialize. According to Kromer, this is because they view data's value as being in the context it provides: in giving users more insight about their own data.
To join up data points into a context, common identities are required. For example, a web page view can be given a geographical location by joining the IP address of the page request with that in an IP intelligence database.

The benefit of common identities and data integration is where hosting data together really shines: Infochimps only needs to integrate the data once for customers to reap continued benefit. Infochimps sells datasets which are pre-cleaned and integrated mash-ups of those from their providers.

By launching a big data cloud hosting platform alongside its marketplace, Infochimps is seeking to build on the importance of data locality.

Factual

Factual was envisioned by founder and CEO Gil Elbaz as an open data platform, with tools that could be leveraged by community contributors to improve data quality. The vision is very similar to that of Infochimps, but in late 2010 Factual elected to concentrate on one area of the market: geographical and place data. Rather than pursue a broad strategy, the idea is to become a proven and trusted supplier in one vertical, then expand. With customers such as Facebook, Factual's strategy is paying off. According to Elbaz, Factual will look to expand into verticals other than local information in 2012. It is moving one vertical at a time due to the marketing effort required in building a quality community and relationships around the data.

Unlike the other main data markets, Factual does not offer reselling facilities for data publishers. Elbaz hasn't found that the cash on offer is attractive enough for many organizations to want to share their data. Instead, he believes that the best way to get data you want is to trade other data, which could provide business value far beyond the returns of publishing data in exchange for cash. Factual offers incentives to their customers to share data back, improving the quality of the data for everybody.
Windows Azure Data Marketplace

Launched in 2010, Microsoft's Windows Azure Data Marketplace sits alongside the company's Applications marketplace as part of the Azure cloud platform. Microsoft's data market is positioned with a very strong integration story, both at the cloud level and with end-user tooling.

Through use of a standard data protocol, OData, Microsoft offers a well-defined web interface for data access, including queries. As a result, programs such as Excel and PowerPivot can directly access marketplace data, giving Microsoft a strong capability to integrate external data into the existing tooling of the enterprise. In addition, OData support is available for a broad array of programming languages.

Azure Data Marketplace has a strong emphasis on connecting data consumers to publishers, and most closely approximates the popular concept of an "iTunes for Data." Big-name data suppliers such as Dun & Bradstreet and ESRI can be found among the publishers. The marketplace contains a good range of data across many commercial use cases, and tends to be limited to one provider per dataset; Microsoft has maintained a strong filter on the reliability and reputation of its suppliers.

DataMarket

Where the other three main data marketplaces put a strong focus on developer and IT customers, DataMarket caters to the end user as well. Realizing that interacting with bland tables wasn't engaging users, founder Hjalmar Gislason worked to add interactive visualization to his platform.

The result is a data marketplace that is immediately useful for researchers and analysts. The range of DataMarket's data follows this audience too, with a strong emphasis on country data and economic indicators. Much of the data is available for free, with premium data paid for at the point of use.

DataMarket has recently made a significant play for data publishers, with the emphasis on publishing, not just selling, data.
Through a variety of plans, customers can use DataMarket's platform to publish and sell their data, and embed charts in their own pages. At the enterprise end of their packages, DataMarket offers an interactive branded data portal integrated with the publisher's own web site and user authentication system. Initial customers of this plan include Yankee Group and Lux Research.

Data Markets Compared

(Columns: Azure | DataMarket | Factual | Infochimps)

Data sources: Broad range | Range, with a focus on country and industry stats | Geo-specialized, some other datasets | Range, with a focus on geo, social and web sources

Free data: Yes | Yes | - | Yes

Free trials of paid data: Yes | - | Yes, limited free use of APIs | -

Delivery: OData API | API, downloads | API, downloads for heavy users | API, downloads

Application hosting: Windows Azure | - | - | Infochimps Platform

Previewing: Service Explorer | Interactive visualization | Interactive search | -

Tool integration: Excel, PowerPivot, Tableau and other OData consumers | - | Developer tool integrations | -

Data publishing: Via database connection or web service | Upload or web/database connection | Via upload or web service | Upload

Data reselling: Yes, 20% commission on non-free datasets | Yes; fees and commissions vary; ability to create branded data market | - | Yes, 30% commission on non-free datasets

Launched: 2010 | 2010 | 2007 | 2009

Other Data Suppliers

While this article has focused on the more general-purpose marketplaces, several other data suppliers are worthy of note.

Social data: Gnip and DataSift specialize in offering social media data streams, in particular Twitter.

Linked data: Kasabi, currently in beta, is a marketplace that is distinctive for hosting all its data as Linked Data, accessible via web standards such as SPARQL and RDF.
Wolfram Alpha: Perhaps the most prolific integrator of diverse databases, Wolfram Alpha recently added a Pro subscription level that permits the end user to download the data resulting from a computation.

Mike Loukides is Vice President of Content Strategy for O'Reilly Media, Inc. He's edited many highly regarded books on technical subjects that don't involve Windows programming. He's particularly interested in programming languages, Unix and what passes for Unix these days, and system and network administration. Mike is the author of "System Performance Tuning" and a coauthor of "Unix Power Tools." Most recently, he's been fooling around with data and data analysis, languages like R, Mathematica, and Octave, and thinking about how to make books social.

The NoSQL Movement

Mike Loukides

In a conversation last year, Justin Sheehy, CTO of Basho, described NoSQL as a movement, rather than a technology. This description immediately felt right; I've never been comfortable talking about NoSQL, which when taken literally extends from the minimalist Berkeley DB (commercialized as Sleepycat, now owned by Oracle) to the big-iron HBase, with detours into software as fundamentally different as Neo4J (a graph database) and FluidDB (which defies description).

But what does it mean to say that NoSQL is a movement rather than a technology? We certainly don't see picketers outside Oracle's headquarters. Justin said succinctly that NoSQL is a movement for choice in database architecture. There is no single overarching technical theme; a single technology would belie the principles of the movement.

Think of the last 15 years of software development. We've gotten very good at building large, database-backed applications. Many of them are web applications, but even more of them aren't. "Software architect" is a valid job description; it's a position to which many aspire. But what do software architects do?
They specify the high-level design of applications: the front end, the APIs, the middleware, the business logic. The back end? Well, maybe not. Since the 80s, the dominant back end of business systems has been a relational database, whether Oracle, SQL Server or DB2. That's not much of an architectural choice. Those are all great products, but they're essentially similar, as are all the other relational databases. And it's remarkable that we've explored many architectural variations in the design of clients, front ends, and middleware, on a multitude of platforms and frameworks, but haven't until recently questioned the architecture of the back end. Relational databases have been a given.

Many things have changed since the advent of relational databases:

• We're dealing with much more data. Although advances in storage capacity and CPU speed have allowed the databases to keep pace, we're in a new era where size itself is an important part of the problem, and any significant database needs to be distributed.

• We require sub-second responses to queries. In the 80s, most database queries could run overnight as batch jobs. That's no longer acceptable. While some analytic functions can still run as overnight batch jobs, we've seen the Web evolve from static files to complex database-backed sites, and that requires sub-second response times for most queries.

• We want applications to be up 24/7. Setting up redundant servers for static HTML files is easy, but database replication in a complex database-backed application is another matter.

• We're seeing many applications in which the database has to soak up data as fast as (or even much faster than) it processes queries: in a logging application, or a distributed sensor application, writes can be much more frequent than reads. Batch-oriented ETL (extract, transform, and load) hasn't disappeared, and won't, but capturing high-speed data flows is increasingly important.
• We’re frequently dealing with changing data or with unstructured data. The data we collect, and how we use it, grows over time in unpredictable ways. Unstructured data isn’t a particularly new feature of the data landscape, since unstructured data has always existed, but we’re increasingly unwilling to force a structure on data a priori.

• We’re willing to sacrifice our sacred cows. We know that consistency, isolation, and other properties are very valuable, of course. But so are some other things, like latency and availability and not losing data even if our primary server goes down. The challenges of modern applications make us realize that sometimes we might need to weaken one of these constraints in order to achieve another.

These changing requirements lead us to different tradeoffs and compromises when designing software. They require us to rethink what we require of a database, and to come up with answers aside from the relational databases that have served us well over the years. So let’s look at these requirements in somewhat more detail.

Size, Response, Availability

It’s a given that any modern application is going to be distributed. The size of modern datasets is only one reason for distribution, and not the most important. Modern applications (particularly web applications) have many concurrent users who demand reasonably snappy response. In their 2009 Velocity Conference talk, Performance Related Changes and their User Impact, Eric Schurman and Jake Brutlag showed results from independent research projects at Google and Microsoft. Both projects demonstrated that imperceptibly small increases in response time cause users to move to another site; if response time is over a second, you’re losing a very measurable percentage of your traffic.
If you’re not building a web application—say you’re doing business analytics, with complex, time-consuming queries—the world has changed, and users now expect business analytics to run in something like real time. Maybe not the sub-second latency required for web users, but queries that run overnight are no longer acceptable. Queries that run while you go out for coffee are marginal. It’s not just a matter of convenience; the ability to run dozens or hundreds of queries per day changes the nature of the work you do. You can be more experimental: you can follow through on hunches and hints based on earlier queries. That kind of spontaneity was impossible when research went through the DBA at the data warehouse.

Whether you’re building a customer-facing application or doing internal analytics, scalability is a big issue. Vertical scalability (buy a bigger, faster machine) always runs into limits. Now that the laws of physics have stalled Intel-architecture clock speeds in the 3.5GHz range, those limits are more apparent than ever. Horizontal scalability (build a distributed system with more nodes) is the only way to scale indefinitely. You’re scaling horizontally even if you’re only buying single boxes: it’s been a long time since I’ve seen a server (or even a high-end desktop) that doesn’t sport at least four cores. Horizontal scalability is tougher when you’re scaling across racks of servers at a colocation facility, but don’t be deceived: that’s how scalability works in the 21st century, even on your laptop. Even in your cell phone. We need database technologies that aren’t just fast on single servers: they must also scale across multiple servers.

Modern applications also need to be highly available. That goes without saying, but think about how the meaning of “availability” has changed over the years. Not much more than a decade ago, a web application would have a single HTTP server that handed out static files.
These applications might be data-driven, but “data-driven” meant that a batch job rebuilt the web site overnight, and user transactions were queued into a batch processing system, again for processing overnight. Keeping such a system running isn’t terribly difficult. High availability doesn’t impact the database: if the database is only engaged in batched rebuilds or transaction processing, the database can crash without damage. That’s the world for which relational databases were designed: in the 80s, if your mainframe ran out of steam, you got a bigger one. If it crashed, you were down. But when databases became a living, breathing part of the application, availability became an issue. There is no way to make a single system highly available; as soon as any component fails, you’re toast. Highly available systems are, by nature, distributed systems.

If a distributed database is a given, the next question is how much work a distributed system will require. There are fundamentally two options: databases that have to be distributed manually, via sharding; and databases that are inherently distributed. Relational databases are split between multiple hosts by manual sharding, or determining how to partition the datasets based on some properties of the data itself: for example, first names starting with A–K on one server, L–Z on another. A lot of thought goes into designing a sharding and replication strategy that doesn’t impair performance, while keeping the data relatively balanced between servers. There’s a third option, which is essentially a hybrid: databases that are not inherently distributed, but that are designed so they can be partitioned easily. MongoDB is an example of a database that can be sharded easily (or even automatically); HBase, Riak, and Cassandra are all inherently distributed, with options to control how replication and distribution work.

What database choices are viable when you need good interactive response?
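Before turning to that question, the range-based sharding scheme just described—first names A–K on one server, L–Z on another—can be sketched as a simple routing function. This is a hypothetical illustration of the idea, not any particular database's API; the shard names and ranges are invented:

```python
# Hypothetical sketch of manual range-based sharding: route each
# record to a server based on the first letter of a key.

SHARDS = {
    "shard-1": ("A", "K"),  # first names A through K
    "shard-2": ("L", "Z"),  # first names L through Z
}

def route(first_name):
    """Return the shard responsible for this first name."""
    letter = first_name[0].upper()
    for shard, (lo, hi) in SHARDS.items():
        if lo <= letter <= hi:
            return shard
    raise ValueError("no shard covers %r" % first_name)

print(route("Alice"))   # shard-1
print(route("Miguel"))  # shard-2
```

The hard part, as the text notes, isn't the routing function itself—it's choosing ranges that keep the shards balanced as the data grows, and rebalancing when they aren't.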
There are two separate issues: read latency and write latency. For reasonably simple queries on a database with well-designed indexes, almost any modern database can give decent read latency, even at reasonably large scale. Similarly, just about all modern databases claim to be able to keep up with writes at high speed. Most of these databases, including HBase, Cassandra, Riak, and CouchDB, write data immediately to an append-only file, which is an extremely efficient operation. As a result, writes are often significantly faster than reads. Whether any particular database can deliver the performance you need depends on the nature of the application, and whether you’ve designed the application in a way that uses the database efficiently: in particular, the structure of queries, more than the structure of the data itself.

Redis is an in-memory database with extremely fast response, for both read and write operations; but there are a number of tradeoffs. By default, data isn’t saved to disk, and is lost if the system crashes. You can configure Redis for durability, but at the cost of some performance. Redis is also limited in scalability; there’s some replication capability, but support for clusters is still coming. But if you want raw speed, and have a dataset that can fit into memory, Redis is a great choice.

It would be nice if there were some benchmarks to cover database performance in a meaningful sense, but as the saying goes, “there are lies, damned lies, and benchmarks.” In particular, no small benchmark can properly duplicate a real test case for an application that might reasonably involve dozens (or hundreds) of servers.

Changing Data and Cheap Lunches

NoSQL databases are frequently called “schemaless,” because they don’t have the formal schema associated with relational databases.
The lack of a formal schema, which typically has to be designed before any code is written, means that schemaless databases are a better fit for current software development practices, such as agile development. Starting from the simplest thing that could possibly work and iterating quickly in response to customer input doesn’t fit well with designing an all-encompassing data schema at the start of the project. It’s impossible to predict how data will be used, or what additional data you’ll need as the project unfolds. For example, many applications are now annotating their data with geographic information: latitudes and longitudes, addresses. That almost certainly wasn’t part of the initial data design.

How will the data we collect change in the future? Will we be collecting biometric information along with tweets and foursquare checkins? Will music sites such as Last.FM and Spotify incorporate factors like blood pressure into their music selection algorithms? If you think these scenarios are futuristic, think about Twitter. When it started out, it just collected bare-bones information with each tweet: the tweet itself, the Twitter handle, a timestamp, and a few other bits. Over its five-year history, though, lots of metadata has been added: a tweet may be 140 characters at most, but a couple KB is actually sent to the server, and all of this is saved in the database. Up-front schema design is a poor fit in a world where data requirements are fluid.

In addition, modern applications frequently deal with unstructured data: blog posts, web pages, voice transcripts, and other data objects that are essentially text. O’Reilly maintains a substantial database of job listings for some internal research projects. The job descriptions are chunks of text in natural languages. They’re not unstructured because they don’t fit into a schema. You can easily create a JOBDESCRIPTION column in a table, and stuff text strings into it.
It’s that knowing the data type and where it fits in the overall structure doesn’t help. What are the questions you’re likely to ask? Do you want to know about skills, certifications, the employer’s address, the employer’s industry? Those are all valid columns for a table, but you don’t know what you care about in advance; you won’t find equivalent information in each job description; and the only way to get from the text to the data is through various forms of pattern matching and classification. Doing the classification up front, so you could break a job listing down into skills, certifications, etc., is a huge effort that would largely be wasted. The guys who work with this data recently had fits disambiguating “Apple Computer” from “apple orchard”; would you even know this was a problem outside of a concrete research project based on a concrete question? If you’re just pre-populating an INDUSTRY column from raw data, would you notice that lots of computer industry jobs were leaking into fruit farming? A JOBDESCRIPTION column doesn’t hurt, but doesn’t help much either; and going further, by trying to design a schema around the data that you’ll find in the unstructured text, definitely hurts. The kinds of questions you’re likely to ask have everything to do with the data itself, and little to do with that data’s relations to other data.

However, it’s really a mistake to say that NoSQL databases have no schema. In a document database, such as CouchDB or MongoDB, documents are key-value pairs. While you can add documents with differing sets of keys (missing keys or extra keys), or even add keys to documents over time, applications still must know that certain keys are present to query the database; indexes have to be set up to make searches efficient. The same thing applies to column-oriented databases, such as HBase and Cassandra.
While any row may have as many columns as needed, some up-front thought has to go into what columns are needed to organize the data. In most applications, a NoSQL database will require less up-front planning, and offer more flexibility as the application evolves. As we’ll see, data design revolves more around the queries you want to ask than the domain objects that the data represents. It’s not a free lunch; possibly a cheap lunch, but not free.

What kinds of storage models do the more common NoSQL databases support? Redis is a relatively simple key-value store, but with a twist: values can be data structures (lists and sets), not just strings. It supplies operations for working directly with sets and lists (for example, union and intersection).

CouchDB and MongoDB both store documents in JSON format, where JSON is a format originally designed for representing JavaScript objects, but now available in many languages. So on one hand, you can think of CouchDB and MongoDB as object databases; but you could also think of a JSON document as a list of key-value pairs. Any document can contain any set of keys, and any key can be associated with an arbitrarily complex value that is itself a JSON document. CouchDB queries are views, which are themselves documents in the database that specify searches. Views can be very complex, and can use a built-in mapreduce facility to process and summarize results. Similarly, MongoDB queries are JSON documents, specifying fields and values to match, and query results can be processed by a built-in mapreduce. To use either database effectively, you start by designing your views: what do you want to query, and how. Once you do that, it will become clear what keys are needed in your documents.

Riak can also be viewed as a document database, though with more flexibility about document types: it natively handles JSON, XML, and plain text, and a plug-in architecture allows you to add support for other document types.
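The view-first workflow just described for CouchDB and MongoDB can be illustrated with a toy map/reduce pass over documents in plain Python. This is a sketch of the idea only, not either database's actual API (CouchDB views, for instance, are normally written in JavaScript), and the documents are invented:

```python
# Toy sketch of a CouchDB/MongoDB-style map/reduce view, in plain
# Python. A map function emits (key, value) pairs per document; a
# reduce function summarizes the values grouped under each key.
from collections import defaultdict

docs = [
    {"type": "purchase", "customer": "alice", "amount": 30},
    {"type": "purchase", "customer": "bob", "amount": 20},
    {"type": "purchase", "customer": "alice", "amount": 5},
    {"type": "review", "customer": "alice", "stars": 4},  # ignored by this view
]

def map_fn(doc):
    """Emit (customer, amount) for each purchase document."""
    if doc.get("type") == "purchase":
        yield doc["customer"], doc["amount"]

def reduce_fn(values):
    """Summarize all values emitted under one key."""
    return sum(values)

# Group emitted pairs by key, then reduce each group.
groups = defaultdict(list)
for doc in docs:
    for key, value in map_fn(doc):
        groups[key].append(value)

view = {key: reduce_fn(vals) for key, vals in groups.items()}
print(view)  # {'alice': 35, 'bob': 20}
```

Notice that designing the view first ("total spend per customer") is what tells you which keys each document must carry—exactly the point made above.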
Searches “know about” the structure of JSON and XML documents. Like CouchDB, Riak incorporates mapreduce to perform complex queries efficiently.

Cassandra and HBase are usually called column-oriented databases, though a better term is a “sparse row store.” In these databases, the equivalent to a relational “table” is a set of rows, identified by a key. Each row consists of an unlimited number of columns; columns are essentially keys that let you look up values in the row. Columns can be added at any time, and columns that are unused in a given row don’t occupy any storage. NULLs don’t exist. And since columns are stored contiguously, and tend to have similar data, compression can be very efficient, and searches along a column are likewise efficient. HBase describes itself as a database that can store billions of rows with millions of columns.

How do you design a schema for a database like this? As with the document databases, your starting point should be the queries you’ll want to make. There are some radically different possibilities. Consider storing logs from a web server. You may want to look up the IP addresses that accessed each URL you serve. The URLs can be the primary key; each IP address can be a column. This approach will quickly generate thousands of unique columns, but that’s not a problem—and a single query, with no joins, gets you all the IP addresses that accessed a single URL. If some URLs are visited by many addresses, and some are only visited by a few, that’s no problem: remember that NULLs don’t exist. This design isn’t even conceivable in a relational database: you can’t have a table that doesn’t have a fixed number of columns.

Now, let’s make it more complex: you’re writing an ecommerce application, and you’d like to access all the purchases that a given customer has made.
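Before turning to the ecommerce case, the web-log design just described can be sketched with nested dictionaries. This is a hypothetical illustration of the sparse-row idea, not HBase's or Cassandra's actual API; the point is that absent columns simply don't exist, so nothing like a NULL is ever stored:

```python
# Hypothetical sketch of a "sparse row store" for web-server logs:
# each row is keyed by URL, and each visiting IP address becomes a
# column in that row. Columns unused in a row simply don't exist.

rows = {}  # row key (URL) -> {column (IP address) -> value (hit count)}

def record_hit(url, ip):
    row = rows.setdefault(url, {})
    row[ip] = row.get(ip, 0) + 1

record_hit("/index.html", "10.0.0.1")
record_hit("/index.html", "10.0.0.2")
record_hit("/about.html", "10.0.0.1")
record_hit("/index.html", "10.0.0.1")

# A single lookup -- no join -- returns every IP that accessed a URL.
print(sorted(rows["/index.html"]))  # ['10.0.0.1', '10.0.0.2']
# Rows have different "columns"; the sparsity costs nothing to store.
print(len(rows["/about.html"]))     # 1
```

A real sparse row store adds persistence, distribution, and contiguous column storage on top of this shape, but the query-side idea is the same: one row key, one lookup, however many columns that row happens to have.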
The solution is similar: the column family is organized by customer ID (primary key), you have columns for first name, last name, address, and all the normal customer information, plus as many columns as are needed for each purchase. In a relational database, this would probably involve several tables and joins; in the NoSQL databases, it’s a single lookup. Schema design doesn’t go away, but it changes: you think about the queries you’d like to execute, and how you can perform those efficiently.

This isn’t to say that there’s no value to normalization, just that data design starts from a different place. With a relational database, you start with the domain objects, and represent them in a way that guarantees that virtually any query can be expressed. But when you need to optimize performance, you look at the queries you actually perform, then merge tables to create longer rows, and do away with joins wherever possible. With the schemaless databases, whether we’re talking about data structure servers, document databases, or column stores, you go in the other direction: you start with the query, and use that to define your data objects.

The Sacred Cows

The ACID properties (atomicity, consistency, isolation, durability) have been drilled into our heads. But even these come into play as we start thinking seriously about database architecture. When a database is distributed, for instance, it becomes much more difficult to achieve the same kind of consistency or isolation that you can on a single machine. And the problem isn’t just that it’s “difficult” but rather that achieving them ends up in direct conflict with some of the reasons to go distributed. It’s not that properties like these aren’t very important—they certainly are—but today’s software architects are discovering that they require the freedom to choose when it might be worth a compromise.
What about transactions, two-phase commit, and other mechanisms inherited from big iron legacy databases? If you’ve read almost any discussion of concurrent or distributed systems, you’ve heard that banking systems care a lot about consistency: what if you and your spouse withdraw money from the same account at the same time? Could you overdraw the account? That’s what ACID is supposed to prevent. But a few months ago, I was talking to someone who builds banking software, and he said “If you really waited for each transaction to be properly committed on a world-wide network of ATMs, transactions would take so long to complete that customers would walk away in frustration. What happens if you and your spouse withdraw money at the same time and overdraw the account? You both get the money; we fix it up later.”

This isn’t to say that bankers have discarded transactions, two-phase commit, and other database techniques; they’re just smarter about it. In particular, they’re distinguishing between local consistency and absolutely global consistency. Gregor Hohpe’s classic article Starbucks Does Not Use Two-Phase Commit makes a similar point: in an asynchronous world, we have many strategies for dealing with transactional errors, including write-offs. None of these strategies are anything like two-phase commit; they don’t force the world into inflexible, serialized patterns.

The CAP theorem is more than a sacred cow; it’s a law of the database universe that can be expressed as “Consistency, Availability, Partition Tolerance: choose two.” But let’s rethink relational databases in light of this theorem. Databases have stressed consistency. The CAP theorem is really about distributed systems, and as we’ve seen, relational databases were developed when distributed systems were rare and exotic at best. If you needed more power, you bought a bigger mainframe.
Availability isn’t an issue on a single server: if it’s up, it’s up, if it’s down, it’s down. And partition tolerance is meaningless when there’s nothing to partition. As we saw at the beginning of this article, distributed systems are a given for modern applications; you won’t be able to scale to the size and performance you need on a single box. So the CAP theorem is historically irrelevant to relational databases: they’re good at providing consistency, and they have been adapted to provide high availability with some success, but they are hard to partition without extreme effort or extreme cost.

Since partition tolerance is a fundamental requirement for distributed applications, it becomes a question of what to sacrifice: consistency or availability. There have been two approaches: Riak and Cassandra stress availability, while HBase has stressed consistency. With Cassandra and Riak, the tradeoff between consistency and availability is tunable. CouchDB and MongoDB are essentially single-headed databases, and from that standpoint, availability is a function of how long you can keep the hardware running. However, both have add-ons that can be used to build clusters. In a cluster, CouchDB and MongoDB are eventually consistent (like Riak and Cassandra); availability depends on what you do with the tools they provide. You need to set up sharding and replication, and use what’s essentially a proxy server to present a single interface to the cluster’s clients. BigCouch is an interesting effort to integrate clustering into CouchDB, making it more like Riak. Now that Cloudant has announced that it is merging BigCouch and CouchDB, we can expect to see clustering become part of the CouchDB core.

We’ve seen that absolute consistency isn’t a hard requirement for banks, nor is it the way we behave in our real-world interactions. Should we expect it of our software? Or do we care more about availability?
It depends; the consistency requirements of many social applications are very soft. You don’t need to get the correct number of Twitter or Facebook followers every time you log in. If you search, you probably don’t care if the results don’t contain the comments that were posted a few seconds ago. And if you’re willing to accept less-than-perfect consistency, you can make huge improvements in performance. In the world of big-data-backed web applications, with databases spread across hundreds (or potentially thousands) of nodes, the performance penalty of locking down a database while you add or modify a row is huge; if your application has frequent writes, you’re effectively serializing all the writes and losing the advantage of the distributed database. In practice, in an “eventually consistent” database, changes typically propagate to the nodes in tenths of a second; we’re not talking minutes or hours before the database arrives in a consistent state.

Given that we have all been battered with talk about “five nines” reliability, and given that it is a big problem for any significant site to be down, it seems clear that we should prioritize availability over consistency, right? The architectural decision isn’t so easy, though. There are many applications in which inconsistency must eventually be dealt with. If consistency isn’t guaranteed by the database, it becomes a problem that the application has to manage. When you choose availability over consistency, you’re potentially making your application more complex. With proper replication and failover strategies, a database designed for consistency (such as HBase) can probably deliver the availability you require; but this is another design tradeoff. Regardless of the database you’re using, more stringent reliability requirements will drive you towards exotic engineering.
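Dynamo-style stores such as Riak and Cassandra expose this tunable tradeoff through quorum settings: each key is stored on N replicas, a write is acknowledged by W of them, and a read consults R. If R + W > N, every read quorum overlaps every write quorum, so a read is guaranteed to see the latest acknowledged write; smaller R or W favors availability and latency instead. A small sketch of the arithmetic (parameter names follow the usual Dynamo convention):

```python
# Quorum arithmetic for a Dynamo-style replicated store.
# n = replicas per key, w = write acknowledgements required,
# r = replicas consulted per read. If r + w > n, the read and
# write quorums must overlap, so a read sees the latest write.

def quorums_overlap(n, r, w):
    return r + w > n

# Classic "strong" setting: N=3, W=2, R=2 -> overlap guaranteed.
print(quorums_overlap(3, 2, 2))  # True
# Favoring availability and latency: N=3, W=1, R=1 -> stale reads possible.
print(quorums_overlap(3, 1, 1))  # False
```

The same database can run both configurations, sometimes chosen per request, which is what "the tradeoff between consistency and availability is tunable" means in practice.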
Only you can decide the right balance for your application; the point isn’t that any given decision is right or wrong, but that you can (and have to) choose, and that’s a good thing.

Other features

I’ve completed a survey of the major tradeoffs you need to think about in selecting a database for a modern big data application. But the major tradeoffs aren’t the only story. There are many database projects with interesting features. Here are some of the ideas and projects I find most interesting:

• Scripting: Relational databases all come with some variation of the SQL language, which can be seen as a scripting language for data. In the non-relational world, a number of scripting languages are available. CouchDB and Riak support JavaScript, as does MongoDB. The Hadoop project has spawned several data scripting languages that are usable with HBase, including Pig and Hive. The Redis project is experimenting with integrating the Lua scripting language.

• RESTful interfaces: CouchDB and Riak are unique in offering RESTful interfaces: interfaces based on HTTP and the architectural style elaborated in Roy Fielding’s doctoral dissertation and Restful Web Services. CouchDB goes so far as to serve as a web application framework. Riak also offers a more traditional protocol buffer interface, which is a better fit if you expect a high volume of small requests.

• Graphs: Neo4J is a special purpose database designed for maintaining large graphs: data where the data items are nodes, with edges representing the connections between the nodes. Because graphs are extremely flexible data structures, a graph database can emulate any other kind of database.

• SQL: I’ve been discussing the NoSQL movement, but SQL is a familiar language, and is always just around the corner.
A couple of startups are working on adding SQL to Hadoop-based datastores: DrawnToScale (which focuses on low-latency, high-volume web applications) and Hadapt (which focuses on analytics and bringing data warehousing into the 20-teens). In a few years, will we be looking at hybrid databases that take advantage of both relational and non-relational models? Quite possibly.

• Scientific data: Yet another direction comes from SciDB, a database project aimed at the largest scientific applications (particularly the Large Synoptic Survey Telescope). The storage model is based on multi-dimensional arrays. It is designed to scale to hundreds of petabytes of storage, collecting tens of terabytes per night. It’s still in the relatively early stages.

• Hybrid architectures: NoSQL is really about architectural choice. And perhaps the biggest expression of architectural choice is a hybrid architecture: rather than using a single database technology, mixing and matching technologies to play to their strengths. I’ve seen a number of applications that use traditional relational databases for the portion of the data for which the relational model works well, and a non-relational database for the rest. For example, customer data could go into a relational database, linked to a non-relational database for unstructured data such as product reviews and recommendations. It’s all about flexibility. A hybrid architecture may be the best way to integrate “social” features into more traditional ecommerce sites.

These are only a few of the interesting ideas and projects that are floating around out there. Roughly a year ago, I counted a couple dozen non-relational database projects; I’m sure there are several times that number today.
In the End

In a conversation with Eben Hewitt, author of Cassandra: The Definitive Guide, Eben summarized what you need to think about when architecting the back end of a data-driven system. They’re the same issues software architects have been dealing with for years: you need to think about the whole ecosystem in which the application works; you need to consider your goals (do you require high availability? fault tolerance?); you need to consider support options; you need to isolate what will change over the life of the application, and separate that from what remains the same. The big difference is that now there are options; you don’t have to choose the relational model. There are other options for building large databases that scale horizontally, are highly available, and can deliver great performance to users. And these options, the databases that make up the NoSQL movement, can often achieve these goals with greater flexibility and lower cost.

It used to be said that nobody got fired for buying IBM; then nobody got fired for buying Microsoft; now, I suppose, nobody gets fired for buying Oracle. But just as the landscape changed for IBM and Microsoft, it’s shifting again, and even Oracle has a NoSQL solution. Rather than relational databases being the default, we’re moving into a world where developers are considering their architectural options, and deciding which products fit their application: how the databases fit into their programming model, whether they can scale in ways that make sense for the application, whether they have strong or relatively weak consistency requirements.

For years, the relational default has kept developers from understanding their real back-end requirements. The NoSQL movement has given us the opportunity to explore what we really require from our databases, and to find out what we already knew: there is no one-size-fits-all solution.
Why Visualization Matters

Julie Steele

A Picture Is Worth 1000 Rows

Let’s say you need to understand thousands or even millions of rows of data, and you have a short time to do it in. The data may come from your team, in which case perhaps you’re already familiar with what it’s measuring and what the results are likely to be. Or it may come from another team, or maybe several teams at once, and be completely unfamiliar. Either way, the reason you’re looking at it is that you have a decision to make, and you want to be informed by the data before making it. Something probably hangs in the balance: a customer, a product, or a profit.

How are you going to make sense of all that information efficiently so you can make a good decision? Data visualization is an important answer to that question. However, not all visualizations are actually that helpful. You may be all too familiar with lifeless bar graphs, or line graphs made with software defaults and couched in a slideshow presentation or lengthy document. They can be at best confusing, and at worst misleading. But the good ones are an absolute revelation.

The best data visualizations are ones that expose something new about the underlying patterns and relationships contained within the data. Understanding those relationships—and so being able to observe them—is key to good decision-making. The Periodic Table is a classic testament to the potential of visualization to reveal hidden relationships in even small data sets. One look at the table, and chemists and middle school students alike grasp the way atoms arrange themselves in groups: alkali metals, noble gases, halogens. If visualization done right can reveal so much in even a small data set like this, imagine what it can reveal within terabytes or petabytes of information.

Types of Visualization

It’s important to point out that not all data visualization is created equal.
Just as we have paints and pencils and chalk and film to help us capture the world in different ways, with different emphases and for different purposes, there are multiple ways in which to depict the same data set. Or, to put it another way, think of visualization as a new set of languages you can use to communicate. Just as French and Russian and Japanese are all ways of encoding ideas so that those ideas can be transported from one person’s mind to another, and decoded again—and just as certain languages are more conducive to certain ideas—so the various kinds of data visualization are a kind of bidirectional encoding that lets ideas and information be transported from the database into your brain.

Explaining and exploring

An important distinction lies between visualization for exploring and visualization for explaining. A third category, visual art, comprises images that encode data but cannot easily be decoded back to the original meaning by a viewer. This kind of visualization can be beautiful, but is not helpful in making decisions.

Visualization for exploring can be imprecise. It’s useful when you’re not exactly sure what the data has to tell you, and you’re trying to get a sense of the relationships and patterns contained within it for the first time. It may take a while to figure out how to approach or clean the data, and which dimensions to include. Therefore, visualization for exploring is best done in such a way that it can be iterated quickly and experimented upon, so that you can find the signal within the noise. Software and automation are your friends here.

Visualization for explaining is best when it is cleanest. Here, the ability to pare down the information to its simplest form—to strip away the noise entirely—will increase the efficiency with which a decision-maker can understand it.
This is the approach to take once you understand what the data is telling you, and you want to communicate that to someone else. This is the kind of visualization you should be finding in those presentations and sales reports.

Visualization for explaining also includes infographics and other categories of hand-drawn or custom-made images. Automated tools can be used, but one size does not fit all.

Your Customers Make Decisions, Too

While data visualization is a powerful tool for helping you and others within your organization make better decisions, it’s important to remember that, in the meantime, your customers are trying to decide between you and your competitors. Many kinds of data visualization, from complex interactive or animated graphs to brightly colored infographics, can help your customers explore and your customer service folks explain.

That’s why all kinds of companies and organizations, from GE to Trulia to NASA, are beginning to invest significant resources in providing interactive visualizations to their customers and the public. This allows viewers to better understand the company’s business, and to interact in a self-directed manner with the company’s expertise.

As Big Data becomes bigger, and more companies deal with complex data sets with dozens of variables, data visualization will become even more important. So far, the tide of popularity has risen more quickly than the tide of visual literacy, and mediocre efforts abound, in presentations and on the web. But as visual literacy rises, thanks in no small part to impressive efforts in major media such as The New York Times and The Guardian, data visualization will increasingly become a language your customers and collaborators expect you to speak—and speak well.

Do Yourself a Favor and Hire a Designer

It’s well worth investing in a talented in-house designer, or a team of designers.
Visualization for explaining works best when someone who understands not only the data itself, but also the principles of design and visual communication, tailors the graph or chart to the message.

To go back to the language analogy: Google Translate is a powerful and useful tool for giving you the general idea of what a foreign text says. But it’s not perfect, and it often lacks nuance. For getting the overall gist of things, it’s great. But I wouldn’t use it to send a letter to a foreign ambassador. For something so sensitive, and where precision counts, it’s worth hiring an experienced human translator. In the same way, because data visualization is a kind of language, hire an experienced designer for important jobs where precision matters. If you’re making the kinds of decisions in which your customer, product, or profit hangs in the balance, you can’t afford to base those decisions on incomplete or misleading representations of the knowledge your company holds. Your designer is your translator, and one of the most important links you and your customers have to your data.

Julie Steele is an editor at O’Reilly Media interested in connecting people and ideas. She finds beauty in discovering new ways to understand complex systems, and so enjoys topics related to gathering, storing, analyzing, and visualizing data. She holds a Master’s degree in Political Science (International Relations) from Rutgers University.

The Future of Big Data

Edd Dumbill

2011 was the “coming out” year for data science and big data. As the field matures in 2012, what can we expect over the course of the year?

More Powerful and Expressive Tools for Analysis

This year has seen consolidation and engineering around improving the basic storage and data processing engines of NoSQL and Hadoop.
That will doubtless continue, as we see the unruly menagerie of the Hadoop universe increasingly packaged into distributions, appliances, and on-demand cloud services. Hopefully it won’t be long before that’s dull, yet necessary, infrastructure.

Looking up the stack, there’s already an early cohort of tools directed at programmers and data scientists (Karmasphere, Datameer), as well as Hadoop connectors for established analytical tools such as Tableau and R. But there’s a way to go in making big data more powerful: that is, to decrease the cost of creating experiments. Here are two ways in which big data can be made more powerful.

1. Better programming language support. As we consider data, rather than business logic, as the primary entity in a program, we must create or rediscover idioms that let us focus on the data, rather than on abstractions leaking up from the underlying Hadoop machinery. In other words: write shorter programs that make it clear what we’re doing with the data. These abstractions will in turn lend themselves to the creation of better tools for non-programmers.

2. Better support for interactivity. If Hadoop has any weakness, it’s in the batch-oriented nature of computation it fosters. The agile nature of data science will favor any tool that permits more interactivity.

Streaming Data Processing

Hadoop’s batch-oriented processing is sufficient for many use cases, especially where the frequency of data reporting doesn’t need to be up-to-the-minute. However, batch processing isn’t always adequate, particularly when serving online needs such as mobile and web clients, or markets with real-time changing conditions such as finance and advertising.

Over the next few years we’ll see the adoption of scalable frameworks and platforms for handling streaming, or near real-time, analysis and processing.
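The contrast with batch processing can be sketched in a few lines of plain Python: a streaming aggregator keeps only a running summary and can discard each event after seeing it. The readings below are invented for illustration, and this shows the general technique rather than any particular framework’s API:

```python
def streaming_stats():
    """Keep a running count/mean/max over an unbounded stream.

    Coroutine-style: send events in, read the current summary back.
    Nothing is retained except the summary itself.
    """
    count, mean, peak = 0, 0.0, float("-inf")
    summary = None
    while True:
        value = yield summary
        count += 1
        mean += (value - mean) / count    # incremental (running) mean
        peak = max(peak, value)
        summary = {"count": count, "mean": mean, "max": peak}

agg = streaming_stats()
next(agg)                                  # prime the coroutine
for reading in [3.0, 5.0, 10.0, 2.0]:      # stand-in sensor readings
    latest = agg.send(reading)
print(latest)                              # {'count': 4, 'mean': 5.0, 'max': 10.0}
```

Each event updates the summary in constant space and is then eligible to be thrown away: no store-compute loop. Real streaming platforms add distribution, fault tolerance, and topology wiring around exactly this kind of per-event update.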
In the same way that Hadoop was born out of large-scale web applications, these platforms will be driven by the needs of large-scale location-aware mobile, social, and sensor use.

For some applications, there just isn’t enough storage in the world to store every piece of data your business might receive: at some point you need to make a decision to throw things away. Having streaming computation abilities enables you to analyze data, or make decisions about discarding it, without having to go through the store-compute loop of map/reduce. Emerging contenders in the real-time framework category include Storm, from Twitter, and S4, from Yahoo.

Rise of Data Marketplaces

Your own data can become that much more potent when mixed with other datasets. For instance, add weather conditions to your customer data, and discover whether there are weather-related patterns in your customers’ purchasing. Acquiring these datasets can be a pain, especially if you want to do it outside of the IT department, and with some exactness. The value of data marketplaces is in providing a directory to this data, as well as streamlined, standardized methods of delivering it. Microsoft’s direction of integrating its Azure marketplace right into analytical tools foreshadows the coming convenience of access to data.

Development of Data Science Workflows and Tools

As data science teams become a recognized part of companies, we’ll see a more regularized expectation of their roles and processes. One of the driving attributes of a successful data science team is its level of integration into a company’s business operations, as opposed to being a sidecar analysis team.

Software developers already have a wealth of infrastructure that is both logistical and social, including wikis and source control, along with tools that expose their process and requirements to business owners.
Integrated data science teams will need their own versions of these tools to collaborate effectively. One example of this is EMC Greenplum’s Chorus, which provides a social software platform for data science. In turn, use of these tools will support the emergence of data science process within organizations.

Data science teams will start to evolve repeatable processes, hopefully agile ones. They could do worse than to look at the ground-breaking work newspaper data teams are doing at news organizations such as The Guardian and The New York Times: given short timescales, these teams take data from raw form to a finished product, working hand-in-hand with journalists.

Increased Understanding of and Demand for Visualization

Visualization fulfills two purposes in a data workflow: explanation and exploration. While business people might think of a visualization as the end result, data scientists also use visualization as a way of looking for questions to ask and discovering new features of a dataset. If becoming a data-driven organization is about fostering a better feel for data among all employees, visualization plays a vital role in delivering data manipulation abilities to those without direct programming or statistical skills.

Throughout a year dominated by businesses’ constant demand for data scientists, I’ve repeatedly heard from data scientists about what they want most: people who know how to create visualizations.

Recommended Reading

• Transforming Business
• Disruptive Possibilities
• Data Analysis with Open Source Tools
• Strata Conference Santa Clara 2013: Complete Video Compilation
• Data Virtualization for Business Intelligence Systems
• Business Intelligence, 2nd Edition








