Applied Data Mining: Statistical Methods for Business and Industry


Applied Data Mining: Statistical Methods for Business and Industry

PAOLO GIUDICI
Faculty of Economics, University of Pavia, Italy

Copyright © 2003 John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex PO19 8SQ, England
ISBN 0-470-84678-X (Cloth); ISBN 0-470-84679-8 (Paper)

Contents

Preface

1 Introduction
  1.1 What is data mining?
    1.1.1 Data mining and computing
    1.1.2 Data mining and statistics
  1.2 The data mining process
  1.3 Software for data mining
  1.4 Organisation of the book
    1.4.1 Chapters 2 to 6: methodology
    1.4.2 Chapters 7 to 12: business cases
  1.5 Further reading

Part I Methodology

2 Organisation of the data
  2.1 From the data warehouse to the data marts
    2.1.1 The data warehouse
    2.1.2 The data webhouse
    2.1.3 Data marts
  2.2 Classification of the data
  2.3 The data matrix
    2.3.1 Binarisation of the data matrix
  2.4 Frequency distributions
    2.4.1 Univariate distributions
    2.4.2 Multivariate distributions
  2.5 Transformation of the data
  2.6 Other data structures
  2.7 Further reading

3 Exploratory data analysis
  3.1 Univariate exploratory analysis
    3.1.1 Measures of location
    3.1.2 Measures of variability
    3.1.3 Measures of heterogeneity
    3.1.4 Measures of concentration
    3.1.5 Measures of asymmetry
    3.1.6 Measures of kurtosis
  3.2 Bivariate exploratory analysis
  3.3 Multivariate exploratory analysis of quantitative data
  3.4 Multivariate exploratory analysis of qualitative data
    3.4.1 Independence and association
    3.4.2 Distance measures
    3.4.3 Dependency measures
    3.4.4 Model-based measures
  3.5 Reduction of dimensionality
    3.5.1 Interpretation of the principal components
    3.5.2 Application of the principal components
  3.6 Further reading

4 Computational data mining
  4.1 Measures of distance
    4.1.1 Euclidean distance
    4.1.2 Similarity measures
    4.1.3 Multidimensional scaling
  4.2 Cluster analysis
    4.2.1 Hierarchical methods
    4.2.2 Evaluation of hierarchical methods
    4.2.3 Non-hierarchical methods
  4.3 Linear regression
    4.3.1 Bivariate linear regression
    4.3.2 Properties of the residuals
    4.3.3 Goodness of fit
    4.3.4 Multiple linear regression
  4.4 Logistic regression
    4.4.1 Interpretation of logistic regression
    4.4.2 Discriminant analysis
  4.5 Tree models
    4.5.1 Division criteria
    4.5.2 Pruning
  4.6 Neural networks
    4.6.1 Architecture of a neural network
    4.6.2 The multilayer perceptron
    4.6.3 Kohonen networks
  4.7 Nearest-neighbour models
  4.8 Local models
    4.8.1 Association rules
    4.8.2 Retrieval by content
  4.9 Further reading

5 Statistical data mining
  5.1 Uncertainty measures and inference
    5.1.1 Probability
    5.1.2 Statistical models
    5.1.3 Statistical inference
  5.2 Non-parametric modelling
  5.3 The normal linear model
    5.3.1 Main inferential results
    5.3.2 Application
  5.4 Generalised linear models
    5.4.1 The exponential family
    5.4.2 Definition of generalised linear models
    5.4.3 The logistic regression model
    5.4.4 Application
  5.5 Log-linear models
    5.5.1 Construction of a log-linear model
    5.5.2 Interpretation of a log-linear model
    5.5.3 Graphical log-linear models
    5.5.4 Log-linear model comparison
    5.5.5 Application
  5.6 Graphical models
    5.6.1 Symmetric graphical models
    5.6.2 Recursive graphical models
    5.6.3 Graphical models versus neural networks
  5.7 Further reading

6 Evaluation of data mining methods
  6.1 Criteria based on statistical tests
    6.1.1 Distance between statistical models
    6.1.2 Discrepancy of a statistical model
    6.1.3 The Kullback–Leibler discrepancy
  6.2 Criteria based on scoring functions
  6.3 Bayesian criteria
  6.4 Computational criteria
  6.5 Criteria based on loss functions
  6.6 Further reading
Part II Business cases

7 Market basket analysis
  7.1 Objectives of the analysis
  7.2 Description of the data
  7.3 Exploratory data analysis
  7.4 Model building
    7.4.1 Log-linear models
    7.4.2 Association rules
  7.5 Model comparison
  7.6 Summary report

8 Web clickstream analysis
  8.1 Objectives of the analysis
  8.2 Description of the data
  8.3 Exploratory data analysis
  8.4 Model building
    8.4.1 Sequence rules
    8.4.2 Link analysis
    8.4.3 Probabilistic expert systems
    8.4.4 Markov chains
  8.5 Model comparison
  8.6 Summary report

9 Profiling website visitors
  9.1 Objectives of the analysis
  9.2 Description of the data
  9.3 Exploratory analysis
  9.4 Model building
    9.4.1 Cluster analysis
    9.4.2 Kohonen maps
  9.5 Model comparison
  9.6 Summary report

10 Customer relationship management
  10.1 Objectives of the analysis
  10.2 Description of the data
  10.3 Exploratory data analysis
  10.4 Model building
    10.4.1 Logistic regression models
    10.4.2 Radial basis function networks
    10.4.3 Classification tree models
    10.4.4 Nearest-neighbour models
  10.5 Model comparison
  10.6 Summary report

11 Credit scoring
  11.1 Objectives of the analysis
  11.2 Description of the data
  11.3 Exploratory data analysis
  11.4 Model building
    11.4.1 Logistic regression models
    11.4.2 Classification tree models
    11.4.3 Multilayer perceptron models
  11.5 Model comparison
  11.6 Summary report

12 Forecasting television audience
  12.1 Objectives of the analysis
  12.2 Description of the data
  12.3 Exploratory data analysis
  12.4 Model building
  12.5 Model comparison
  12.6 Summary report

Bibliography
Index

Preface

The increasing availability of data in the current information society has led to the need for valid tools for its modelling and analysis. Data mining and applied statistical methods are the appropriate tools to extract knowledge from such data. Data mining can be defined as the process of selection, exploration and modelling of large databases in order to discover models and patterns that are unknown a priori. It differs from applied statistics mainly in terms of its scope; whereas applied statistics concerns the application of statistical methods to the data at hand, data mining is a whole process of data extraction and analysis aimed at the production of decision rules for specified business goals. In other words, data mining is a business intelligence process.

Although data mining is a very important and growing topic, there is insufficient coverage of it in the literature, especially from a statistical viewpoint. Most of the available books on data mining are either too technical and computer science oriented or too applied and marketing driven. This book aims to establish a bridge between data mining methods and applications in the fields of business and industry by adopting a coherent and rigorous approach to statistical modelling. Not only does it describe the methods employed in data mining, typically coming from the fields of machine learning and statistics, but it describes them in relation to the business goals that have to be achieved, hence the word ‘applied’ in the title. The second part of the book is a set of case studies that compare the methods of the first part in terms of their performance and usability.
The first part gives a broad coverage of all methods currently used for data mining and puts them into a functional framework. Methods are classified as being essentially computational (e.g. association rules, decision trees and neural networks) or statistical (e.g. regression models, generalised linear models and graphical models). Furthermore, each method is classified in terms of the business intelligence goals it can achieve, such as discovery of local patterns, classification and prediction.

The book is primarily aimed at advanced undergraduate and graduate students of business management, computer science and statistics. The case studies give guidance to professionals working in industry on projects involving large volumes of data, such as in customer relationship management, web analysis, risk management and, more broadly, marketing and finance. No unnecessary formalisms and mathematical tools are introduced. Those who wish to know more should consult the bibliography; specific pointers are given at the end of Chapters 2 to 6.

The book is the result of a learning process that began in 1989, when I was a graduate student of statistics at the University of Minnesota. Since then my research activity has always been focused on the interplay between computational and multivariate statistics. In 1998 I began building a group of data mining statisticians and it has evolved into a data mining laboratory at the University of Pavia. There I have had many opportunities to interact with and learn from industry experts and my own students working on data mining projects and doing internships within the industry. Although it is not possible to name them all, I thank them and hope they recognise their contribution in the book. A special mention goes to the University of Pavia, in particular to the Faculty of Business and Economics, where I have been working since 1993. It is a very stimulating and open environment for research and teaching.

I acknowledge Wiley for having proposed and encouraged this effort, in particular the statistics and mathematics editor and assistant editor, Sian Jones and Rob Calver. I also thank Greg Ridgeway, who revised the final manuscript and suggested several improvements. Finally, the most important acknowledgement goes to my wife, Angela, who has constantly encouraged the development of my research in this field. The book is dedicated to her and to my son Tommaso, born on 24 May 2002, when I was revising the manuscript.

I hope people will enjoy reading the book and eventually use it in their work. I will be very pleased to receive comments at giudici@unipv.it. I will consider any suggestions for a subsequent edition.

Paolo Giudici
Pavia, 28 January 2003

CHAPTER 1 Introduction

Nowadays each individual and organisation – business, family or institution – can access a large quantity of data and information about itself and its environment. This data has the potential to predict the evolution of interesting variables or trends in the outside environment, but so far that potential has not been fully exploited. This is particularly true in the business field, the subject of this book. There are two main problems. First, information is scattered within different archive systems that are not connected with one another, producing an inefficient organisation of the data. Second, there is a lack of awareness about statistical tools and their potential for information elaboration. This interferes with the production of efficient and relevant data synthesis.
Two developments could help to overcome these problems. First, software and hardware continually offer more power at lower cost, allowing organisations to collect and organise data in structures that give easier access and transfer. Second, methodological research, particularly in the field of computing and statistics, has recently led to the development of flexible and scalable procedures that can be used to analyse large data stores. These two developments have meant that data mining is rapidly spreading through many businesses as an important intelligence tool for backing up decisions.

This chapter introduces the ideas behind data mining. It defines data mining and compares it with related topics in statistics and computer science. It describes the process of data mining and gives a brief introduction to data mining software. The last part of the chapter outlines the organisation of the book and suggests some further reading.

1.1 What is data mining?

To understand the term ‘data mining’ it is useful to look at the literal translation of the word: to mine in English means to extract. The verb usually refers to mining operations that extract from the Earth her hidden, precious resources. The association of this word with data suggests an in-depth search to find additional information which previously went unnoticed in the mass of data available. From the viewpoint of scientific research, data mining is a relatively new discipline that has developed mainly from studies carried out in other disciplines such as computing, marketing, and statistics. Many of the methodologies used in data mining come from two branches of research, one developed in the machine learning community and the other developed in the statistical community, particularly in multivariate and computational statistics.

Machine learning is connected to computer science and artificial intelligence and is concerned with finding relations and regularities in data that can be translated into general truths. The aim of machine learning is the reproduction of the data-generating process, allowing analysts to generalise from the observed data to new, unobserved cases. Rosenblatt (1962) introduced the first machine learning model, called the perceptron. Following on from this, neural networks developed in the second half of the 1980s. During the same period, some researchers perfected the theory of decision trees, used mainly for dealing with problems of classification. Statistics has always been about creating models for analysing data, and now there is the possibility of using computers to do it. From the second half of the 1980s, given the increasing importance of computational methods as the basis for statistical analysis, there was also a parallel development of statistical methods to analyse real multivariate applications. In the 1990s statisticians began showing interest in machine learning methods as well, which led to important developments in methodology.

Towards the end of the 1980s machine learning methods started to be used beyond the fields of computing and artificial intelligence. In particular, they were used in database marketing applications where the available databases were used for elaborate and specific marketing campaigns.
The term knowledge discovery in databases (KDD) was coined to describe all those methods that aimed to find relations and regularities among the observed data. Gradually the term KDD was expanded to describe the whole process of extrapolating information from a database, from the identification of the initial business aims to the application of the decision rules. The term ‘data mining’ was used to describe the component of the KDD process where the learning algorithms were applied to the data. This terminology was first formally put forward by Usama Fayyad at the First International Conference on Knowledge Discovery and Data Mining, held in Montreal in 1995 and still considered one of the main conferences on this topic. It was used to refer to a set of integrated analytical techniques divided into several phases with the aim of extrapolating previously unknown knowledge from massive sets of observed data that do not appear to have any obvious regularity or important relationships. As the term ‘data mining’ slowly established itself, it became a synonym for the whole process of extrapolating knowledge. This is the meaning we shall use in this text.

The previous definition omits one important aspect – the ultimate aim of data mining. In data mining the aim is to obtain results that can be measured in terms of their relevance for the owner of the database – business advantage. Here is a more complete definition of data mining:

Data mining is the process of selection, exploration, and modelling of large quantities of data to discover regularities or relations that are at first unknown with the aim of obtaining clear and useful results for the owner of the database.

In a business context the utility of the result becomes a business result in itself. Therefore what distinguishes data mining from statistical analysis is not so much the amount of data we analyse or the methods we use but that we integrate what we know about the database, the means of analysis and the business knowledge. To apply a data mining methodology means following an integrated methodological process that involves translating the business needs into a problem which has to be analysed, retrieving the database needed to carry out the analysis, and applying a statistical technique implemented in a computer algorithm with the final aim of achieving important results useful for taking a strategic decision. The strategic decision will itself create new measurement needs and consequently new business needs, setting off what has been called ‘the virtuous circle of knowledge’ induced by data mining (Berry and Linoff, 1997). Data mining is not just about the use of a computer algorithm or a statistical technique; it is a process of business intelligence that can be used together with what is provided by information technology to support company decisions.

1.1.1 Data mining and computing

The emergence of data mining is closely connected to developments in computer technology, particularly the evolution and organisation of databases, which have recently made great leaps forward. I am now going to clarify a few terms. Query and reporting tools are simple and very quick to use; they help us explore business data at various levels. Query tools retrieve the information and reporting tools present it clearly. They allow the results of analyses to be transmitted across a client-server network, intranet or even on the internet. The networks allow sharing, so that the data can be analysed by the most suitable platform.
This makes it possible to exploit the analytical potential of remote servers and receive an analysis report on local PCs. A client-server network must be flexible enough to satisfy all types of remote requests, from a simple reordering of data to ad hoc queries using Structured Query Language (SQL) for extracting and summarising data in the database.

Data retrieval, like data mining, extracts interesting data and information from archives and databases. The difference is that, unlike data mining, the criteria for extracting information are decided beforehand, so they are exogenous to the extraction itself. A classic example is a request from the marketing department of a company to retrieve all the personal details of clients who have bought product A and product B at least once, in that order. This request may be based on the idea that there is some connection between having bought A and B together at least once, but without any empirical evidence. The names obtained from this exploration could then be the targets of the next publicity campaign. In this way the success percentage (i.e. the customers who will actually buy the products advertised compared to the total customers contacted) will definitely be much higher than otherwise. Once again, without a preliminary statistical analysis of the data, it is difficult to predict the success percentage, and it is impossible to establish whether having better information about the customers’ characteristics would give improved results with a smaller campaign effort. Data mining is different from data retrieval because it looks for relations and associations between phenomena that are not known beforehand. It also allows the effectiveness of a decision to be judged on the objective data available, which allows a rational evaluation to be made.

Do not confuse data mining with the methods used to create multidimensional reporting tools, e.g. online analytical processing (OLAP). OLAP is usually a graphical instrument used to highlight relations between the variables available, following the logic of a two-dimensional report. Unlike OLAP, data mining brings together all the variables available and combines them in different ways. It also means we can go beyond the visual representation of the summaries in OLAP applications, creating useful models for the business world. Data mining is not just about analysing data; it is a much more complex process where data analysis is just one of the aspects.

OLAP is an important tool for business intelligence. The query and reporting tools describe what a database contains (in the widest sense this includes the data warehouse), but OLAP is used to explain why certain relations exist. The user makes his own hypotheses about the possible relations between the variables and looks for confirmation of his opinion by observing the data. Suppose he wants to find out why some debts are not paid back; first he might suppose that people with a low income and lots of debts are high-risk categories. To check this hypothesis, OLAP gives him a graphical representation (called a multidimensional hypercube) of the empirical relation between the income, debt and insolvency variables. An analysis of the graph can confirm his hypothesis. Therefore OLAP also allows the user to extract information that is useful for business databases. Unlike data mining, however, the research hypotheses are suggested by the user and are not uncovered from the data.
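To make the contrast concrete, here is a minimal sketch in Python with pandas (the book itself works with SAS tools, so this is only an illustration; the purchases and customers tables and their column names are hypothetical). The first block mimics data retrieval, where the criterion (customers who bought product A and then product B) is fixed in advance; the second mimics an OLAP-style summary, a two-way slice of the income, debt and insolvency hypercube just described.

```python
import pandas as pd

# Hypothetical transaction data: one row per purchase.
purchases = pd.DataFrame({
    "customer": ["c1", "c1", "c2", "c2", "c3", "c3"],
    "product":  ["A",  "B",  "B",  "A",  "A",  "B"],
    "date": pd.to_datetime(
        ["2002-01-05", "2002-02-10", "2002-01-07",
         "2002-03-01", "2002-02-15", "2002-02-20"]),
})

# Data retrieval: the extraction criterion (bought A, then B) is decided beforehand.
first_a = purchases[purchases["product"] == "A"].groupby("customer")["date"].min()
first_b = purchases[purchases["product"] == "B"].groupby("customer")["date"].min()
both = first_a.index.intersection(first_b.index)
a_then_b = [c for c in both if first_a[c] < first_b[c]]
print("Customers who bought A and then B:", a_then_b)

# OLAP-style summary: a cross-tabulation of income band and debt level
# against the insolvency rate, i.e. one view of the hypercube.
customers = pd.DataFrame({
    "customer":    ["c1", "c2", "c3", "c4", "c5", "c6"],
    "income_band": ["low", "low", "high", "high", "low", "high"],
    "debt_level":  ["high", "low", "high", "low", "high", "low"],
    "insolvent":   [1, 0, 0, 0, 1, 0],
})
cube = pd.pivot_table(customers, values="insolvent",
                      index="income_band", columns="debt_level",
                      aggfunc="mean")
print(cube)  # proportion of insolvent customers in each cell
```

In both blocks the question is fixed by the analyst in advance; data mining, by contrast, would search the same tables for relations that were not specified beforehand.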
Furthermore, in OLAP the extrapolation is a purely computerised procedure; no use is made of modelling tools or summaries provided by the statistical methodology. OLAP can provide useful information for databases with a small number of variables, but problems arise when there are tens or hundreds of variables. Then it becomes increasingly difficult and time-consuming to find a good hypothesis and analyse the database with OLAP tools to confirm or deny it.

OLAP is not a substitute for data mining; the two techniques are complementary and used together they can create useful synergies. OLAP can be used in the preprocessing stages of data mining. This makes understanding the data easier, because it becomes possible to focus on the most important data, identifying special cases or looking for principal interrelations. The final data mining results, expressed using specific summary variables, can be easily represented in an OLAP hypercube. We can summarise what we have said so far in a simple sequence that shows the evolution of business intelligence tools used to extrapolate knowledge from a database:

QUERY AND REPORTING → DATA RETRIEVAL → OLAP → DATA MINING

Query and reporting has the lowest information capacity and data mining the highest; query and reporting is the easiest to implement and data mining the hardest. This suggests a trade-off between information capacity and ease of implementation. The choice of tool must also consider the specific needs of the business and the characteristics of the company’s information system.

Lack of information is one of the greatest obstacles to achieving efficient data mining. Very often a database is created for reasons that have nothing to do with data mining, so the important information may be missing. Incorrect data is another problem. The creation of a data warehouse can eliminate many of these problems. Efficient organisation of the data in a data warehouse, coupled with efficient and scalable data mining, allows the data to be used correctly and efficiently to support company decisions.

1.1.2 Data mining and statistics

Statistics has always been about creating methods to analyse data. The main difference between statistical methods and machine learning methods is that statistical methods are usually developed in relation to the data being analysed but also according to a conceptual reference paradigm. Although this has made the statistical methods coherent and rigorous, it has also limited their ability to adapt quickly to the new methodologies arising from new information technology and new machine learning applications. Statisticians have recently shown an interest in data mining and this could help its development.

For a long time statisticians saw data mining as synonymous with ‘data fishing’, ‘data dredging’ or ‘data snooping’. In all these cases data mining had negative connotations. This idea came about because of two main criticisms. First, there is not just one theoretical reference model but several models in competition with each other; these models are chosen depending on the data being examined. The criticism of this procedure is that it is always possible to find a model, however complex, which will adapt well to the data. Second, the great amount of data available may lead to non-existent relations being found among the data.
Although these criticisms are worth considering, we shall see that the modern methods of data mining pay great attention to the possibility of generalising results. This means that when choosing a model, the predictive performance is considered and the more complex models are penalised. It is also difficult to ignore the fact that many important findings are not known beforehand and cannot be used in developing a research hypothesis. This happens in particular when there are large databases.

This last aspect is one of the characteristics that distinguishes data mining from statistical analysis. Whereas statistical analysis traditionally concerns itself with analysing primary data that has been collected to check specific research hypotheses, data mining can also concern itself with secondary data collected for other reasons. This is the norm, for example, when analysing company data that comes from a data warehouse. Furthermore, statistical data can be experimental data (perhaps the result of an experiment which randomly allocates all the statistical units to different kinds of treatment), but in data mining the data is typically observational data.

Berry and Linoff (1997) distinguish two analytical approaches to data mining. They differentiate top-down analysis (confirmative) and bottom-up analysis (explorative). Top-down analysis aims to confirm or reject hypotheses and tries to widen our knowledge of a partially understood phenomenon; it achieves this principally by using the traditional statistical methods. Bottom-up analysis is where the user looks for useful information previously unnoticed, searching through the data and looking for ways of connecting it to create hypotheses. The bottom-up approach is typical of data mining. In reality the two approaches are complementary. In fact, the information obtained from a bottom-up analysis, which identifies important relations and tendencies, cannot explain why these discoveries are useful and to what extent they are valid. The confirmative tools of top-down analysis can be used to confirm the discoveries and evaluate the quality of decisions based on those discoveries.

There are at least three other aspects that distinguish statistical data analysis from data mining. First, data mining analyses great masses of data. This implies new considerations for statistical analysis. For many applications it is impossible to analyse or even access the whole database, for reasons of computational efficiency. Therefore it becomes necessary to have a sample of the data from the database being examined. This sampling must take account of the data mining aims, so it cannot be performed using traditional statistical theory. Second, many databases do not lead to the classic forms of statistical data organisation, for example data that comes from the internet. This creates a need for appropriate analytical methods from outside the field of statistics. Third, data mining results must be of some consequence. This means that constant attention must be given to business results achieved with the data analysis models.

In conclusion, there are reasons for believing that data mining is nothing new from a statistical viewpoint. But there are also reasons to support the idea that, because of their nature, statistical methods should be able to study and formalise the methods used in data mining.
This means that on one hand we need to look at the problems posed by data mining from a viewpoint of statistics and utility, while on the other hand we need to develop a conceptual paradigm that allows statisticians to lead the data mining methods back to a scheme of general and coherent analysis.

1.2 The data mining process

Data mining is a series of activities from defining objectives to evaluating results. Here are its seven phases:

A. Definition of the objectives for analysis
B. Selection, organisation and pretreatment of the data
C. Exploratory analysis of the data and subsequent transformation
D. Specification of the statistical methods to be used in the analysis phase
E. Analysis of the data based on the chosen methods
F. Evaluation and comparison of the methods used and the choice of the final model for analysis
G. Interpretation of the chosen model and its subsequent use in decision processes

Definition of the objectives
Definition of the objectives involves defining the aims of the analysis. It is not always easy to define the phenomenon we want to analyse. In fact, the company objectives that we are aiming for are usually clear, but the underlying problems can be difficult to translate into detailed objectives that need to be analysed. A clear statement of the problem and the objectives to be achieved are the prerequisites for setting up the analysis correctly. This is certainly one of the most difficult parts of the process, since what is established at this stage determines how the subsequent method is organised. Therefore the objectives must be clear and there must be no room for doubts or uncertainties.

Organisation of the data
Once the objectives of the analysis have been identified, it is necessary to select the data for the analysis. First of all it is necessary to identify the data sources. Usually data is taken from internal sources that are cheaper and more reliable. This data also has the advantage of being the result of experiences and procedures of the company itself. The ideal data source is the company data warehouse, a storeroom of historical data that is no longer subject to changes and from which it is easy to extract topic databases, or data marts, of interest. If there is no data warehouse then the data marts must be created by overlapping the different sources of company data.

In general, the creation of data marts to be analysed provides the fundamental input for the subsequent data analysis. It leads to a representation of the data, usually in a tabular form known as a data matrix, that is based on the analytical needs and the previously established aims. Once a data matrix is available it is often necessary to carry out a preliminary cleaning of the data. In other words, a quality control is carried out on the available data, known as data cleansing. It is a formal process used to highlight any variables that exist but which are not suitable for analysis. It is also an important check on the contents of the variables and the possible presence of missing or incorrect data. If any essential information is missing, it will then be necessary to review the phase that highlights the source.

Finally, it is often useful to set up an analysis on a subset or sample of the available data. This is because the quality of the information collected from the complete analysis across the whole available data mart is not always better than the information obtained from an investigation of the samples.
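As a minimal sketch of this sampling step, here is one way it might look in Python with pandas (the book itself uses SAS tools; the data_mart table, its columns and the 20% sampling fraction are purely hypothetical):

```python
import pandas as pd

# Hypothetical data mart: in practice this would be extracted from the
# data warehouse; a small synthetic table stands in for it here.
data_mart = pd.DataFrame({
    "customer_id": range(1, 10001),
    "n_purchases": [i % 7 for i in range(10000)],
    "churned":     [i % 5 == 0 for i in range(10000)],
})

# Draw a reproducible random sample for model building...
analysis_sample = data_mart.sample(frac=0.2, random_state=42)

# ...and keep the remaining records aside to validate the model later.
holdout = data_mart.drop(analysis_sample.index)

print(len(analysis_sample), "records for analysis,",
      len(holdout), "records held out for validation")
```

The held-out records can then play the diagnostic role described below: a model built on the sample is checked against data that took no part in its construction.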
In fact, in data mining the analysed databases are often very large, so using a sample of the data reduces the analysis time. Working with samples allows us to check the model’s validity against the rest of the data, giving an important diagnostic tool. It also reduces the risk that the statistical method might adapt to irregularities and lose its ability to generalise and forecast.

Exploratory analysis of the data
This phase involves a preliminary exploratory analysis of the data, very similar to OLAP techniques. An initial evaluation of the data’s importance can lead to a transformation of the original variables to better understand the phenomenon, or it can lead to statistical methods based on satisfying specific initial hypotheses. Exploratory analysis can highlight any anomalous data – items that are different from the rest. These items will not necessarily be eliminated, because they might contain information that is important to achieve the objectives of the analysis. I think that an exploratory analysis of the data is essential because it allows the analyst to predict which statistical methods might be most appropriate in the next phase of the analysis. This choice must obviously bear in mind the quality of the data obtained from the previous phase. The exploratory analysis might also suggest the need for a new extraction of data because the data collected is considered insufficient to achieve the set aims. The main exploratory methods for data mining will be discussed in Chapter 3.

Specification of statistical methods
There are various statistical methods that can be used, and there are also many algorithms, so it is important to have a classification of the existing methods. The choice of method depends on the problem being studied or the type of data available. The data mining process is guided by the applications. For this reason the methods used can be classified according to the aim of the analysis. Then we can distinguish three main classes:

• Descriptive methods: aim to describe groups of data more briefly; they are also called symmetrical, unsupervised or indirect methods. Observations may be classified into groups not known beforehand (cluster analysis, Kohonen maps); variables may be connected among themselves according to links unknown beforehand (association methods, log-linear models, graphical models). In this way all the variables available are treated at the same level and there are no hypotheses of causality. Chapters 4 and 5 give examples of these methods.

• Predictive methods: aim to describe one or more of the variables in relation to all the others; they are also called asymmetrical, supervised or direct methods. This is done by looking for rules of classification or prediction based on the data. These rules help us to predict or classify the future result of one or more response or target variables in relation to what happens to the explanatory or input variables. The main methods of this type are those developed in the field of machine learning, such as neural networks (multilayer perceptrons) and decision trees, but also classic statistical models such as linear and logistic regression models. Chapters 4 and 5 both illustrate examples of these methods.

• Local methods: aim to identify particular characteristics related to subsets of interest in the database; descriptive methods and predictive methods are global rather than local. Examples of local methods are association rules for analysing transactional data, which we shall look at in Chapter 4, and the identification of anomalous observations (outliers), also discussed in Chapter 4.

I think this classification is exhaustive, especially from a functional viewpoint. Further distinctions are discussed in the literature. Each method can be used on its own or as one stage in a multistage analysis.

Data analysis
Once the statistical methods have been specified, they must be translated into appropriate computational algorithms that help us synthesise the results we need from the available database. The wide range of specialised and non-specialised software for data mining means that for most standard applications it is not necessary to develop ad hoc algorithms; the algorithms that come with the software should be sufficient. Nevertheless, those managing the data mining process should have a sound knowledge of the different methods as well as the software solutions, so they can adapt the process to the specific needs of the company and interpret the results correctly when taking decisions.

Evaluation of statistical methods
To produce a final decision it is necessary to choose the best model of data analysis from the statistical methods available. Therefore the choice of the model and the final decision rule are based on a comparison of the results obtained with the different methods. This is an important diagnostic check on the validity of the specific statistical methods that are then applied to the available data. It is possible that none of the methods used permits the set of aims to be achieved satisfactorily. Then it will be necessary to go back and specify a new method that is more appropriate for the analysis. When evaluating the performance of a specific method, as well as diagnostic measures of a statistical type, other things must be considered such as time constraints, resource constraints, data quality and data availability. In data mining it is rarely a good idea to use just one statistical method to analyse the data. Different methods have the potential to highlight different aspects, aspects which might otherwise have been ignored. To choose the best final model it is necessary to apply and compare various techniques quickly and simply, to compare the results produced and then give a business evaluation of the different rules created.

Implementation of the methods
Data mining is not just an analysis of the data; it is also the integration of the results into the decision process of the company. Business knowledge, the extraction of rules and their participation in the decision process allow us to move from the analytical phase to the production of a decision engine. Once the model has been chosen and tested with a data set, the classification rule can be applied to the whole reference population. For example, we will be able to distinguish beforehand which customers will be more profitable, or we can calibrate differentiated commercial policies for different target consumer groups, thereby increasing the profits of the company.

Having seen the benefits we can get from data mining, it is crucial to implement the process correctly to exploit its full potential. The inclusion of the data mining process in the company organisation must be done gradually, setting out realistic aims and looking at the results along the way.
The final aim is for data mining to be fully integrated with the other activities that are used to back up company decisions. This process of integration can be divided into four phases:

• Strategic phase: in this first phase we study the business procedure being used in order to identify where data mining could give most benefits. The results at the end of this phase are the definition of the business objectives for a pilot data mining project and the definition of criteria to evaluate the project itself.

• Training phase: this phase allows us to evaluate the data mining activity more carefully. A pilot project is set up and the results are assessed using the objectives and the criteria established in the previous phase. The choice of the pilot project is a fundamental aspect. It must be simple and easy to use but important enough to create interest. If the pilot project is positive, there are two possible results: the preliminary evaluation of the utility of the different data mining techniques and the definition of a prototype data mining system.

• Creation phase: if the positive evaluation of the pilot project results in implementing a complete data mining system, it will then be necessary to establish a detailed plan to reorganise the business procedure to include the data mining activity. More specifically, it will be necessary to reorganise the business database with the possible creation of a data warehouse; to develop the previous data mining prototype until we have an initial operational version; and to allocate personnel and time to follow the project.

• Migration phase: at this stage all we need to do is prepare the organisation appropriately so the data mining process can be successfully integrated. This means teaching likely users the potential of the new system and increasing their trust in the benefits it will bring. It also means constantly evaluating (and communicating) the results obtained from the data mining process.

For data mining to be considered a valid process within a company, it needs to involve at least three types of people with strong communication and interactive skills:

– Business experts, to set the objectives and interpret the results of data mining
– Information technology experts, who know about the data and technologies needed
– Experts in statistical methods for the data analysis phase

1.3 Software for data mining

A data mining project requires adequate software to perform the analysis. Most software systems only implement specific techniques; they can be seen as specialised software systems for statistical data analysis. But because the aim of data mining is to look for relations that are previously unknown and to compare the available methods of analysis, I do not think these specialised systems are suitable. Valid data mining software should create an integrated data mining system that allows the use and comparison of different techniques; it should also integrate with complex database management software. Few such systems exist. Most of the available options are listed on the website www.kdnuggets.com/. This book makes many references to the SAS software, so here is a brief description of the integrated SAS data mining software called Enterprise Miner (SAS Institute, 2001). Most of the processing presented in the case studies is carried out using this system as well as other SAS software models.
To plan, implement and successfully set up a data mining project it is necessary to have an integrated software solution that includes all the phases of the analytical process: from sampling the data, through the analytical and modelling phases, and up to the publication of the resulting business information. Furthermore, the ideal solution should be user-friendly, intuitive and flexible enough to allow a user with little experience in statistics to understand and use it. The SAS Enterprise Miner software is a solution of this kind. It comes from SAS’s long experience in the production of software tools for data analysis, and since it appeared on the market in 1998 it has become a worldwide leader in this field. It brings together the SAS system of statistical analysis and reporting with a graphical user interface (GUI) that is easy to use and can be understood by company analysts and statistics experts.

The GUI elements can be used to implement the data mining method developed by the SAS Institute, the SEMMA method. This method sets out some basic data mining elements without imposing a rigid and predetermined route for the project. It provides a logical process that allows business analysts and statistics experts to achieve the aims of a data mining project by choosing the elements of the GUI they need. The visual representation of this structure is a process flow diagram (PFD) that graphically illustrates the steps taken to complete a single data mining project.

The SEMMA method defined by the SAS Institute is a general reference structure that can be used to organise the phases of a data mining project. Schematically, the SEMMA method consists of a series of ‘steps’ that must be followed to complete the data analysis, steps which are perfectly integrated with SAS Enterprise Miner. SEMMA is an acronym that stands for ‘sample, explore, modify, model and assess’:

• Sample: this extracts a part of the data that is large enough to contain important information and small enough to be analysed quickly.
• Explore: the data is examined to find beforehand any relations and abnormalities and to understand which data could be of interest.
• Modify and model: these phases seek the important variables and the models that provide the information contained in the data.
• Assess: this assesses the utility and the reliability of the information discovered by the data mining process. The rules from the models are applied to the real environment of the analysis.

1.4 Organisation of the book

This book is divided into two complementary parts. The first part describes the methodology and systematically treats data mining as a process of database analysis that tries to produce results which can be immediately used for decision making. The second part contains some case studies that illustrate data mining in real business applications. Figure 1.1 shows this organisation.

Figure 1.1 Organisation of the book: A. Aims of the analysis (case studies); B. Organisation of the data (Chapter 2); C. Exploratory data analysis (Chapter 3); D. Statistical model specification (Chapters 4 and 5); E. Data analysis (case studies); F. Model evaluation and comparison (Chapter 6); G. Interpretation of the results (case studies).

Phases B, C, D and F receive one chapter each in the first part of the book; phases A, E and G will be discussed in depth in the second part of the book. Let us now look in greater detail at the two parts.
1.4.1 Chapters 2 to 6: methodology

The first part of the book illustrates the main methodologies. Chapter 2 illustrates the main aspects related to the organisation of the data. It looks at the creation of a ready-to-analyse database, starting from examples of available structures – the data warehouse, the data webhouse, and the data mart – which can be easily transformed for statistical analysis. It introduces the important distinction between types of data, which can be quantitative and qualitative, nominal and ordinal, discrete and continuous. Data types are particularly important when specifying a model for analysis. The data matrix, which is the base structure of the statistical analysis, is discussed. Further on we look at some transformations of the matrix. Finally, other more complex data organisation structures are briefly discussed.

Chapter 3 sets out the most important aspects of exploratory data analysis. It explains concepts and illustrates them with examples. It begins with univariate analysis and moves on to multivariate analysis. Two important topics are reducing the size of the data and analysing qualitative data.

Chapters 4 and 5 describe the main methods used in data mining. We have used the ‘historical’ distinction between methods that do not require a probabilistic formulation (computational methods), many of which have emerged from machine learning, and methods that require a probabilistic formulation (statistical models), which developed in the field of statistics. The main computational methods illustrated in Chapter 4 are cluster analysis, decision trees and neural networks, both supervised and unsupervised. Finally, ‘local’ methods of data mining are introduced, and we will be looking at the most important of these, association and sequence rules. The methods illustrated in Chapter 5 follow the temporal evolution of multivariate statistical methods: from models of linear regression to generalised linear models, which contain models of logistic and log-linear regression, and on to graphical models.

Chapter 6 discusses comparison and evaluation of the different models for data mining. It introduces the concept of discrepancy between statistical methods, then goes on to discuss the most important evaluation criteria and the choice between the different models: statistical tests, criteria based on scoring functions, Bayesian criteria, computational criteria and criteria based on loss functions.

1.4.2 Chapters 7 to 12: business cases

There are many applications for data mining. We shall discuss six of the most frequent applications in the business field, from the most traditional (customer relationship management) to the most recent and innovative (web clickstream analysis).

Chapter 7 looks at market basket analysis. It examines statistical methods for analysing sales figures in order to understand which products were bought together. This type of information makes it possible to increase sales of products by improving the customer offering and promoting sales of other products associated with that offering.

Chapter 8 looks at web clickstream analysis. It shows how information on the order in which the pages of a website are visited can be used to predict the visiting behaviour of the site. The data analysed corresponds to an e-commerce site and therefore it becomes possible to establish which pages influence electronic shopping of particular products.

Chapter 9 looks at web profiling.
Here we analyse data referring to the pages visited in a website, leading to a classification of those who visited the site based on their behaviour profile. With this information it is possible to get a behavioural segmentation of the users that can later be used when making marketing decisions.

Chapter 10 looks at customer relationship management. Some statistical methods are used to identify groups of homogeneous customers in terms of buying behaviour and socio-demographic characteristics. Identification of the different types of customer makes it possible to draw up a personalised marketing campaign, to assess its effects and to look at how the offer can be changed.

Chapter 11 looks at credit scoring. Credit scoring is an example of the scoring procedure that in general gives a score to each statistical unit (customer, debtor, business, etc.). In particular, the aim of credit scoring is to associate each debtor with a numeric value that represents their credit worth. In this way it is possible to decide whether or not to give someone credit based on their score.

Chapter 12 looks at prediction of TV shares. Some statistical linear models, as well as others based on neural networks, are presented to predict TV audiences in prime time on Italian TV. A company that sells advertising space can carry out an analysis of the audience to decide which advertisements to broadcast during certain programmes and at what time.

1.5 Further reading

Since data mining is a recent discipline and is still undergoing great changes, there are many sources of further reading. As well as the large number of technical reports about the commercial software available, there are several articles available in specialised scientific journals as well as numerous thematic volumes. But there are still few complete texts on the topic. The bibliography lists relevant English-language books on data mining. Part of the material in this book is an elaboration from a book in Italian by myself (Giudici, 2001b). Here are the texts that have been most useful in writing this book.

For the methodology
• Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001
• David J. Hand, Heikki Mannila and Padhraic Smyth, Principles of Data Mining, MIT Press, 2001
• Trevor Hastie, Robert Tibshirani and Jerome Friedman, The Elements of Statistical Learning: Data Mining, Inference and Prediction, Springer-Verlag, 2001

For the applications
• Olivia Parr Rud, Data Mining Cookbook, John Wiley & Sons, 2001
• Michael Berry and Gordon Linoff, Data Mining Techniques for Marketing, Sales and Customer Support, John Wiley & Sons, 2000
• Michael Berry and Gordon Linoff, Mastering Data Mining, John Wiley & Sons, 1997

One specialised scientific journal worth mentioning is Knowledge Discovery and Data Mining; it is the most important review for the whole sector. For introductions and syntheses on data mining see the papers by Fayyad et al. (1996), Hand et al. (2000) and Giudici, Heckerman and Whittaker (2001). The internet is another important source of information. There are many sites dedicated to specific applications of data mining. This can make research using search engines quite slow. These two websites have a good number of links:

• www.kdnuggets.com/
• www.dmreview.com/

There are many conferences on data mining that are often an important source of information and a way to keep up to date with the latest developments.
Information about conferences can be found on the internet using search engines.

PART I Methodology

CHAPTER 2 Organisation of the data

Data analysis requires that the data is organised into an ordered database, but I do not explain how to create a database in this text. The way data is analysed depends greatly on how the data is organised within the database. In our information society there is an abundance of data and a growing need for an efficient way of analysing it. However, an efficient analysis presupposes a valid organisation of the data.

It has become strategic for all medium and large companies to have a unified information system called a data warehouse; this integrates, for example, the accounting data with data arising from the production process, the contacts with the suppliers (supply chain management), and the sales trends and the contacts with the customers (customer relationship management). This makes it possible to get precious information for business management. Another example is the increasing diffusion of electronic trade and commerce and, consequently, the abundance of data about websites visited along with any payment transactions. In this case it is essential for the service supplier, through the internet, to understand who the customers are in order to plan offers. This can be done if the transactions (which correspond to clicks on the web) are transferred to an ordered database, usually called a webhouse, that can later be analysed.

Furthermore, since the information that can be extracted from a data mining process (data analysis) depends on how the data is organised, it is very important to involve the data analyst when setting up the database. Frequently, though, the analyst finds himself with a database that has already been prepared. It is then his job to understand how it has been set up and how best it can be used to meet the needs of the customer. When faced with poorly set up databases it is a good idea to ask for them to be reviewed rather than trying laboriously to extract information that might be of little use.

This chapter looks at how database structure affects data analysis, how a database can be transformed for statistical analysis, and how data can be classified and put into a so-called data matrix. It considers how sometimes it may be a good idea to transform a data matrix in terms of binary variables, frequency distributions, or in other ways. Finally, it looks at examples of more complex data structures.

2.1 From the data warehouse to the data marts

The creation of a valid database is the first and most important operation that must be carried out in order to obtain useful information from the data mining process. This is often the most expensive part of the process in terms of the resources that have to be allocated and the time needed for implementation and development. Although I cover it only briefly, this is an important topic and I advise you to consult other texts for more information, e.g. Berry and Linoff (1997), Han and Kamber (2001) and Hand, Mannila and Smyth (2001). I shall now describe examples of three database structures for data mining analysis: the data warehouse, the data webhouse and the data mart. The first two are complex data structures, but the data mart is a simpler database that usually derives from other data structures (e.g.
from operational and transactional databases, but also from the data warehouse) that are ready to be analysed. 2.1.1 The data warehouse According to Immon (1996), a data warehouse is ‘an integrated collection of data about a collection of subjects (units), which is not volatile in time and can support decisions taken by the management’. From this definition, the first characteristic of a data warehouse is the orienta- tion to the subjects. This means that data in a data warehouse should be divided according to subjects rather than by business. For example, in the case of an insur- ance company the data put into the data warehouse should probably be divided into Customer, Policy and Insurance Premium rather than into Civil Responsi- bility, Life and Accident. The second characteristic is data integration, and it is certainly the most important. The data warehouse must be able to integrate itself perfectly with the multitude of standards used by the different applications from which data is collected. For example, various operational business applications could codify the sex of the customer in different ways and the data warehouse must be able to recognise these standards unequivocally before going on to store the information. Third, a data warehouse can vary in time since the temporal length of a data warehouse usually oscillates between 5 and 10 years; during this period the data collected is no more than a sophisticated series of instant photos taken at specific moments in time. At the same time, the data warehouse is not volatile because data is added rather than updated. In other words, the set of photos will not change each time the data is updated but it will simply be integrated with a new photo. Finally, a data warehouse must produce information that is relevant for management decisions. This means a data warehouse is like a container of all the data needed to carry out business intelligence operations. It is the main difference between a data warehouse and other business databases. Trying to use the data contained in the operational databases to carry out relevant statistical analysis for the business (related to various management decisions) is almost impossible. On the other hand, a data warehouse is built with this specific aim in mind. ORGANISATION OF THE DATA 21 There are two ways to approach the creation of a data warehouse. The first is based on the creation of a single centralised archive that collects all the com- pany information and integrates it with information coming from outside. The second approach brings together different thematic databases, called data marts, that are not initially connected among themselves, but which can evolve to create a perfectly interconnected structure. The first approach allows the system admin- istrators to constantly control the quality of the data introduced. But it requires careful programming to allow for future expansion to receive new data and to con- nect to other databases. The second approach is initially easier to implement and is therefore the most popular solution at the moment. Problems arise when the vari- ous data marts are connected among each other, as it becomes necessary to make a real effort to define, clean and transform the data to obtain a sufficiently uniform level. That is until it becomes a data warehouse in the real sense of the word. In a system that aims to preserve and distribute data, it is also necessary to include information about the organisation of the data itself. 
This data is called metadata and it can be used to increase the security levels inside the data warehouse. Although it may be desirable to allow vast access to information, some specific data marts and some details might require limited access. Metadata is also essential for management, organisation and the exploitation of the various activities. For an analyst it may be very useful to know how the profit variable was calculated, whether the sales areas were divided differently before a certain date, and how a multiperiod event was split in time. The metadata therefore helps to increase the value of the information present in the data warehouse because it becomes more reliable. Another important component of a data warehouse system is a collection of data marts. A data mart is a thematic database, usually represented in a very simple form, that is specialised according to specific objectives (e.g. marketing purposes). To summarise, a valid data warehouse structure should have the following components: (a) a centralised archive that becomes the storehouse of the data; (b) a metadata structure that describes what is available in the data warehouse and where it is; (c) a series of specific and thematic data marts that are easily accessible and which can be converted into statistical structures such as data matrices (Section 2.3). These components should make the data warehouse eas- ily accessible for business intelligence needs, ranging from data querying and reporting to OLAP and data mining. 2.1.2 The data webhouse The data warehouse developed rapidly during the 1990s, when it was very successful and accumulated widespread use. The advent of the web with its rev- olutionary impact has forced the data warehouse to adapt to new requirements. In this new era the data warehouse becomes a web data warehouse or, more simply, data webhouse. The web offers an immense source of data about people who use their browser to interact on websites. Despite the fact that most of the data related to the flow of users is very coarse and very simple, it gives detailed 22 APPLIED DATA MINING information about how internet users surf the net. This huge and undisciplined source can be transferred to the data webhouse, where it can be put together with more conventional sources of data that previously formed the data warehouse. Another change concerns the way in which the data warehouse can be accessed. It is now possible to exploit all the interfaces of the business data warehouse that already exist through the web just by using the browser. With this it is possible to carry out various operations, from simple data entry to ad hoc queries through the web. In this way the data warehouse becomes completely distributed. Speed is a fundamental requirement in the design of a webhouse. However, in the data warehouse environment some requests need a long time before they will be sat- isfied. Slow time processing is intolerable in an environment based on the web. A webhouse must be quickly reachable at any moment and any interruption, however brief, must be avoided. 2.1.3 Data marts A data mart is a thematic database that was originally oriented towards the marketing field. Indeed, its name is a contraction of marketing database. In this sense it can be considered a business archive that contains all the information connected to new and/or potential customers. In other words, it refers to a database that is completely oriented to managing customer relations. 
As we shall see, the analysis of customer relationship management data is probably the main field where data mining can be applied. In general, it is possible to extract from a data warehouse as many data marts as there are aims we want to achieve in a business intelligence analysis. However, a data mart can be created, although with some difficulty, even when there is no integrated warehouse system. The creation of thematic data structures like data marts represents the first and fundamental move towards an informative environment for the data mining activity. There is a case study in Chapter 10. 2.2 Classification of the data Suppose we have a data mart at our disposal, which has been extracted from the databases available according to the aims of the analysis. From a statistical viewpoint, a data mart should be organised according to two principles: the statis- tical units, the elements in the reference population that are considered important for the aims of the analysis (e.g. the supply companies, the customers, the peo- ple who visit the site) and the statistical variables, the important characteristics, measured for each statistical unit (e.g. the amounts customers buy, the payment methods they use, the socio-demographic profile of each customer). The statistical units can refer to the whole reference population (e.g. all the customers of the company) or they can be a sample selected to represent the whole population. There is a large body of work on the statistical theory of sampling and sampling strategies; for further information see Barnett (1975). If we consider an adequately representative sample rather than a whole population, there are several advantages. It might be expensive to collect complete information about the entire population and the analysis of great masses of data could waste a lot of time in ORGANISATION OF THE DATA 23 analysing and interpreting the results (think about the enormous databases of daily telephone calls available to mobile phone companies). The statistical variables are the main source of information to work on in order to extract conclusions about the observed units and eventually to extend these conclusions to a wider population. It is good to have a large number of variables to achieve these aims, but there are two main limits to having an excessively large number. First of all, for efficient and stable analyses the variables should not duplicate information. For example, the presence of the customers’ annual income makes monthly income superfluous. Furthermore, for each statistical unit the data should be correct for all the variables considered. This is difficult when there are many variables, because some data can go missing; missing data causes problems for the analysis. Once the units and the interest variables in the statistical analysis of the data have been established, each observation is related to a statistical unit, and a distinct value (level) for each variable is assigned. This process is known as classification. In general it leads to two different types of variable: qualitative and quantitative. Qualitative variables are typically expressed as an adjectival phrase, so they are classified into levels, sometimes known as categories. Some examples of qualitative variables are sex, postal code and brand preference. Qualitative data is nominal if it appears in different categories but in no particular order; qualitative data is ordinal if the different categories have an order that is either explicit or implicit. 
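To make this taxonomy concrete, here is a minimal sketch in Python with pandas (not taken from the book; the table and its column names are invented for illustration) of how nominal, ordinal and quantitative variables might be kept distinct in a small customer data mart.

```python
import pandas as pd

# Hypothetical customer data mart; the column names are invented for illustration.
customers = pd.DataFrame({
    "sex": ["M", "F", "F", "M"],                      # qualitative nominal
    "credit_rate": ["low", "high", "medium", "low"],  # qualitative ordinal
    "n_calls": [3, 0, 7, 2],                          # quantitative discrete
    "annual_revenue": [153.2, 87.9, 240.0, 99.5],     # quantitative continuous
})

# Nominal variable: categories without any order.
customers["sex"] = pd.Categorical(customers["sex"])

# Ordinal variable: categories with an explicit order (low < medium < high).
customers["credit_rate"] = pd.Categorical(
    customers["credit_rate"], categories=["low", "medium", "high"], ordered=True
)

print(customers.dtypes)
# Order relations (<, >) are defined only for the ordinal variable.
print(customers["credit_rate"].min())
```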
The measurement at a nominal level allows us to establish a relation of equality or inequality between the different levels (=, =) . Examples of nominal measure- ments are the eye colour of a person and the legal status of a company. Ordinal measurements allow us to establish an order relation between the different cate- gories but they do not allow any significant numeric assertion (or metric) on the difference between the categories. More precisely, we can affirm which category is bigger or better but we cannot say by how much (=, >, <). Examples of ordinal measurements are the computing skills of a person and the credit rate of a company. Quantitative variables are linked to intrinsically numerical quantities, such as age and income. It is possible to establish connections and numerical relations among their levels. They can be divided into discrete quantitative variables when they have a finite number of levels, and continuous quantitative variables if the levels cannot be counted. A discrete quantitative variable is the number of telephone calls received in a day; a continuous quantitative variable is the annual revenues of a company. Very often the ordinal level of a qualitative variable is marked with a number. This does not transform the qualitative variable into a quantitative variable, so it is not possible to establish connections and relations between the levels themselves. 2.3 The data matrix Once the data and the variables have been classified into the four main types (qualitative nominal, qualitative ordinal, quantitative discrete and quantitative 24 APPLIED DATA MINING continuous), the database must be transformed into a structure that is ready for statistical analysis. In the case of thematic databases this structure can be described by a data matrix. The data matrix is a table that is usually two- dimensional, where the rows represent the n statistical units considered and the columns represent the p statistical variables considered. Therefore the generic element (i,j) of the matrix (i = 1,...,n and j = 1,...,p) is a classification of the data related to the statistical unit i according to the level of the jth variable, as in Table 2.1. The data matrix is the point where data mining starts. In some cases, such as a joint analysis of quantitative variables, it acts as the input of the analysis phase. Other cases require pre-analysis phases (preprocessing or data transformation). This leads to tables derived from data matrices. For example, in the joint analysis of qualitative variables, since it is impossible to carry out a quantitative analysis directly on the data matrix, it is a good idea to transform the data matrix into a contingency table. This is a table with as many dimensions as there are qualitative variables considered. Each dimension is indexed by the level observed by the corresponding variable. Within each cell in the table we put the joint frequency of the corresponding crossover of the levels. We shall discuss this in more detail in the context of representing the statistical variables in frequency distributions. Table 2.2 is a real example of a data matrix. Lack of space means we can only see some of the 1000 lines included in the table and only some of the 21 columns. Chapter 11 will describe and analyse this table. Table 2.1 The data matrix. 1 ... j ... p 1 X1,1 X1,j X1,p ... i Xi,1 Xi,j Xi,p ... n Xn,1 Xn,j Xn,p Table 2.2 Example of a data matrix. YX1X2... X3 ... X20 N1 1118... 1049 ... 1 ... N34 1424... 1376 ... 1 ... N 1000 0130... 6350 ... 
1 ORGANISATION OF THE DATA 25 Table 2.3 Example of binarisation. YX1X2X3 1 11 0 0 2 30 0 1 3 11 0 0 4 20 1 0 5 30 0 1 6 11 0 0 2.3.1 Binarisation of the data matrix If the variables in the data matrix are all quantitative, including some continuous ones, it is easier and simpler to treat the matrix as input without any pre-analysis. But if the variables are all qualitative or discrete quantitative, it is necessary to transform the data matrix into a contingency table (with more than one dimen- sion). This is not necessarily a good idea if p is large. If the variables in the data matrix belong to both types, it is best to transform the variables into the minority type, bringing them to the level of the others. For example, if most of the variables are qualitative and there are some quantitative variables, some of which are continuous, contingency tables will be used, preceded by the discreti- sation of the continuous variables into interval classes. This results in a loss of information. If most of the variables are quantitative, the best solution is to make the qualitative variables metric. This is called binarisation. Consider a binary variable set to 0 in the presence of a certain level and 1 if this level is absent. We can define a distance for this variable, so it can be seen as a quantitative variable. In the binarisation approach, each qualitative variable is transformed into as many binary variables as there are levels of the same type. For example, if a qualitative variable X has r levels, then r binary variables will be created as follows: for the generic level i, the corresponding binary variable will be set to 1 when X is equal to i, otherwise it will be set to 0. Table 2.3 shows a qualitative variable with three levels (indicated by Y) transformed into the three binary variables X1, X2, X3. 2.4 Frequency distributions Often it seems natural to summarise statistical variables by the co-occurrence of their levels. A summary of this type is called a frequency distribution. In all procedures of this kind, the summary makes it easier to analyse and present the results, but it also leads to a loss of information. In the case of qualitative variables, the summary is justified by the need to carry out quantitative analysis on the data. In other situations, such as with quantitative variables, the summary is essentially to simplify the analysis and presentation of results. 26 APPLIED DATA MINING 2.4.1 Univariate distributions First we will concentrate on univariate analysis, the analysis of a single vari- able. This simplifies presentation of results but it also simplifies the analytical method. It is easier to extract information from a database by beginning with univariate analysis and then moving on to multivariate analysis. Determining the univariate distribution frequency from the data matrix is often the first step in a univariate exploratory analysis. To create a frequency distribution for a variable it is necessary to know the number of times each level appears in the data. This number is called the absolute frequency. The levels and their frequencies give the frequency distribution. The observations related to the variable being examined can be indicated as follows: x1,x2,...,xN , omitting the index related to the variable itself. The dis- tinct values between the N observations (levels) are indicated as x∗ 1 ,x∗ 2 ,...,x∗ k (k ≤ N). The frequency distribution is shown as in Table 2.4 where ni indi- cates the number of times level x∗ i appears (its absolute frequency). 
Note that the absolute frequencies sum to the total, $\sum_{i=1}^{k} n_i = N$, where $N$ is the number of classified units.

Table 2.4 Univariate frequency distribution.
Levels    Absolute frequencies
x*_1      n_1
x*_2      n_2
...       ...
x*_k      n_k

Table 2.5 shows an example of a frequency distribution for a binary qualitative variable that will be analysed in Chapter 10. It can be seen from Table 2.5 that the data at hand is fairly balanced between the two levels.

Table 2.5 Example of a frequency distribution.
Levels    Absolute frequencies
0         1445
1         1006

To make reading and interpretation easier, a frequency distribution is usually presented with relative frequencies. The relative frequency of the level $x^*_i$, indicated by $p_i$, is defined by the relationship between the absolute frequency $n_i$ and the total number of observations: $p_i = n_i / N$. Note that $\sum_{i=1}^{k} p_i = 1$. The resulting distribution is shown in Table 2.6; for the frequency distribution in Table 2.5 we obtain the relative frequencies in Table 2.7.

Table 2.6 Univariate relative frequency distribution.
Levels    Relative frequencies
x*_1      p_1
x*_2      p_2
...       ...
x*_k      p_k

Table 2.7 Example of a univariate relative frequency distribution.
Levels    Relative frequencies
0         0.59
1         0.41

2.4.2 Multivariate distributions

Now we shall see how it is possible to create multivariate frequency distributions for the joint examination of more than one variable. We will look particularly at qualitative or discrete quantitative variables; for continuous quantitative multivariate variables, it is better to work directly with the data matrix. Multivariate frequency distributions are represented by a contingency table. For clarity, we will mainly consider the case where two variables are examined at a time. This creates a bivariate distribution, represented by a contingency table with two dimensions.

Let X and Y be the two variables collected for N statistical units, which take on $h$ levels $x^*_1, \ldots, x^*_h$ for X and $k$ levels $y^*_1, \ldots, y^*_k$ for Y. The result of the joint classification of the variables into a contingency table can be summarised by the pairs $\{(x^*_i, y^*_j),\ n_{xy}(x^*_i, y^*_j)\}$, where $n_{xy}(x^*_i, y^*_j)$ indicates the number of statistical units, among the N considered, for which the level pair $(x^*_i, y^*_j)$ is observed. The value $n_{xy}(x^*_i, y^*_j)$ is called the absolute joint frequency of the pair $(x^*_i, y^*_j)$. For simplicity we will often denote $n_{xy}(x^*_i, y^*_j)$ by the symbol $n_{ij}$. Note that since $N = \sum_i \sum_j n_{xy}(x^*_i, y^*_j)$ is equal to the total number of classified units, we can get the relative joint frequencies from the equation
$$p_{xy}(x^*_i, y^*_j) = \frac{n_{xy}(x^*_i, y^*_j)}{N}$$
To classify the observations into a contingency table, we could mark the levels of the variable X in the rows and the levels of the variable Y in the columns. In the table we will therefore include the joint frequencies, as shown in Table 2.8.

Table 2.8 A two-way contingency table.
X\Y      y*_1              ...   y*_j              ...   y*_k              Total
x*_1     n_xy(x*_1,y*_1)   ...   n_xy(x*_1,y*_j)   ...   n_xy(x*_1,y*_k)   n_x(x*_1)
...      ...               ...   ...               ...   ...               ...
x*_i     n_xy(x*_i,y*_1)   ...   n_xy(x*_i,y*_j)   ...   n_xy(x*_i,y*_k)   n_x(x*_i)
...      ...               ...   ...               ...   ...               ...
x*_h     n_xy(x*_h,y*_1)   ...   n_xy(x*_h,y*_j)   ...   n_xy(x*_h,y*_k)   n_x(x*_h)
Total    n_y(y*_1)         ...   n_y(y*_j)         ...   n_y(y*_k)         N
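The frequency distributions and contingency tables just described can be computed directly from a data matrix. The sketch below (Python with pandas, on simulated binary variables with invented names, not the book's data sets) mirrors the structure of Tables 2.5, 2.7 and 2.8, and closes with the binarisation of Section 2.3.1.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical data matrix with two binary qualitative variables.
df = pd.DataFrame({
    "npurchases": rng.integers(0, 2, size=1000),
    "south": rng.integers(0, 2, size=1000),
})

# Univariate frequency distribution: absolute and relative frequencies
# (the structure of Tables 2.5 and 2.7).
print(df["npurchases"].value_counts().sort_index())
print(df["npurchases"].value_counts(normalize=True).sort_index())

# Two-way contingency table of absolute joint frequencies n_ij,
# with the marginal frequencies in the 'All' row and column (as in Table 2.8).
print(pd.crosstab(df["npurchases"], df["south"], margins=True))

# Relative joint frequencies p_xy = n_ij / N.
print(pd.crosstab(df["npurchases"], df["south"], normalize="all"))

# Binarisation of a qualitative variable (Section 2.3.1, Table 2.3):
# one 0/1 column for each level of Y.
y = pd.Series([1, 3, 1, 2, 3, 1], name="Y")
print(pd.get_dummies(y, prefix="X").astype(int))
```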
Note that from the joint frequencies it is easy to get the marginal univariate frequencies of X and Y using the following equations: nX(x∗ i ) =  j nxy (x∗ i ,y∗ j ) nY (y∗ j ) =  i nxy (x∗ i ,y∗ j ) Table 2.8 reports absolute frequencies. It can also be expressed in terms of relative frequencies. This will lead to two analogous equations that determine marginal relative univariate frequencies. From a joint frequency distribution it is also possible to determine h frequency distributions of the variable Y, conditioned on the h levels of X.Eachofthese, indicated by (Y|X = x∗ i ), shows the distribution frequency of Y only for the observations where X = xi. For example, the frequency with which we observe Y = y∗ 1 conditional on X = x∗ i can be obtained from the ratio pY|X(y∗ 1 |x∗ i ) = pxy (x∗ i ,y∗ 1 ) pX(x∗ i ) where pxy indicates the distribution of the joint frequency of X and Y and pX the distribution of the marginal frequency (unidimensional) of X. Similarly, we can get k frequency distributions of the X conditioned on the k levels of Y. Statistical software makes it easy to create and analyse contingency tables. Consider a 2 × 2 table where X is the binary variable Npurchases (number of purchases) and Y = South (referring to the geographic area where the customer comes from); we will look at this in more detail in Chapter 10. The output TEAM FLY ORGANISATION OF THE DATA 29 Table 2.9 Example of a two-way contingency table: NPURCHASES (rows) by SOUTH (columns). 0, 1, Total ***************************** 0 , 1102 , 343 , 1445 , 44.96 , 13.99 , 58.96 , 76.26 , 23.74 , , 57.40 , 64.60 , ***************************** 1 , 818 , 188 , 1006 , 33.37 , 7.67 , 41.04 , 81.31 , 18.69 , , 42.60 , 35.40 , ***************************** Total 1920 531 2451 78.34 21.66 100.00 in Table 2.9 shows the following four pieces of information, for each of the four possible levels for X and Y: (a) absolute frequency of the pair; (b) relative frequency of the pair; (c) conditional frequency of X = x, conditionally on the Y row; (d) conditional frequency of Y = y, conditionally on the X column. 2.5 Transformation of the data The transformation of the data matrix into univariate and multivariate frequency distributions is not the only possible transformation. Other transformations can also be very important to simplify the statistical analysis and/or the interpretation of results. For example when the p variables of the data matrix are expressed in different measurement units, it is a good idea to put all the variables into the same measurement unit so that the different measurement scales do not affect the results. This can be done using a linear transformation that standardises the variables, taking away the average of each one and dividing it by the square root of its variance. This produces a variable with a zero average and a unit variance. There are other particularly interesting data transformations, such as the non- linear Box–Cox transformation. The reader can find more on this in other books, such as Han and Kamber (2001). The transformation of data is also a way of solving problems with data quality, perhaps because items are missing or because there are anomalous values, known as outliers. There are two main ways to deal with missing data: (a) remove it, (b) substitute it using the remaining data. Identifying anomalous values is often a motivation for data mining in the first place. 
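As a small illustration of the transformations just mentioned, the following sketch (Python with pandas; the variables and values are invented) standardises the quantitative columns of a data matrix and shows the two basic options for missing values: removal, or substitution using the remaining data.

```python
import numpy as np
import pandas as pd

# Hypothetical data matrix with variables on different scales and a missing value.
df = pd.DataFrame({
    "income": [21.0, 35.5, np.nan, 50.2, 28.7],  # e.g. thousands of euros
    "age":    [34.0, 51.0, 45.0,   62.0, 29.0],
})

# Standardisation: subtract the mean and divide by the standard deviation,
# so that each variable has zero mean and unit variance.
z = (df - df.mean()) / df.std(ddof=0)
print(z)

# Missing data, option (a): remove the incomplete observations.
print(df.dropna())

# Missing data, option (b): substitute using the remaining data (here, the mean).
print(df.fillna(df.mean()))
```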
The discovery of anomalous values requires a formal statistical analysis; an anomalous value can seldom 30 APPLIED DATA MINING be eliminated as its existence often provides important information about the descriptive or predictive model connected to the data under examination. For example, in the analysis of fraud detections, perhaps related to telephone calls or credit cards, the aim is to identify suspicious behaviour. Han and Kamber (2001) provide more information on data quality and its problems. 2.6 Other data structures Some data mining applications may require a thematic database not expressible in terms of the data matrix we have considered up to now. For example, there are often other aspects to be considered such as the time and space in which the data is collected. Often in this kind of application the data is aggregated or divided (e.g. into periods or regions); for more on this topic see Diggle, Liang and Zeger (1994). The most important case refers to longitudinal data, for example, the surveys in n companies of the p budget variables in q successive years, or surveys of socio-economic indicators for the regions in a periodic (e.g. decennial) census. In this case there will be a three-way matrix which could be described by three dimensions, concerning n statistical units, p statistical variables and q times. Another important case is data related to different geographic areas. Here too there is a three-way matrix with space as the third dimension, for example, the sales of a company in different regions or the satellite surveys of the environmental characteristics of different regions. In both these cases, data mining should be accompanied by specific methods from time series analysis (Chatfield, 1996) or from spatial data analysis (Cressie, 1991). Developments in the information society have meant that data is now wider- ranging and increasingly complex; it is not structured and that makes it difficult to represent in the form of data matrices (even in extended forms as in the previous cases). Three important examples are text data, web data and multimedia data. Text databases consist of a mass of text documents usually connected by logical relations. Web data is contained in log files that describe what each visitor to a website does during his interaction with the site. Multimedia data can be made up of texts, images, sounds and other forms of audio-visual information that are typically downloaded from the internet and that describe an interaction with the website more complex than the previous example. This type of data analysis creates a more complex situation. The first difficulty concerns the organisation of the data; that is an important and very modern topic of research (e.g. Han and Kamber, 2001). There are still very few statistical applications for analysing this data. Chapter 8 tries to provide a statistical contribution to the analysis of these important problems; it shows how an appropriate analysis of the web data contained in the log file can give us important data mining results about access to websites. Another important type of complex data structure arises from the integration of different databases. In the modern applications of data mining it is often necessary to combine data that comes from different sources of data; one example is the ORGANISATION OF THE DATA 31 integration of official statistics from the European Statistics Office, Eurostat. 
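Returning for a moment to the longitudinal structures mentioned above, here is a minimal sketch (Python with pandas; companies, years and figures are invented) of one convenient way, not prescribed by the book, to hold a three-way structure of units, variables and times in a single table with a hierarchical index.

```python
import pandas as pd

# Hypothetical longitudinal data: 2 companies, 2 budget variables, 3 years.
long_form = pd.DataFrame({
    "company": ["A", "A", "A", "B", "B", "B"],
    "year":    [2000, 2001, 2002, 2000, 2001, 2002],
    "revenue": [10.2, 11.0, 12.5, 7.4, 7.9, 8.3],
    "profit":  [1.1, 1.3, 1.6, 0.6, 0.7, 0.9],
})

# The three-way structure (units x variables x times) as a hierarchical index.
panel = long_form.set_index(["company", "year"])
print(panel)

# Slicing on the time dimension recovers an ordinary units x variables data matrix.
print(panel.xs(2001, level="year"))
```

The integration of different sources, as in the Eurostat example, is a further challenge.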
Up to now this data fusion problem has been discussed mainly from a computational viewpoint (Han and Kamber, 2001). Some data is now observable in continuous time rather than discrete time. In this case the observations for each variable on each unit are more like a function than a point value. Important examples include monitoring the presence of polluting atmospheric agents over time and surveys on the quotation of various financial shares. These are examples of continuous time stochastic processes (Hoel, Port and Stone, 1972). 2.7 Further reading This chapter introduced the organisation and structure of databases for data min- ing. The most important idea is that the planning and creation of the database cannot be ignored. They are crucial to obtaining results that can be used in the subsequent phases of the analysis. I see data mining as part of a complete process of design, collection and data analysis with the aim of obtaining useful results for companies in the sphere of business intelligence. Database creation and data analysis are closely connected. The chapter started with a description of the various ways we can structure databases, with particular reference to the data warehouse, the data webhouse and the data mart. For more details on these topic, Han and Kamber (2001) take a computational viewpoint and Berry and Linoff (1997, 2000) take a business- oriented viewpoint. The fundamental themes from descriptive statistics are measurement scales and data classification. This leads to an important taxonomy of the statistical variables that is the basis of my operational distinction of data mining methods. Next comes the data matrix. The data matrix is a very important tool in data mining that allows us to define the objectives of the subsequent analysis according to the formal language of statistics. For an introduction to these concepts see for instance Hand et al. (2001). The chapter introduced some operations on the data matrix. These operations may be essential or they may be just a good idea. Examples are binarisation, the calculation of frequency distributions, variable transformations, and the treatment of anomalous or missing data. Hand et al. (2001) take a statistical viewpoint and Han and Kamber (2001) take a computational viewpoint. Finally, we briefly touched on the description of complex data structures; for more details consult the previous two books. CHAPTER 3 Exploratory data analysis In a quality statistical data analysis the initial step has to be exploratory. This is particularly true of applied data mining, which essentially consists of searching for relationships in the data at hand, not known a priori. Exploratory data anal- ysis has to take the available information organised as explained in Chapter 2, then analyse it, to summarise the whole data set. This is usually carried out through potentially computationally intensive graphical representations and sta- tistical summary measures, relevant for the aims of the analysis. Exploratory data analysis could seem equivalent to data mining itself, but there are two main differences. From the statistical viewpoint, exploratory data analysis essentially uses descriptive statistical techniques, whereas data mining can use descriptive and inferential methods; inferential methods are based on probabilistic techniques. There is a considerable difference between the purpose of data mining and exploratory analysis. 
The prevailing purpose of an exploratory analysis is to describe the structure and the relationships present in the data, for eventual use in a statistical model. The purpose of a data mining analysis is the direct production of decision rules based on the structures and models that describe the data. This implies, for example, a considerable difference in the use of concurrent techniques. An exploratory analysis is often composed of several exploratory techniques, each one capturing different and potentially noteworthy aspects of the data. In data mining, the various techniques are evaluated and compared in order to choose one that could subsequently be implemented as a decision rule. Coppi (2002) discusses the differences between exploratory data analysis and data mining. This chapter takes an operational approach to exploratory data analysis. It begins with univariate exploratory analysis – examining the variables one at a time. Even though the observed data is multidimensional and we will eventually need to consider the interrelationships between the variables, we can gain a lot of insight from examining each variable on its own. Next comes bivariate analysis. At this stage, the treatment of bivariate and multivariate analysis will use quantitative variables exclusively. This is followed by multivariate exploratory analysis of qualitative data. In particular, we will compare some of the numerous summary measures in the statistical literature. It is difficult to analyse data with many dimensions, so the Applied Data Mining. Paolo Giudici  2003 John Wiley & Sons, Ltd ISBNs: 0-470-84679-8 (Paper); 0-470-84678-X (Cloth) 34 APPLIED DATA MINING final section looks at principal component analysis (PCA), a popular method for reducing dimensionality. 3.1 Univariate exploratory analysis Analysis of the individual variables is an important step in preliminary data analysis. It can gather important information for later multivariate analysis and modelling. The main instruments of exploratory univariate analysis are univariate graphical displays and a series of summary indexes. Graphical displays differ according to the type of data. Bar charts and pie diagrams are commonly used to represent qualitative nominal data. The horizontal axis, or x-axis, of the bar chart indicates the variable’s categories, and the vertical axis, or y-axis, indicates the absolute or relative frequencies of a given level of the variable. The order of the variables along the horizontal axis generally has no significance. Pie diagrams divide the pie into wedges where each wedge’s area is proportional to the relative frequency of the variable level it represents. Frequency diagrams are typically used to represent ordinal qualitative and discrete quantitative variables. They are simply bar charts where the order in which the variables are inserted on the horizontal axis must correspond to the numeric order of the levels. To obtain a frequency distribution for continuous quantitative variables, first reclassify or discretise the variables into class intervals. Begin by establishing the width of each interval. Unless there are special reasons for doing otherwise, the convention is to adopt intervals with constant width or intervals with different widths but with the same frequency (equifrequent). This may lead to some loss of information, since it is assumed that the variable distributes in a uniform way within each class. However, reclassification makes it possible to obtain a summary that can reveal interesting patterns. 
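A short sketch of the two discretisation conventions just mentioned, written in Python with pandas on simulated values (the variable is invented): class intervals of constant width versus equifrequent intervals.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Hypothetical continuous variable (e.g. net returns of a set of enterprises).
net_returns = pd.Series(rng.normal(loc=5, scale=10, size=500))

# Class intervals of constant width.
equal_width = pd.cut(net_returns, bins=6)
print(equal_width.value_counts().sort_index())   # frequencies differ across classes

# Equifrequent class intervals: different widths, (roughly) equal frequencies.
equal_freq = pd.qcut(net_returns, q=6)
print(equal_freq.value_counts().sort_index())    # about 500/6 observations per class
```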
The graphical representation of the continuous variables, reclassified into class intervals, is obtained through a histogram. To construct a histogram, the chosen intervals are positioned along the x-axis. A rectangle with area equal to the (relative) frequency of the same class is then built on every interval. The height of these rectangles represent the frequency density, indicated through an analytic function f(x), called the density function. In exploratory data analysis the density function assumes a constant value over each interval, corresponding to the height of the bar in the histogram. The density function can also be used to specify a continuous probability model; in this case f(x) will be a continuous function. The second part of the text has numerous graphical representations similar to those describe here. Using quantitative variables, Figure 3.1 shows an example of a frequency distribution and a histogram. They show, respectively, the distribution of the variables ‘number of components of a family in a region’ and ‘net returns, in thousands of C–– , of a set of enterprises’. So far we have seen how it is possible to graphically represent a univariate distribution. However, sometimes we need to further summarise all of the obser- vations. Therefore it is useful to construct statistical indexes that are well suited to summarising the important aspects of the observations under consideration. EXPLORATORY DATA ANALYSIS 35 −21.15 −10.968 −0.786 9.396 19.578 29.76 0 0.01 0.02 0.03 0.04 0.05 Density 0% 5.00% 10.00% 15.00% 20.00% 25.00% 30.00% 35.00% 40.00% 45.00% 24635(a) (b) Figure 3.1 (a) A frequency diagram and (b) a histogram. We now examine the main unidimensional or univariate statistical indexes; they can be categorised as indexes of location, variability, heterogeneity, concentra- tion, asymmetry and kurtosis. The exposition is brief and elementary; refer to the relevant textbooks for detailed methods. 3.1.1 Measures of location The most commonly used measure of location is the mean, computable only for quantitative variables. Given a set x1,x2,...,xN of N observations, the arith- metic mean (the mean for short) is given by x = x1 + x2 +···+xN N = xi N In calculating the arithmetic mean, the very large observations can counterbalance and even overpower the smallest ones. Since all observations are used in the calculation, any value or set of values can considerably affect the computed mean value. In financial data, where extreme outliers are common, this ‘overpowering’ 36 APPLIED DATA MINING happens often and robust alternatives to the mean are probably preferable as measures of location. The previous expression of the arithmetic mean is to be calculated on the data matrix. When univariate data is classified in terms of a frequency distribution, the arithmetic mean can also be calculated directly on the frequency distribution, leading to the same result, indeed saving computer time. When calculated on the frequency distribution, the arithmetic mean can be expressed as x = x∗ i pi This is known as the weighted arithmetic mean, where the x∗ i indicate the distinct levels that the variable can take and pi is the relative frequency of each of those levels. The arithmetic mean has some important properties: • The sum of the deviations from the mean is zero: (xi − x) = 0. • The arithmetic mean is the constant that minimises the sum of the squares of the deviations of each observation from the constant itself: mina (xi − a)2 = x. 
• The arithmetic mean is a linear operator: 1 N (a + bxi) = a + bx. A second simple index of position is the modal value or mode. The mode is a measure of location computable for all kinds of variables, including the qualitative nominal ones. For qualitative or discrete quantitative characters, the mode is the level associated with the greatest frequency. To estimate the mode of a continuous variable, we generally discretise the data intervals as we did for the histogram and compute the mode as the interval with the maximum density (corresponding to the maximum height of the histogram). To obtain a unique mode, the convention is to use the middle value of the mode’s interval. A third important measure of position is the median. In an ordered sequence of data the median is the value for which half the observations are greater and half are less. It divides the frequency distribution into two parts with equal area. The median is computable for quantitative variables and ordinal qualitative variables. Given N observations in non-decreasing order, the median is obtained as follows: • If N is odd, the median is the observation which occupies the position (N + 1)/2. • If N is even, the median is the mean of the observations that occupy positions N/2 and N/2 + 1. The median remains unchanged if the smallest and largest observations are sub- stituted with any other value that is still lower (or greater) than the median. For this reason, unlike the mean, anomalous or extreme values do not influence the median assessment of the distribution’s location. EXPLORATORY DATA ANALYSIS 37 As a generalisation of the median, one can consider the values that subdivide the frequency distribution into parts having predetermined frequencies or per- centages. Such values are called quantiles or percentiles. Of particular interest are the quartiles; these correspond to the values which divide the distribution into four equal parts. More precisely, the quartiles q1,q2,q3, the first, second and third quartile, are such that the overall relative frequency with which we observe values less than q1 is 0.25, less than q2 is 0.5 and less than q3 is 0.75. Note that q2 coincides with the median. 3.1.2 Measures of variability It is usually interesting to study the dispersion or variability of a distribution. A simple indicator of variability is the difference between the maximum observed value and the minimum observed value of a certain variable, known as the range. Another index is constructed by taking the difference between the third quartile and the first quartile, the interquartile range (IQR). The range is highly sensitive to extreme observations, but the IQR is a robust measure of spread for the same reason the median is a robust measure of location. Range and IQR are not used very often. The measure of variability most commonly used for quantitative data is the variance. Given a set x1,x2,...,xN of N quantitative observations of a variable X, and indicating with x their arithmetic mean, the variance is defined by σ 2(X) = 1 N (xi − x)2 the average squared deviation from the mean. When calculated on a sample rather then the whole population it is also denoted by s2; then using N − 1in the denominator instead of N makes s2 an unbiased estimate of the population variance (Section 5.1). When all the observations have the same value then the variance is zero. Unlike the mean, the variance is not a linear operator. It holds that Var(a + bX) = b2 Var(X). The variance squares the units in which X is measured. 
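Before turning to the question of units, here is a compact sketch (Python with pandas, on simulated right-skewed incomes; the data is invented) of the measures of location and variability introduced so far.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
# Hypothetical right-skewed incomes.
x = pd.Series(rng.gamma(shape=2.0, scale=150.0, size=1000))

# Measures of location.
print(x.mean())                                   # arithmetic mean
print(x.median())                                 # median
print(x.quantile([0.25, 0.50, 0.75]))             # quartiles q1, q2, q3
print(x.round(-2).mode().iloc[0])                 # modal class after rounding
                                                  # (continuous data has no unique mode)

# Measures of variability.
print(x.max() - x.min())                          # range
print(x.quantile(0.75) - x.quantile(0.25))        # interquartile range (IQR)
print(x.var(ddof=0))                              # variance with denominator N
print(x.var(ddof=1))                              # unbiased version, denominator N - 1
print(x.std(ddof=0))                              # standard deviation
print(x.std(ddof=0) / abs(x.mean()))              # coefficient of variation (CV)
```

The variance printed above is, as just noted, expressed in the squared units of the variable being measured.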
That is, if X measures a distance in metres, the variance will be in square metres. In practice it is more convenient to preserve the original units for the measure of spread; that is why the square root of the variance, known as the standard deviation, is often reported. Furthermore, to facilitate comparisons between different distributions, the coefficient of variation (CV) is often used. CV equals the standard deviation divided by the absolute value of the arithmetic mean of the distribution (CV is defined only when the mean is non-zero); it is a unitless measure of spread. 3.1.3 Measures of heterogeneity The measures in the previous section cannot be computed for qualitative data, but we can still measure dispersion by using the heterogeneity of the observed distribution. Consider the general representation of the frequency distribution of 38 APPLIED DATA MINING Table 3.1 Frequency distribution for a qualitative variable. Modality Relative frequencies x∗ 1 p1 x∗ 2 p2 ... ... x∗ k pk a qualitative variable with k levels (Table 3.1). In practice it is possible to have two extreme situations between which the observed distribution will lie: • Null heterogeneity is when all the observations have X equal to the same level; that is, pi = 1 for a certain i and pi = 0 for the other k–1 levels. • Maximum heterogeneity is when the observations are uniformly distributed among the k levels; that is, pi = 1/k for all i = 1,...,k. A heterogeneity index will have to attain its minimum in the first situation and its maximum in the second one. We now introduce two indexes that satisfy such conditions. The Gini index of heterogeneity is defined by G = 1 − k i=1 p2i It can be easily verified that the Gini index is equal to 0 in the case of perfect homogeneity and equal to 1–1/k in the case of maximum heterogeneity. To obtain a ‘normalised’ index, which takes values in the interval [0,1], the Gini index can be rescaled by its maximum value, giving the following relative index of heterogeneity: G = G (k − 1)/k The second index of heterogeneity is the entropy, defined by E =− k i=1 pi log pi This index equals 0 in the case of perfect homogeneity and log k in the case of maximum heterogeneity. To obtain a ‘normalised’ index, which assumes values EXPLORATORY DATA ANALYSIS 39 in the interval [0,1], we can rescale E by its maximum value, obtaining the following relative index of heterogeneity: E = E log k 3.1.4 Measures of concentration Concentration is very much related to heterogeneity. In fact, a frequency distri- bution is said to be maximally concentrated when it has null heterogeneity and minimally concentrated when it has maximal heterogeneity. It is interesting to examine intermediate situations, where the two concepts find a different interpre- tation. In particular, the concept of concentration applies to variables measuring transferable goods (quantitative and ordinal qualitative). The classical example is the distribution of a fixed amount of income among N individuals; we shall use this as a running example. Consider N non-negative quantities measuring a transferable characteristic placed in non-decreasing order: 0 ≤ x1 ≤···≤xN The aim is to understand the concentration of the characteristic among the N quantities, corresponding to different observations. Let Nx = xi, the total available amount, where x is the arithmetic mean. Two extreme situations can arise: • x1 = x2 =···=xN = x, corresponding to minimum concentration (equal income for the running example). 
• x1 = x2 =···=xN−1 = 0,xN = Nx, corresponding to maximum concentra- tion (only one unit gets all income). In general, we want to evaluate the degree of concentration, which usually lies between these two extremes. To do this, we are going to build a measure of the concentration. Define Fi = i N for i = 1,...,N Qi = x1 + x2 +···+xi Nx = i j=1 xj Nx , for i = 1,...,N For each i, Fi is the cumulative percentage of considered units, up to the ith unit and Qi is the cumulative percentage of the characteristic that belongs to the same first i units. It can be shown that: 0 ≤ Fi ≤ 1; 0 ≤ Qi ≤ 1 Qi ≤ Fi FN = QN = 1 40 APPLIED DATA MINING Let F0 = Q0 = 0 and consider the N + 1 pairs of coordinates (0,0), (F1,Q1),...,(FN−1,QN−1), (1,1). If we plot these points in the plane and join them with line segments, we obtain a piecewise linear curve called the concentration curve. To illustrate the concept, Table 3.2 contains the ordered income of seven indi- viduals and the calculations needed to obtain the concentration curve. Figure 3.2 shows the concentration curve obtained from the data. It also includes the 45◦ line corresponding to minimal concentration. Notice how the observed situation departs from the line of minimal concentration, and from the case of maximum concentration, described by a curve almost coinciding with the x-axis, at least until the (N − 1)th point. Table 3.2 Construction of the concentration curve. Income Fi Qi 0 0 11 1/7 11/256 15 2/7 26/256 20 3/7 46/256 30 4/7 76/256 50 5/7 126/256 60 6/7 186/256 70 1 1 0 0.2 0.4 0.6 0.8 1 Fi Q i Figure 3.2 Representation of the concentration curve. EXPLORATORY DATA ANALYSIS 41 A summary index of concentration is the Gini concentration index, based on the differences Fi − Qi. There are three points to note: • For minimum concentration, Fi − Qi = 0,i = 1, 2,...,N. • For maximum concentration, Fi − Qi = Fi,i = 1, 2,...,N − 1andFN − QN = 0. • In general, 0 median, (b) mean = median, (c) mean < median. * * MeQ1 Q3 T1 T2 outliers Figure 3.4 A boxplot. The boxplot permits us to identify the asymmetry of the considered distribution. If the distribution were symmetric, the median would be equidistant from Q1 and Q3; otherwise the distribution is skewed. For example, when the distance between Q3 and the median is greater than the distance between Q1 and the median, the distribution is skewed to the right. The boxplot also indicates the presence of anomalous observations, or outliers. Observations smaller than T1 or greater than T2 can be seen as outliers, at least on an exploratory basis. Figure 3.4 indicates that the median is closer to the first quartile than the third quartile, so the distribution seems skewed to the right. Moreover, some anomalous observations are present at the right tail of the distribution. Let us construct a summary statistical index that can measure a distribution’s degree of asymmetry. The proposed index is based on calculating µ3 = (xi − x)3 N known as the third central moment of the distribution. The asymmetry index is then defined by γ = µ3 σ 3 where σ is the standard deviation. From its definition, the asymmetry index is calculable only for quantitative variables. It can assume every real value (i.e. it is not normalised). Here are three particular cases: EXPLORATORY DATA ANALYSIS 43 • If the distribution is symmetric, γ = 0. • If the distribution is left asymmetric, γ<0. • If the distribution is right asymmetric, γ>0. 3.1.6 Measures of kurtosis Continuous data can be represented using a histogram. 
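Before examining the shape of the histogram, here is a brief sketch (Python with numpy, not from the book) of the heterogeneity indexes of Section 3.1.3 and the concentration measures of Section 3.1.4, applied to the relative frequencies of Table 2.7 and to the seven incomes of Table 3.2. The final line uses the standard Gini concentration ratio built from the differences F_i − Q_i; treat that exact normalisation as an assumption rather than the book's formula.

```python
import numpy as np

# --- Heterogeneity of a qualitative variable (Section 3.1.3) ---
p = np.array([0.59, 0.41])                 # relative frequencies, as in Table 2.7
k = len(p)
gini = 1.0 - np.sum(p ** 2)                # Gini heterogeneity index G
entropy = -np.sum(p * np.log(p))           # entropy E
print(gini / ((k - 1) / k))                # normalised index G'
print(entropy / np.log(k))                 # normalised index E'

# --- Concentration of a transferable quantity (Section 3.1.4) ---
income = np.array([11, 15, 20, 30, 50, 60, 70], dtype=float)   # Table 3.2, ordered
N = len(income)
F = np.arange(1, N + 1) / N                # cumulative shares of units F_i
Q = np.cumsum(income) / income.sum()       # cumulative shares of income Q_i
# Gini concentration index: a normalised sum of the gaps F_i - Q_i
# (assumed, standard formulation).
R = np.sum(F[:-1] - Q[:-1]) / np.sum(F[:-1])
print(R)
```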
The form of the histogram gives information about the data. It is also possible to approximate, or even to interpolate, a histogram with a density function of a continuous type. In particular, when the histogram has a very large number of classes and each class is relatively narrow, the histogram can be approximated using a normal or Gaussian density function, which has the shape of a bell (Figure 3.5). In Figure 3.5 the x-axis represents the observed values and the y-axis repre- sents the values corresponding to the density function. The normal distribution is an important theoretical model frequently used in inferential statistical anal- ysis (Section 5.1). Therefore it is reasonable to construct a statistical index that measures the ‘distance’ of the observed distribution from the theoretical situation corresponding to perfect normality. The index of kurtosis is a simple index that allows us to check whether the examined data follows a normal distribution: β = µ4 µ2 2 where µ4 = (xi − x)4 N and µ2 = (xi − x)2 N This index is calculable only for quantitative variables and it can assume every real positive value. Here are three particular cases: • If the variable is perfectly normal, β = 3. −4 −20 2 4 0 0.1 0.2 0.3 0.4 x Figure 3.5 Normal approximation to the histogram. 44 APPLIED DATA MINING • If β<3 the distribution is called hyponormal (thinner with respect to the normal distribution having the same variance, so there is a lower frequency for values very distant from the mean). • If β>3 the distribution is called hypernormal (fatter with respect to the normal distribution, so there is a greater frequency for values very distant from the mean). There are other graphical tools useful for checking whether the examined data can be approximated using a normal distribution. The most common one is the so-called ‘quantile-quantile’ plot, often abbreviated to qq-plot. This is a graph in which the observed quantiles from the observed data are compared with the theoretical quantiles that would be obtained if the data came from a true normal distribution. The graph is a set of points on a plane. The closer they come to the 45◦ line passing through the origin, the more closely the observed data matches data from a true normal distribution. Consider the qq-plots in Figure 3.6 they demonstrate some typical situations that occur in actual data analysis. With most popular statistical software it is easy to obtain the indexes men- tioned in this section, plus others too. Table 3.3 shows an example of a typical Theoretical Theoretical Observed Observed (a) (b) (c) (d) Theoretical Theoretical Observed Observed Figure 3.6 Examples of qq-plots: (a) hyponormal distribution, (b) hypernormal distri- bution, (c) left asymmetric distribution, (d) right asymmetric distribution. EXPLORATORY DATA ANALYSIS 45 Table 3.3 Example of software output for univariate analysis: the variable is the extra return of an investment fund. Moments Quantiles N 120 Sum Wgts 120 100%Max 2029 99% 1454 Mean 150.2833 Sum 18034 75%Q3 427 95% 861 Std Dev 483.864 Variance 234124.3 50%Med 174.5 90% 643.5 Skewness 0.298983 Kurtosis 2.044782 25%Q1 -141 10% -445.5 CV 321.9678 Range 3360 0%Min -1331 5% -658.5 Q3-Q1 568 Mode 186 1% -924 Extremes Lowest Obs Highest Obs -1331( 71) 1131( 31) -924( 54) 1216( 103) -843( 19) 1271( 67) -820( 21) 1454( 81) -754( 50) 2029( 30) Missing Value . Count 140 % Count/Nobs 53.85 software output for this purpose, obtained from PROC UNIVARIATE of SAS. 
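As a rough counterpart to that output, the sketch below (Python with pandas and scipy, on simulated data) computes the asymmetry and kurtosis indexes from the central moments and compares them with the library versions, which follow slightly different conventions.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(3)
x = pd.Series(rng.normal(size=120))     # hypothetical extra returns

mu = x.mean()
m2 = ((x - mu) ** 2).mean()
m3 = ((x - mu) ** 3).mean()
m4 = ((x - mu) ** 4).mean()

gamma = m3 / m2 ** 1.5                  # asymmetry index, close to 0 for normal data
beta = m4 / m2 ** 2                     # kurtosis index, close to 3 for normal data
print(gamma, beta)

# pandas reports skewness and *excess* kurtosis (beta - 3) with small-sample
# corrections, so the values differ slightly from the moment-based ones above.
print(x.skew(), x.kurt())

# Ingredients of a qq-plot: theoretical normal quantiles versus observed quantiles.
theoretical_q, observed_q = stats.probplot(x, dist="norm", fit=False)
```

The SAS output in Table 3.3 collects the same kind of summaries.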
Besides the main measures of location, it gives the quantiles as well as the min- imum and the maximum observed values. The kurtosis index calculated by SAS actually corresponds to β − 3. 3.2 Bivariate exploratory analysis The relationship between two variables can be graphically represented using a scatterplot. Figure 3.7 shows the relationship between the observed values in two performance indicators, return on investment (ROI) and return on equity (ROE), for a set of business enterprises in the computer sector. There is a noticeable increasing trend in the relationship between the two variables. Both variables in Figure 3.7 are quantitative and continuous, but a scatterplot can be drawn for all kinds of variables. A real data set usually contains more than two variables, but it is still possible to extract interesting information from the analysis of every possible bivariate scatterplot between all pairs of the variables. We can create a scatterplot matrix in which every element corresponds to the scatterplot of the two corresponding variables indicated by the row and the column. Figure 3.8 is an example of a scatterplot matrix for real data on the weekly returns of an investment fund made up of international shares and a series of worldwide financial indexes. The period of observation for all the variables starts on 4 October 1994 and ends on 4 October 1999, for a total of 262 working days. Notice that the variable REND shows an increasing relationship with all financial indexes and, in particular, with 46 APPLIED DATA MINING −150 −100 −50 0 50 100 150 −30 −20 −100 10203040 ROI ROE Figure 3.7 Example of a scatterplot diagram. Figure 3.8 Example of a scatterplot matrix. the EURO, WORLD and NORDAM indexes. The squares containing the variable names also contain the minimum and maximum value observed for that variable. It is useful to develop bivariate statistical indexes that further summarise the frequency distribution, improving the interpretation of data, even though we may EXPLORATORY DATA ANALYSIS 47 lose some information about the distribution. In the bivariate case, and more generally in the multivariate case, these indexes permit us to summarise the distribution of each data variable, but also to learn about the relationship between the variables (corresponding to the columns of the data matrix). The rest of this section focuses on quantitative variables, for which summary indexes are more easily formulated, typically by working directly with the data matrix. Section 3.4 explains how to develop summary indexes that describe the relationship between qualitative variables. Concordance is the tendency of observing high (low) values of a variable together with high (low) values of the other. Discordance is the tendency of observing low (high) values of a variable together with high (low) values of the other. For measuring concordance, the most common summary measure is the covariance, defined as Cov(X, Y ) = 1 N N i=1 [xi − µ(X)][yi − µ(Y )] where µ(X) is the mean of variable X and µ(Y ) is the mean of variable Y. The covariance takes positive values if the variables are concordant and negative values if they are discordant. With reference to the scatterplot representation, setting the point (µ(X), µ(Y )) as the origin, Cov(X, Y) tends to be positive when most of the observations are in the upper right-hand and lower left-hand quadrants. Conversely, it tends to be negative when most of the observations are in the lower right-hand and upper left-hand quadrants. 
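A minimal numpy sketch of the covariance just defined; the variable names echo the ROI and ROE of Figure 3.7, but the values are simulated rather than taken from the book's data.

```python
import numpy as np

rng = np.random.default_rng(4)
roi = rng.normal(loc=5.0, scale=10.0, size=50)       # hypothetical ROI values
roe = 1.5 * roi + rng.normal(scale=8.0, size=50)     # built to be concordant with ROI

# Covariance as the average product of the deviations from the two means.
cov_roi_roe = np.mean((roi - roi.mean()) * (roe - roe.mean()))
print(cov_roi_roe)                                   # positive: the variables are concordant

# The same quantity computed by numpy (bias=True uses the 1/N definition).
print(np.cov(roi, roe, bias=True)[0, 1])
```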
Notice that the covariance is directly calculable from the data matrix. In fact, since there is a covariance for each pair of variables, this calculation gives rise to a new data matrix, called the variance–covariance matrix. In this matrix the rows and columns correspond to the available variables. The main diagonal contains the variances and the cells outside the main diagonal contain the covariances between each pair of variables. Since Cov(Xj ,Xi) = Cov(Xi,Xj ), the resulting matrix will be symmetric (Table 3.4). Table 3.4 The variance–covariance matrix. X1 ... Xj ... Xh X1 Var(X1) ... Cov(X1,Xj ) ... Cov(X1,Xh) ... ... ... ... ... ... Xj Cov(Xj ,X1) ... Var(Xj ) ... ... ... ... ... ... ... ... Xh Cov(Xh,X1) ... ... ... Var(Xh) 48 APPLIED DATA MINING The covariance is an absolute index; that is, it can identify the presence of a relationship between two quantities but it says little about the degree of this rela- tionship. In other words, to use the covariance as an exploratory index, it need to be normalised, making it a relative index. The maximum value that Cov(X, Y ) can assume is σxσy, the product of the two standard deviations of the vari- ables. The minimum value that Cov(X, Y ) can assume is – σxσy.Furthermore, Cov(X, Y) assumes its maximum value when the observed data points lie on a line with positive slope; it assumes its minimum value when the observed data points lie on a line with negative slope. In light of this, we define the (linear) correlation coefficient between two variables X and Y as r(X,Y) = Cov(X, Y ) σ(X)σ(Y) The correlation coefficient r(X,Y) has the following properties: • r(X, Y) takes the value 1 when all the points corresponding to the joint observations are positioned on a line with positive slope, and it takes the value – 1 when all the points are positioned on a line with negative slope. That is why r is known as the linear correlation coefficient. • When r(X,Y) = 0 the two variables are not linked by any type of linear relationship; that is, X and Y are uncorrelated. • In general, −1 ≤ r(X,Y) ≤ 1. As for the covariance, it is possible to calculate all pairwise correlations directly from the data matrix, thus obtaining a correlation matrix. The structure of such a matrix is shown in Table 3.5. For the variables plotted in Figure 3.7 the cor- relation matrix is as in Table 3.6. Table 3.6 takes the ‘visual’ conclusions of Figure 3.7 and makes them stronger and more precise. In fact, the variable REND is strongly positively correlated with EURO, WORLD and NORDAM. In general, there are many variables exhibiting strong correlation. Interpreting the magnitude of the linear correlation coefficient is not partic- ularly easy. It is not clear how to distinguish the ‘high’ values from the ‘low’ Table 3.5 The correlation matrix. X1 ... Xj ... Xh X1 1 ... Cor(X1,Xj ) ... Cor(X1,Xh) ... ... ... ... ... ... Xj Cor(Xj ,X1) ... 1 ... ... ... ... ... ... ... ... Xh Cor(Xh,X1) ... ... ... 1 EXPLORATORY DATA ANALYSIS 49 Table 3.6 Example of a correlation matrix. values of the coefficient, in absolute terms, so that we can distinguish the impor- tant correlations from the irrelevant. Section 5.3 considers a model-based solution to this problem when examining statistical hypothesis testing in the context of the normal linear model. But to do that we need to assume the pair of variables have a bivariate Gaussian distribution. 
From an exploratory viewpoint, it would be convenient to have a threshold rule that tells us when there is substantial information in the data to reject the hypothesis that the correlation coefficient is zero. Assuming the observed sample comes from a bivariate normal distribution (Section 5.3), we can use a rule of the following type: reject the hypothesis that the correlation coefficient is null when

$$\left|\frac{r(X, Y)}{\sqrt{1 - r^2(X, Y)}}\,\sqrt{n - 2}\right| > t_{\alpha/2}$$

where tα/2 is the (1 − α/2) percentile of a Student's t distribution with n − 2 degrees of freedom, n being the number of observations (Section 5.1). For example, for a large sample and a significance level of α = 5% (which sets the probability of incorrectly rejecting a null correlation), the threshold is t0.025 = 1.96. The previous inequality asserts that we should conclude that the correlation between two variables is 'significantly' different from zero when the left-hand side exceeds tα/2. For example, applying this rule to Table 3.6, with tα/2 = 1.96, it turns out that all the observed correlations are significantly different from zero.

3.3 Multivariate exploratory analysis of quantitative data

Matrix notation allows us to express multivariate measures more compactly. We assume that the data matrix is entirely composed of quantitative variables; Section 3.4 deals with qualitative variables. Let X be a data matrix with n rows and p columns. The main summary measures can be expressed directly in terms of matrix operations on X. For example, the arithmetic means of the variables, described by a p-dimensional vector X̄, can be obtained directly from the data matrix as

$$\bar{X} = \frac{1}{n}\,\mathbf{1}X$$

where 1 indicates a (row) vector of length n with all elements equal to 1.

As we have seen in Section 2.5, it is often a good idea to standardise the variables in X. To achieve this aim, we first need to subtract the mean from each variable. The matrix containing the deviations from each variable's mean is

$$\tilde{X} = X - \frac{1}{n}JX$$

where J is an n × n matrix with all elements equal to 1.

Consider now the variance–covariance matrix, denoted by S. S is a p × p square matrix containing the variance of each variable on the main diagonal, while the off-diagonal elements contain the p(p − 1)/2 covariances between all the pairs of the p considered variables. In matrix notation we can write

$$S = \frac{1}{n}\tilde{X}'\tilde{X}$$

where X̃' represents the transpose of X̃. The (i, j) element of the matrix is

$$S_{ij} = \frac{1}{n}\sum_{\ell=1}^{n}(x_{\ell i} - \bar{x}_i)(x_{\ell j} - \bar{x}_j)$$

S is symmetric and positive definite, meaning that for any non-zero vector x, x'Sx > 0.

It can be appropriate, for example in a comparison between different databases, to summarise the whole variance–covariance matrix with a single real number that expresses the 'overall variability' of the system. This is usually done through one of two alternative measures. The trace, denoted by tr, is the sum of the elements on the main diagonal of S, that is, of the variances of the variables:

$$\mathrm{tr}(S) = \sum_{s=1}^{p}\sigma_s^2$$

It can be shown that the trace of S is equal to the sum of the eigenvalues of the matrix itself:

$$\mathrm{tr}(S) = \sum_{s=1}^{p}\lambda_s$$

A second measure of overall variability is defined by the determinant of S and is often called the Wilks generalised variance:

$$W = |S|$$

We have seen how to transform the variance–covariance matrix into the correlation matrix so that we can interpret the relationships more easily.
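For concreteness, here is a minimal numpy sketch of these matrix operations on a randomly generated data matrix (the data and dimensions are invented; this is an illustration, not the book's own computation).

```python
import numpy as np

# X is an n x p data matrix of quantitative variables (random values here).
rng = np.random.default_rng(1)
n, p = 100, 4
X = rng.normal(size=(n, p))

ones = np.ones((n, 1))
mean_vec = (ones.T @ X) / n        # (1/n) 1'X : row vector of arithmetic means
X_tilde = X - ones @ mean_vec      # deviations from the means, X - (1/n) J X
S = (X_tilde.T @ X_tilde) / n      # variance-covariance matrix S = (1/n) X~'X~

trace_S = np.trace(S)                      # overall variability: sum of the variances
eigen_sum = np.linalg.eigvalsh(S).sum()    # numerically equal to tr(S)
wilks = np.linalg.det(S)                   # Wilks generalised variance |S|

print(trace_S, eigen_sum, wilks)
```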
The correlation matrix, R, is computable as

$$R = \frac{1}{n}Z'Z$$

where Z = X̃F is the matrix containing the standardised variables (Section 2.5) and F is a p × p matrix whose diagonal elements are the reciprocals of the standard deviations of the variables:

$$F = [\mathrm{diag}(s_{11}, \ldots, s_{pp})]^{-1}$$

Although the correlation matrix is very informative about the presence of statistical (linear) relationships between the considered variables, it calculates them marginally for every pair of variables, without taking the influence of the remaining variables into account.

To filter out spurious effects induced by other variables, a useful tool is the partial correlation. The partial correlation measures the linear relationship between two variables with the others held fixed. Let r_{ij|REST} be the partial correlation observed between the variables Xi and Xj, given all the remaining variables, and let K = R⁻¹ be the inverse of the correlation matrix. It can be shown that

$$r_{ij|\mathrm{REST}} = \frac{-k_{ij}}{\sqrt{k_{ii}\,k_{jj}}}$$

where kii, kjj and kij are the elements at positions (i, i), (j, j) and (i, j) of the matrix K.

The importance of reasoning in terms of partial correlations is particularly evident in databases characterised by strong collinearity between the variables. For example, in an analysis of the correlation structure between the daily performances of 12 sector indexes of the American stock market in the period 4/1/1999 to 29/2/2000, I have computed the marginal correlations between the NASDAQ100 index and the COMPUTER and BIOTECH sector indexes, obtaining 0.99 and 0.94 respectively. However, the corresponding partial correlations are much smaller, 0.45 and 0.13 respectively. This occurs because there is strong correlation among all the considered indexes, so the marginal correlations also reflect the spurious correlation between two variables induced by the others. In particular, the much lower partial correlation of the BIOTECH index indicates that it has a smaller weight than the COMPUTER index in the NASDAQ100 index.

3.4 Multivariate exploratory analysis of qualitative data

So far we have used covariance and correlation as our main measures of statistical relationship between quantitative variables. With ordinal qualitative variables, it is possible to extend the notions of covariance and correlation to the ranks of the observations. The correlation between the variable ranks is known as the Spearman correlation coefficient. Table 3.7 shows how to express the ranks of two ordinal qualitative variables that describe the quality and the menu of four different restaurants. The Spearman correlation of the data in Table 3.7 is zero, therefore the ranks of the two variables are not correlated.

Table 3.7 Ranking of ordinal variables.

  Variable A   Variable B     Rank of A   Rank of B
  High         Simple         3           1
  Medium       Intermediate   2           2
  Medium       Elaborated     2           3
  Low          Simple         1           1

More generally, transforming the levels of ordinal qualitative variables into the corresponding ranks allows most of the analyses applicable to quantitative data to be extended to the ordinal qualitative case. This can also include principal component analysis (Section 3.5). However, if the data matrix contains qualitative data at the nominal level (and not binary, otherwise they could be treated as quantitative, as in Section 2.3), the notions of covariance and correlation cannot be used.
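For the ordinal case just described, a minimal sketch (using the ranks of Table 3.7 exactly as given in the text) shows that the Spearman coefficient is simply the linear correlation computed on the ranks.

```python
import numpy as np

# Ranks from Table 3.7 (quality and menu of four restaurants).
rank_a = np.array([3, 2, 2, 1])   # High, Medium, Medium, Low
rank_b = np.array([1, 2, 3, 1])   # Simple, Intermediate, Elaborated, Simple

# The Spearman coefficient is the linear correlation of the ranks.
rho = np.corrcoef(rank_a, rank_b)[0, 1]
print(rho)   # 0.0: the two rankings are uncorrelated, as stated in the text
```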
The rest of this section considers summary measures for the intensity of the relationship between qualitative variables of any kind. These measures are known as association indexes. They can sometimes be applied to discrete quantitative variables as well, but with a loss of explanatory power.

In the examination of qualitative variables, a fundamental role is played by the frequencies of the levels of the variables. We therefore begin with the contingency table introduced in Section 2.4. Unlike Section 2.4, qualitative data are often available directly in the form of a contingency table, without needing to access the original data matrix. To emphasise this difference, we now introduce a slightly different notation, which we shall use throughout. Given a qualitative variable X which assumes the levels X1, ..., XI, collected in a population (or sample) of n units, the absolute frequency of level Xi (i = 1, ..., I) is the number of times the variable X is observed having value Xi. Denote this absolute frequency by ni. Table 3.8 presents a theoretical two-way contingency table to introduce the notation used in this section.

Table 3.8 Theoretical two-way contingency table.

           Y1     ...  Yj     ...  YJ     Total
  X1       n11    ...  n1j    ...  n1J    n1+
  ...      ...         ...         ...    ...
  Xi       ni1    ...  nij    ...  niJ    ni+
  ...      ...         ...         ...    ...
  XI       nI1    ...  nIj    ...  nIJ    nI+
  Total    n+1    ...  n+j    ...  n+J    n

In Table 3.8, nij indicates the frequency associated with the pair of levels (Xi, Yj), i = 1, 2, ..., I; j = 1, 2, ..., J, of the variables X and Y. The nij are also called cell frequencies.

• n_{i+} = Σ_{j=1}^{J} n_{ij} is the marginal frequency of the ith row of the table; it represents the total number of observations which assume the ith level of X (i = 1, 2, ..., I).
• n_{+j} = Σ_{i=1}^{I} n_{ij} is the marginal frequency of the jth column of the table; it denotes the total number of observations which assume the jth level of Y (j = 1, 2, ..., J).

For the frequencies in the table, the following marginalisation relationship holds:

$$\sum_{i=1}^{I} n_{i+} = \sum_{j=1}^{J} n_{+j} = \sum_{i=1}^{I}\sum_{j=1}^{J} n_{ij} = n$$

From an n × p data matrix it is possible to construct p(p − 1)/2 two-way contingency tables, corresponding to all possible pairs of qualitative variables. However, it is usually reasonable to limit ourselves to those corresponding to interesting 'intersections' between the variables, that is, to the pairs whose joint distribution may be important and a useful complement to the univariate frequency distributions.

3.4.1 Independence and association

To develop descriptive indexes of the relationship between qualitative variables, we need the concept of statistical independence. Two variables, X and Y, are said to be independent, with reference to the n observations, if they satisfy the conditions

$$\frac{n_{i1}}{n_{+1}} = \cdots = \frac{n_{iJ}}{n_{+J}} = \frac{n_{i+}}{n} \quad \forall\, i = 1, 2, \ldots, I$$

or, equivalently,

$$\frac{n_{1j}}{n_{1+}} = \cdots = \frac{n_{Ij}}{n_{I+}} = \frac{n_{+j}}{n} \quad \forall\, j = 1, 2, \ldots, J$$

If this occurs it means that, with reference to the first set of equations, the bivariate analysis of the two variables does not give any additional information about X beyond the univariate analysis of the variable X itself, and similarly for Y in the second set. In this case Y and X are said to be statistically independent. Notice from the definition that statistical independence is a symmetric concept in the two variables: if X is independent of Y, then Y is independent of X.
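As a concrete check of these conditions, one can compare the column profiles of a contingency table with the row marginals; a minimal pandas sketch, with invented data for two hypothetical qualitative variables, is shown below.

```python
import pandas as pd

# Two hypothetical qualitative variables observed on n units (invented values).
df = pd.DataFrame({
    "X": ["a", "a", "b", "b", "b", "c", "c", "a", "b", "c"],
    "Y": ["yes", "no", "yes", "yes", "no", "no", "yes", "yes", "no", "no"],
})

# Two-way contingency table with marginal totals (the n_ij, n_i+, n_+j of Table 3.8).
table = pd.crosstab(df["X"], df["Y"], margins=True, margins_name="Total")
print(table)

# Column profiles n_ij / n_+j: under independence every column profile
# equals the vector of row marginal frequencies n_i+ / n.
profiles = pd.crosstab(df["X"], df["Y"], normalize="columns")
marginal = df["X"].value_counts(normalize=True).sort_index()
print(profiles)
print(marginal)
```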
The previous conditions can be equivalently, and more conveniently, expressed as a function of the marginal frequencies ni+ and n+j. Then X and Y are independent if

$$n_{ij} = \frac{n_{i+}\,n_{+j}}{n} \quad \forall\, i = 1, 2, \ldots, I;\ \forall\, j = 1, 2, \ldots, J$$

In terms of relative frequencies, this is equivalent to

$$p_{XY}(x_i, y_j) = p_X(x_i)\,p_Y(y_j)$$

for every i and every j. When working with real data, the statistical independence condition is almost never satisfied exactly; consequently, observed data will often show some degree of interdependence between the variables.

The notion of statistical independence applies to both qualitative and quantitative variables. A measure of interdependence, however, operates differently for qualitative variables than for quantitative ones. For quantitative variables it is possible to calculate summary measures (called correlation measures) that work on both the levels and the frequencies. For qualitative variables the summary measures (called association measures) can use only the frequencies, because the levels are not metric.

For quantitative variables, an important relationship holds between statistical independence and the absence of correlation. If two variables, X and Y, are statistically independent, then Cov(X, Y) = 0 and r(X, Y) = 0. The converse is not necessarily true: two variables may satisfy r(X, Y) = 0 even though they are not independent. In other words, the absence of correlation does not imply statistical independence. An exception occurs when the variables X and Y are jointly distributed according to a multivariate normal distribution (Section 5.1); then the two concepts are equivalent.

The greater difficulty of using association measures, compared with correlation measures, lies in the fact that there are so many indexes available in the statistical literature. Here we examine three different classes: distance measures, dependency measures and model-based measures.

3.4.2 Distance measures

Independence between two variables, X and Y, holds when

$$n_{ij} = \frac{n_{i+}\,n_{+j}}{n} \quad \forall\, i = 1, 2, \ldots, I;\ \forall\, j = 1, 2, \ldots, J$$

for all joint frequencies of the contingency table. A first approach to summarising association can therefore be based on a 'global' measure of disagreement between the frequencies actually observed (nij) and those expected under the hypothesis of independence between the two variables (ni+n+j/n). The statistic originally proposed by Karl Pearson is the most widely used measure for verifying the hypothesis of independence between X and Y. In the general case it is defined by

$$X^2 = \sum_{i=1}^{I}\sum_{j=1}^{J}\frac{(n_{ij} - n^*_{ij})^2}{n^*_{ij}}$$

where

$$n^*_{ij} = \frac{n_{i+}\,n_{+j}}{n} \quad i = 1, 2, \ldots, I;\ j = 1, 2, \ldots, J$$

Note that X² = 0 if the variables X and Y are independent; in that case the terms in the numerator are all zero. The statistic X² can be written in the equivalent form

$$X^2 = n\left[\sum_{i=1}^{I}\sum_{j=1}^{J}\frac{n_{ij}^2}{n_{i+}\,n_{+j}} - 1\right]$$

which emphasises the dependence of the statistic on the number of observations, n. This reveals a serious inconvenience: the value of X² is an increasing function of the sample size n. To overcome this inconvenience, alternative measures have been proposed, all functions of the previous statistic. Here is one of them:

$$\phi^2 = \frac{X^2}{n} = \sum_{i=1}^{I}\sum_{j=1}^{J}\frac{n_{ij}^2}{n_{i+}\,n_{+j}} - 1$$

This index is usually called the mean contingency, and the square root of φ² is called the phi coefficient.
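A minimal numpy sketch of these two statistics, applied to an invented 3 × 2 table of observed frequencies, mirrors the formulas directly.

```python
import numpy as np

# Observed frequencies n_ij of a hypothetical 3 x 2 contingency table.
n_ij = np.array([[30, 10],
                 [25, 25],
                 [ 5, 35]])
n = n_ij.sum()

# Expected frequencies under independence: n*_ij = n_i+ n_+j / n
row = n_ij.sum(axis=1, keepdims=True)   # n_i+
col = n_ij.sum(axis=0, keepdims=True)   # n_+j
expected = row @ col / n

X2 = ((n_ij - expected) ** 2 / expected).sum()   # Pearson's X^2
phi2 = X2 / n                                    # mean contingency
print(X2, phi2, np.sqrt(phi2))                   # sqrt(phi2) is the phi coefficient
```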
For 2 × 2 contingency tables, representing binary variables, φ² is normalised, as it takes values between 0 and 1, and it can be shown that

$$\phi^2 = \frac{\mathrm{Cov}^2(X, Y)}{\mathrm{Var}(X)\,\mathrm{Var}(Y)}$$

Therefore, in the case of 2 × 2 tables, φ² is equivalent to the squared linear correlation coefficient. For contingency tables larger than 2 × 2, φ² is not normalised. To obtain a normalised index, useful for comparisons, a different modification of X² is used, called the Cramer index. Following an approach quite common in descriptive statistics, the Cramer index is obtained by dividing φ² by the maximum value it can assume for the structure of the given contingency table. Since this maximum is the minimum of the values I − 1 and J − 1, with I and J respectively the number of rows and columns of the contingency table, the Cramer index is equal to

$$V^2 = \frac{X^2}{n\,\min[(I - 1),\,(J - 1)]}$$

It can be shown that 0 ≤ V² ≤ 1 for any I × J contingency table, and that V² = 0 if and only if X and Y are independent. On the other hand, V² = 1 in the case of maximum dependency between the two variables. Three situations can then be distinguished, referring without loss of generality to Table 3.8:

a) There is maximum dependency of Y on X when every row of the table contains only one non-zero frequency. This happens if to each level of X there corresponds one and only one level of Y. This condition occurs when V² = 1 and I ≥ J.
b) There is maximum dependency of X on Y when every column of the table contains only one non-zero frequency. This means that to each level of Y there corresponds one and only one level of X. This condition occurs when V² = 1 and J ≥ I.
c) If the two previous conditions are simultaneously satisfied, that is, if V² = 1 and I = J, the two variables are maximally interdependent.

We have referred to the case of two-way contingency tables, involving two variables with an arbitrary number of levels. However, the measures presented here can easily be applied to multiway tables, by extending the number of summands in the definition of X² to account for all the table cells.

The association indexes based on the Pearson statistic X² measure the distance of the relationship between X and Y from the situation of independence. They refer to a generic notion of association, in the sense that they measure exclusively the distance from independence, without giving information on the nature of that distance. On the other hand, these indexes are rather general, as they can be applied in the same fashion to all kinds of contingency table. Furthermore, as we shall see in Section 5.4, the statistic X² has an asymptotic probabilistic (theoretical) distribution, so it can also be used to set an inferential threshold for evaluating inductively whether the examined variables are significantly dependent. Table 3.9 shows an example of calculating two X²-based measures; several more applications are given in the second half of the book.

3.4.3 Dependency measures

The association measures seen so far are all functions of the X² statistic, so they are hardly interpretable in most real applications. This important aspect has been underlined by Goodman and Kruskal (1979), who have proposed an alternative approach to measuring the association in a contingency table.

Table 3.9 Comparison of association measures.
  Variable                    X²         V²       U(Y|X)
  Sales variation             235.0549   0.2096   0.0759
  Real estates                116.7520   0.1477   0.0514
  Age of company              107.1921   0.1415   0.0420
  Region of activity           99.8815   0.1366   0.0376
  Number of employees          68.3589   0.1130   0.0335
  Sector of activity           41.3668   0.0879   0.0187
  Sales                        23.3355   0.0660   0.0122
  Revenues                     21.8297   0.0639   0.0123
  Age of owner                  6.9214   0.0360   0.0032
  Legal nature                  4.7813   0.0299   0.0034
  Leadership persistence        4.7420   0.0298   0.0021
  Type of activity              0.5013   0.0097   0.0002

The set-up followed by Goodman and Kruskal is based on defining indexes for the specific context under investigation. In other words, the indexes are characterised by an operational meaning that defines the nature of the dependency between the available variables.

Suppose that, in a two-way contingency table, Y is the 'dependent' variable and X the 'explanatory' variable. It may be interesting to evaluate whether, for a generic observation, knowledge of the level of X is able to reduce the uncertainty about the corresponding category of Y. The degree of uncertainty in the level of a qualitative variable is usually expressed through a heterogeneity index (Section 3.1).

Let δ(Y) indicate a heterogeneity measure for the marginal distribution of Y, described by the vector of marginal relative frequencies {f+1, f+2, ..., f+J}. Similarly, let δ(Y|i) be the same measure calculated on the distribution of Y conditional on the ith row of the variable X in the contingency table, {f1|i, f2|i, ..., fJ|i} (see Section 2.4). An association index based on the 'proportional reduction in the heterogeneity', or error proportional reduction index (EPR), can then be calculated as follows (Agresti, 1990):

$$\mathrm{EPR} = \frac{\delta(Y) - M[\delta(Y|X)]}{\delta(Y)}$$

where M[δ(Y|X)] indicates the mean heterogeneity calculated with respect to the distribution of X, namely

$$M[\delta(Y|X)] = \sum_{i} f_{i+}\,\delta(Y|i), \qquad \text{with } f_{i+} = n_{i+}/n\ (i = 1, 2, \ldots, I)$$

This index measures the proportion of heterogeneity of Y (calculated through δ) that can be 'explained' by the relationship with X. Remarkably, its structure is analogous to that of the squared linear correlation coefficient (Section 4.3.3). By choosing δ appropriately, different association measures can be obtained. Usually the choice is between the Gini index and the entropy index. Using the Gini index in the EPR expression, we obtain the so-called concentration coefficient, τY|X:

$$\tau_{Y|X} = \frac{\sum_{i}\sum_{j} f_{ij}^2/f_{i+} - \sum_{j} f_{+j}^2}{1 - \sum_{j} f_{+j}^2}$$

Using the entropy index in the EPR expression, we obtain the so-called uncertainty coefficient, UY|X:

$$U_{Y|X} = -\frac{\sum_{i}\sum_{j} f_{ij}\,\log\!\left(f_{ij}/(f_{i+}\,f_{+j})\right)}{\sum_{j} f_{+j}\,\log f_{+j}}$$

In the case of null frequencies, it is conventional to set log 0 = 0. Both τY|X and UY|X take values in the interval [0, 1]. In particular, it can be shown that

τY|X = UY|X = 0 if and only if the variables are independent;
τY|X = UY|X = 1 if and only if Y has maximum dependence on X.
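The two coefficients follow directly from the formulas above; a minimal numpy sketch on an invented 3 × 2 table of relative frequencies (the data are made up, not those of Table 3.9) is shown below.

```python
import numpy as np

# Relative frequencies f_ij of a hypothetical 3 x 2 table (X has 3 levels, Y has 2).
n_ij = np.array([[30, 10],
                 [25, 25],
                 [ 5, 35]], dtype=float)
f_ij = n_ij / n_ij.sum()
f_i = f_ij.sum(axis=1)          # f_{i+}, row marginals (X)
f_j = f_ij.sum(axis=0)          # f_{+j}, column marginals (Y)

# Concentration coefficient (Gini-based EPR index), tau_{Y|X}
tau = ((f_ij**2 / f_i[:, None]).sum() - (f_j**2).sum()) / (1 - (f_j**2).sum())

# Uncertainty coefficient (entropy-based EPR index), U_{Y|X}; log 0 is treated as 0.
with np.errstate(divide="ignore", invalid="ignore"):
    terms = np.where(f_ij > 0, f_ij * np.log(f_ij / np.outer(f_i, f_j)), 0.0)
u = -terms.sum() / (f_j * np.log(f_j)).sum()

print(tau, u)   # both lie in [0, 1]; larger values mean Y depends more strongly on X
```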
These indexes have a simple operational interpretation regarding specific aspects of the dependence link between the variables. In particular, both τY|X and UY|X represent alternative quantifications of the reduction of the heterogeneity of Y that can be explained through the dependence of Y on X. From this viewpoint they are rather specific, in comparison with the distance measures of association. On the other hand, they are less general than the distance measures: their application requires the identification of a causal link from one variable (explanatory) to the other (dependent), whereas the X²-based indexes treat the variables symmetrically. Furthermore, they cannot easily be extended to contingency tables with more than two ways, nor easily be used to obtain an inferential threshold.

Table 3.9 is an actual comparison between the distance measures and the uncertainty coefficient UY|X. It is based on data collected for a credit-scoring problem (Chapter 11). The objective of the analysis is to explore which of the explanatory variables described in the table (all qualitative or discrete quantitative) are most associated with the binary response variable. The response variable describes whether or not each of the 7134 business enterprises examined is creditworthy. Note that the distance measures (X² and Cramer's V²) are more variable and seem generally to indicate a higher degree of association. This is due to the more generic type of association that they detect. The uncertainty coefficient is more informative: for instance, it can be said that the variable 'sales variation' reduces the degree of uncertainty about creditworthiness by 7.6%. On the other hand, X² can easily be compared with the inferential threshold that can be associated with it: for a given level of significance, only the first eight variables are significantly associated with creditworthiness.

3.4.4 Model-based measures

We now examine association measures that do not depend on the marginal distributions of the variables; none of the previous measures satisfies this requirement. The class of indexes considered here is easily interpretable, does not depend on the marginal distributions, and is based on probabilistic models, therefore allowing an inferential treatment (Sections 5.4 and 5.5). For ease of notation, we shall assume a probabilistic model in which cell relative frequencies are replaced by cell probabilities. The cell probabilities can be interpreted as relative frequencies when the sample size tends to infinity, and therefore have the same properties as relative frequencies.

Consider a 2 × 2 contingency table for the variables X and Y, respectively associated with the rows (X = 0, 1) and the columns (Y = 0, 1) of the table. Let π11, π00, π10 and π01 indicate the probabilities that an observation is classified in each of the four cells of the table. The odds ratio is a measure of association that constitutes a fundamental parameter in the statistical models for the analysis of qualitative data. Let π1|1 and π0|1 indicate the conditional probabilities of having a 1 (a success) and a 0 (a failure) in row 1, and let π1|0 and π0|0 be the same probabilities for row 0. The odds of success for row 1 are defined by

$$\mathrm{odds}_1 = \frac{\pi_{1|1}}{\pi_{0|1}} = \frac{P(Y = 1 \mid X = 1)}{P(Y = 0 \mid X = 1)}$$

The odds of success for row 0 are defined by

$$\mathrm{odds}_0 = \frac{\pi_{1|0}}{\pi_{0|0}} = \frac{P(Y = 1 \mid X = 0)}{P(Y = 0 \mid X = 0)}$$

The odds are always a non-negative quantity, with a value greater than 1 when a success (level 1) is more probable than a failure (level 0), that is, when P(Y = 1|X = 1) > P(Y = 0|X = 1). For example, odds = 4 means that a success is four times more probable than a failure: we expect to observe four successes for every failure. Conversely, odds = 1/4 = 0.25 means that a failure is four times more probable than a success: we expect to observe one success for every four failures.
The ratio between the two previous odds is the odds ratio:

$$\theta = \frac{\mathrm{odds}_1}{\mathrm{odds}_0} = \frac{\pi_{1|1}/\pi_{0|1}}{\pi_{1|0}/\pi_{0|0}}$$

From the definition of the odds, and using the definition of joint probability, it can easily be shown that

$$\theta = \frac{\pi_{11}\,\pi_{00}}{\pi_{10}\,\pi_{01}}$$

This expression shows that the odds ratio is a cross-product ratio: the product of the probabilities on the main diagonal divided by the product of the probabilities on the secondary diagonal of the contingency table. In the actual computation of the odds ratio, the probabilities are replaced by the observed frequencies, leading to the expression

$$\hat{\theta} = \frac{n_{11}\,n_{00}}{n_{10}\,n_{01}}$$

Here are three properties of the odds ratio:

• The odds ratio can be equal to any non-negative number; that is, it takes values in the interval [0, +∞).
• When X and Y are independent, π1|1 = π1|0, so that odds1 = odds0 and θ = 1. Depending on whether the odds ratio is greater or less than 1, it is possible to evaluate the sign of the association:
  – for θ > 1 there is a positive association, since the odds of success are greater in row 1 than in row 0;
  – for 0 < θ < 1 there is a negative association, since the odds of success are greater in row 0 than in row 1.
• When the order of the rows or the order of the columns is reversed, the new value of θ is the reciprocal of the original value. On the other hand, the odds ratio does not change value when the orientation of the table is reversed so that the rows become columns and the columns become rows. This means that the odds ratio treats the two variables symmetrically, so it is not necessary to identify one variable as dependent and the other as explanatory.

Like the linear correlation coefficient, the odds ratio can be used as an exploratory tool aimed at building a probabilistic model. In particular, we can construct a decision rule that allows us to establish whether a certain observed value of the odds ratio indicates a significant association between the corresponding variables. To this end it is possible to derive a confidence interval, as was done for the correlation coefficient. The resulting rule states that an association is significant when

$$|\log \hat{\theta}| > z_{\alpha/2}\sqrt{\sum_{i,j}\frac{1}{n_{ij}}}$$

where zα/2 is the (1 − α/2) percentile of a standard normal distribution. For instance, when α = 5%, zα/2 = 1.96. The confidence interval used in this case is only approximate; the accuracy of the approximation improves with the sample size.

Table 3.10 shows data on whether different visitors see the group pages catalog (C) and windows (W), from the database described in Chapter 8.

Table 3.10 Observed contingency table between catalog and windows pages.

           W = 0    W = 1
  C = 0    0.4171   0.1295
  C = 1    0.2738   0.1796

From Table 3.10 we have

$$\mathrm{odds}_1 = \frac{P(C = 1 \mid W = 1)}{P(C = 0 \mid W = 1)} = \frac{0.1796}{0.1295} = 1.387$$

$$\mathrm{odds}_0 = \frac{P(C = 1 \mid W = 0)}{P(C = 0 \mid W = 0)} = \frac{0.2738}{0.4171} = 0.656$$

Therefore, when W is visited (W = 1) it is more likely that C is also visited (C = 1); when W is not visited (W = 0), C tends not to be visited either (C = 0). The odds ratio turns out to be θ = 2.114, reflecting a positive (and significant) association between the two variables.

So far we have defined the odds ratio for 2 × 2 contingency tables, but odds ratios can be calculated in the same fashion for larger contingency tables. For an I × J table, an odds ratio can be defined with reference to each of the I(I − 1)/2 pairs of rows, in combination with each of the J(J − 1)/2 pairs of columns; there are therefore I(I − 1)/2 × J(J − 1)/2 odds ratios of this type.
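Going back to the 2 × 2 case, a minimal sketch of the odds ratio and of the approximate significance check described above, using the relative frequencies of Table 3.10, is shown below. The chapter does not report the sample size at this point, so the value of n used for the cell counts is a made-up assumption.

```python
import numpy as np

# Relative frequencies from Table 3.10 (rows: C = 0, 1; columns: W = 0, 1).
p = np.array([[0.4171, 0.1295],
              [0.2738, 0.1796]])

odds1 = p[1, 1] / p[0, 1]   # odds of C = 1 when W = 1, about 1.387
odds0 = p[1, 0] / p[0, 0]   # odds of C = 1 when W = 0, about 0.656
theta = odds1 / odds0       # odds ratio, about 2.11
print(odds1, odds0, theta)

# Approximate significance check: |log theta| > z * sqrt(sum of 1/n_ij).
# n below is a hypothetical sample size, assumed only for the illustration.
n = 5000
n_ij = p * n
se_log = np.sqrt((1.0 / n_ij).sum())
z = 1.96                    # alpha = 5%
print(abs(np.log(theta)) > z * se_log)   # True: significant positive association
```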
Evidently, the number of such odds ratios becomes enormous as the table grows, and it is wise to choose parsimonious representations. It may also be useful to employ graphical representations of the odds ratios. For example, to investigate the dependence of a dichotomous response variable on an explanatory variable with J levels, it can be effective to plot the J odds ratios obtained by crossing the response variable with the J binary variables describing the presence or absence of each level of the explanatory variable.

3.5 Reduction of dimensionality

Multivariate analysis can often be made easier by reducing the dimensionality of the problem, expressed by the number of variables present. For example, it is impossible to visualise graphs in more than three dimensions. The technique typically used is the linear operation known as the principal component transformation. This technique can be used only for quantitative variables and, possibly, for binary variables, although in practice it is often also applied to labelled qualitative data for exploratory purposes. The method is an important starting point for studying all dimensionality reduction techniques. The idea is to transform p statistical variables (usually correlated) into k < p uncorrelated variables. [...]

• When β > 0, π(x) increases as x increases.
• When β < 0, π(x) decreases as x increases.

Furthermore, as β → 0 the curve tends to become a horizontal straight line; in particular, when β = 0, Y is independent of X.

Although the probability of success is a logistic function and therefore not linear in the explanatory variables, the logarithm of the odds is a linear function of the explanatory variables:

$$\log\left(\frac{\pi(x)}{1 - \pi(x)}\right) = \alpha + \beta x$$

Positive log-odds favour Y = 1, whereas negative log-odds favour Y = 0. The log-odds expression establishes that the logit increases by β units for a unit increase in x. It can be used during the exploratory phase to evaluate the linearity of the observed logits: a good linear fit of the explanatory variable with respect to the observed logits encourages us to apply the logistic regression model.

The concept of odds was introduced in Section 3.4. For the logistic regression model, the odds of success can be expressed as

$$\frac{\pi(x)}{1 - \pi(x)} = e^{\alpha + \beta x} = e^{\alpha}\left(e^{\beta}\right)^{x}$$

This exponential relationship offers a useful interpretation of the parameter β: a unit increase in x multiplies the odds by a factor e^β. In other words, the odds at level x + 1 equal the odds at level x multiplied by e^β. When β = 0 we obtain e^β = 1, so the odds do not depend on X.

What about the fitting algorithm, the properties of the residuals, and goodness-of-fit indexes? These concepts can be introduced by interpreting logistic regression as a linear regression model for an appropriate transformation of the variables. They are examined as part of the broader field of generalised linear models (Section 5.4), which should make them easier to understand. I have waited until Section 5.4 to give a real application of the model.
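To make the interpretation of β concrete, here is a minimal sketch that fits a logistic regression on synthetic data and reads off e^β as the multiplicative effect of a unit increase in x on the odds. The data, the parameter values and the use of statsmodels are assumptions made only for this illustration; the book itself relies on other software.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)

# Synthetic data generated from a logistic model with alpha = -1, beta = 0.8.
n = 500
x = rng.normal(size=n)
prob = 1 / (1 + np.exp(-(-1 + 0.8 * x)))
y = rng.binomial(1, prob)

fit = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
alpha_hat, beta_hat = fit.params

# A unit increase in x multiplies the estimated odds of success by exp(beta_hat).
print(beta_hat, np.exp(beta_hat))
```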
4.4.2 Discriminant analysis

Linear regression and logistic regression models are essentially scoring models: they assign a numerical score to each value to be predicted. These scores can be used to estimate the probability that the response variable assumes a predetermined set of values or levels (e.g. all positive values if the response is continuous, or one level if it is binary). Scores can then be used to classify the observations into disjoint classes. This is particularly useful for classifying new observations not already present in the database. This objective is more natural for logistic regression models, where the predicted scores can be converted into binary values, thus classifying the observations into two classes: those predicted to be 0 and those predicted to be 1. To do this, we need a threshold, or cut-off, rule. This type of predictive classification rule is studied by the classical theory of discriminant analysis. We consider the simple and common case in which each observation is to be classified using a binary response: it is either in class 0 or in class 1. The more general case is similar, but more complex to illustrate.

The choice between the two classes is usually based on a probabilistic criterion: choose the class with the highest probability of occurrence, on the basis of the observed data. This rationale, which is optimal when equal misclassification costs are assumed (Section 5.1), leads to an odds-based rule that assigns an observation to class 1 (rather than class 0) when the odds in favour of class 1 are greater than 1, and vice versa. Since logistic regression can be expressed as a linear function of the log-odds, the resulting discriminant rule can be expressed in linear terms, assigning the ith observation to class 1 if

$$a + b_1 x_{i1} + b_2 x_{i2} + \cdots + b_k x_{ik} > 0$$

With a single predictor variable, the rule simplifies to

$$a + b x_i > 0$$

This rule is known as the logistic discriminant rule; it can be extended to qualitative response variables with more than two classes.

An alternative to logistic regression is linear discriminant analysis, also known as Fisher's rule. It is based on the assumption that, for each given class of the response variable, the explanatory variables are distributed as a multivariate normal distribution (Section 5.1) with a common variance–covariance matrix. Under this assumption it is again possible to obtain a classification rule in linear terms. For a single predictor, the rule assigns observation i to class 1 if

$$\log\frac{n_1}{n_0} - \frac{(\bar{x}_1 - \bar{x}_0)^2}{2s^2} + \frac{x_i(\bar{x}_1 - \bar{x}_0)}{s^2} > 0$$

where n1 and n0 are the numbers of observations in classes 1 and 0; x̄1 and x̄0 are the observed means of the predictor X in the two classes, 1 and 0; and s² is the variance of X over all the observations.

Both Fisher's rule and the logistic discriminant rule can be expressed in linear terms, but the logistic rule is simpler to apply and interpret and it does not require any probabilistic assumptions. Fisher's rule is more explicit than the logistic discriminant rule: by assuming a normal distribution, we can add more information to the rule, such as an assessment of its sampling variability. We shall return to discriminant analysis in Section 5.1.
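As a sketch of how Fisher's single-predictor rule works in practice, the code below applies the formula just given to synthetic one-predictor data (the class means, sample sizes and data are invented for the example).

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic one-predictor data: class 1 has a higher mean than class 0.
x0 = rng.normal(0.0, 1.0, 80)    # class 0
x1 = rng.normal(1.5, 1.0, 120)   # class 1
x = np.concatenate([x0, x1])
y = np.concatenate([np.zeros(80), np.ones(120)])

# Fisher's rule with a single predictor, exactly as in the formula above
# (s^2 is the variance of X over all the observations).
n0, n1 = len(x0), len(x1)
m0, m1 = x0.mean(), x1.mean()
s2 = x.var()
score = np.log(n1 / n0) - (m1 - m0) ** 2 / (2 * s2) + x * (m1 - m0) / s2
fisher_class = (score > 0).astype(int)   # assign to class 1 when the score is positive

print((fisher_class == y).mean())        # proportion of correct classifications
```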
4.5 Tree models

While linear and logistic regression methods produce a score and then possibly a classification according to a discriminant rule, tree models begin by producing a classification of the observations into groups and then obtain a score for each group. Tree models are usually divided into regression trees, when the response variable is continuous, and classification trees, when the response variable is quantitative discrete or qualitative (categorical, for short). However, since most concepts apply equally well to both, we do not distinguish between them here, unless otherwise specified.

Tree models can be defined as a recursive procedure through which a set of n statistical units is progressively divided into groups, according to a division rule that aims to maximise a homogeneity or purity measure of the response variable in each of the obtained groups. At each step of the procedure, a division rule is specified by the choice of an explanatory variable to split on and by the choice of a splitting rule for that variable, which establishes how to partition the observations. The main result of a tree model is a final partition of the observations; to achieve it, it is necessary to specify stopping criteria for the division process. Suppose that a final partition has been reached, consisting of g groups (g < n).
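To make the recursive partitioning idea concrete, here is a minimal sketch that grows a small classification tree with scikit-learn on a stock example dataset; the dataset, the depth limit and the library are assumptions made purely for illustration, not the book's own analysis.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

# A shallow classification tree: the recursive splits partition the n units into
# groups (the leaves), and each leaf is then assigned a fitted score/class.
X, y = load_breast_cancer(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

print(export_text(tree))            # the division rules chosen at each split
print(tree.predict_proba(X[:5]))    # scores of the groups the first units fall into
```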