              Parameter  Standard        Wald        Pr >  Standardised    Odds
Variable   DF  Estimate     Error  Chi-Square  Chi-Square      Estimate   Ratio
INTERCEPT   1   -1.2163    0.1510     64.9210      0.0001             .       .
VDPFLRAT    1    1.5038    0.0999    226.6170      0.0001      0.365878   4.498
ISLANDS     1   -0.2474    0.1311      3.5639      0.0590     -0.050827   0.781
SOUTH       1   -0.4347    0.1222     12.6525      0.0004     -0.098742   0.647
CENTRE      1   -0.1371    0.1128      1.4777      0.2241     -0.033399   0.872
AGE51−89    1    0.4272    0.1312     10.6086      0.0011      0.095633   1.533
AGE36−50    1    0.7457    0.1070     48.5289      0.0001      0.205547   2.108
DIM−G       1   -0.0689    0.1335      0.2667      0.6055     -0.016728   0.933
DIM−M       1    0.1294    0.1172      1.2192      0.2695      0.035521   1.138
AGE15−35    0         0         .           .           .             .       .
SEX         1    0.1180    0.0936      1.5878      0.2076      0.030974   1.125
NORTH       0         0         .           .           .             .       .
DIM−P       0         0         .           .           .             .       .

AIC and SC in the first two rows of the table are model choice criteria related to the numerator of the deviance. Chapter 6 covers them in more detail. The second part of the table shows, for each of the parameters, the estimates obtained plus the relative standard errors, as well as the Wald statistic for hypothesis testing on each coefficient. Using the p-value corresponding to each statistic, we can deduce that at least four variables are not significant (at a significance level of 0.05, five variables are not significant, since their p-values are greater). Finally, the table shows the estimated odds ratios of the response variable with each explanatory variable. These estimates are derived using the estimated parameters, so they may differ from the odds ratios calculated during the exploratory phase (Section 4.4), which are based on a saturated model. The exploratory indexes are usually calculated marginally, whereas the present indexes take account of interactions among all variables. We now look at a model selection procedure to see whether the model can be further simplified. We can choose forward, backward or stepwise selection. In the normal linear model, these procedures are based on recursive application of the F test, but now they are based on deviance differences.
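Two of the quantities above can be reproduced by hand: the odds ratio column is simply exp(estimate), and nested models are compared through the chi-squared distribution of the deviance difference. A minimal sketch, in which the two deviances and degrees of freedom are illustrative made-up numbers, not values from this model:

```python
import math
from scipy.stats import chi2

# Odds ratio for VDPFLRAT: the exponential of its estimated coefficient
odds_ratio = math.exp(1.5038)          # ≈ 4.498, matching the table

# Deviance-difference test between two nested models
# (illustrative values, not taken from the fitted model above)
G2_simple, df_simple = 1200.5, 850     # deviance, residual df of simpler model
G2_complex, df_complex = 1190.2, 845   # deviance, residual df of larger model
D = G2_simple - G2_complex             # deviance difference
p_value = chi2.sf(D, df_simple - df_complex)
```

Under the null hypothesis that the extra parameters are zero, D is asymptotically chi-squared with, in this sketch, 5 degrees of freedom.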
If the p-value for this difference is 'large', the simpler model will be chosen; if it is small, the more complex model will be chosen. The procedure stops when no further change produces a significant change in the deviance. Table 5.7 shows the results obtained with a forward procedure (generically known as 'stepwise' by the software). It reports, for every variable, the value of Rao's score statistic, in order to show the incremental importance of each inserted variable. The procedure stops after the insertion of five variables. Here they are in order of insertion: Vdpflrat, age15−35, north, age51−89, south. No other variable is retained at a significance level of α = 0.15, the software default. Table 5.7 also reports the parameter estimates for the five variables selected in the final model, with the corresponding Wald statistics. Now no variable appears to be non-significant at a significance level of 0.05. The variable Vdpflrat indicates whether or not the price of the first purchase is paid in instalments; it is decisively estimated to be the variable most associated with the response variable. For large samples, stepwise selection procedures, like the one we have just applied, might lead to high instability of the results. The forward and backward approaches may even lead to different final models. Therefore it is a good idea to consider other model selection procedures too; this is discussed in Chapter 6. Figure 5.5 presents a final diagnostic of the model, through analysis of the deviance residuals. It turns out that the standardised residuals behave quite well, lying in the interval [−2, +2]. But notice a slight decreasing tendency of the residuals (as opposed to being distributed around a constant line). This indicates a possible underestimation of observations with high success probability. For more details on residual analysis see Weisberg (1985).

Table 5.7 Results of forward procedure.
Summary of Stepwise Procedure

        Variable             Number       Score        Wald        Pr >
Step    Entered   Removed        In  Chi-Square  Chi-Square  Chi-Square
1       VDPFLRAT                  1    234.7              .      0.0001
2       AGE15−35                  2     45.0708           .      0.0001
3       NORTH                     3      9.4252           .      0.0021
4       AGE51−89                  4      7.4656           .      0.0063
5       SOUTH                     5      4.4325           .      0.0353

Analysis of Maximum Likelihood Estimates

              Parameter  Standard        Wald        Pr >  Standardized    Odds
Variable   DF  Estimate     Error  Chi-Square  Chi-Square      Estimate   Ratio
INTERCEPT   1   -0.5281    0.0811     42.3568      0.0001             .       .
VDPFLRAT    1    1.5022    0.0997    226.9094      0.0001      0.365499   4.492
SOUTH       1   -0.2464    0.1172      4.4246      0.0354     -0.055982   0.782
AGE51−89    1   -0.3132    0.1130      7.6883      0.0056     -0.070108   0.731
AGE15−35    1   -0.7551    0.1063     50.5103      0.0001     -0.186949   0.470
NORTH       1    0.2044    0.0989      4.2728      0.0387      0.053802   1.227

Figure 5.5 Residual analysis for the fitted logistic model.

5.5 Log-linear models

We can distinguish between symmetric and asymmetric generalised linear models. If the objective of the analysis is descriptive – to describe the associative structure among the variables – the model is called symmetric. If the variables are divided into two groups, response and explanatory – to predict the responses on the basis of the explanatory variables – the model is asymmetric. Asymmetric models we have seen are the normal linear model and the logistic regression model. We will now consider the most well-known symmetric model, the log-linear model. The log-linear model is typically used for analysing categorical data, organised in contingency tables. It represents an alternative way to express a joint probability distribution for the cells of a contingency table. Instead of listing all the cell probabilities, this distribution can be described using a more parsimonious expression given by the systematic component.
5.5.1 Construction of a log-linear model

We now show how a log-linear model can be built, starting from three different distributional assumptions about the absolute frequencies of a contingency table, corresponding to different sampling schemes for the data in the table. For simplicity but without loss of generality, we consider a two-way contingency table of dimensions I × J (I rows and J columns).

Scheme 1

The cell counts are independent random variables that are distributed according to a Poisson distribution. All the marginal counts, including the total number of observations n, are also random and distributed according to a Poisson distribution. As the natural parameter of a Poisson distribution with parameter m_ij is log(m_ij), the relationship that links the expected value of each cell frequency m_ij to the systematic component is

log(m_ij) = η_ij    for i = 1, ..., I and j = 1, ..., J

In the linear and logistic regression models, the total amount of information (which determines the degrees of freedom) is described by the number of observations of the response variable (indicated by n), but in the log-linear model this corresponds to the number of cells of the contingency table. In the estimation procedure, the expected frequencies will be replaced by the observed frequencies, and this will lead us to estimate the parameters of the systematic component. For an I × J table there are two variables in the systematic component. Let the levels of the two variables be indicated by x_i and x_j, for i = 1, ..., I and j = 1, ..., J. The systematic component can therefore be written as

η_ij = u + Σ_i u_i x_i + Σ_j u_j x_j + Σ_ij u_ij x_i x_j

This expression is called the log-linear expansion of the expected frequencies. The terms u_i and u_j describe the single effects of each variable, corresponding to the mean expected frequencies for each of their levels. The term u_ij describes the joint effect of the two variables on the expected frequencies.
The term u is a constant that corresponds to the mean expected frequency over all table cells.

Scheme 2

The total number of observations n is not random, but a fixed constant. This implies that the relative frequencies follow a multinomial distribution. Such a distribution generalises the binomial to the case where there are more than two alternative events for the considered variable. The expected values of the absolute frequencies for each cell are given by m_ij = nπ_ij. With n fixed, specifying a statistical model for the probabilities π_ij is equivalent to modelling the expected frequencies m_ij, as in Scheme 1.

Scheme 3

The marginal row (or column) frequencies are known. In this case it can be shown that the cell counts are distributed as a product of multinomial distributions. It is also possible to show that we can define a log-linear model in the same way as before.

Properties of the log-linear model

Besides being parsimonious, the log-linear model allows us to incorporate easily in the probability distribution constraints that specify independence relationships between variables. For example, using results introduced in Section 3.4, when two categorical variables are independent, the joint probability of each cell factorises as π_ij = π_i+ π_+j, for i = 1, ..., I and j = 1, ..., J. And the additive property of logarithms implies that

log(m_ij) = log n + log π_i+ + log π_+j

This describes a log-linear model of independence that is more parsimonious than the previous (saturated) model, which contains as many parameters as there are table cells. In general, to achieve a unique estimate on the basis of the observations, the number of terms in the log-linear expansion cannot be greater than the number of cells in the contingency table. This implies some constraints on the parameters of a log-linear model.
Known as identifiability constraints, they can be defined in different ways, but we will use a system of constraints that equates to zero all the u-terms that contain at least one index equal to the first level of a variable. This implies that, for a 2 × 2 table, relative to the binary variables A and B, with levels 0 and 1, the most complex possible log-linear model (the saturated model) is defined by

log(m_ij) = u + u^A_i + u^B_j + u^AB_ij

with constraints such that: u^A_i = 0 for i = 0 (i.e. if A = 0); u^B_j = 0 for j = 0 (i.e. if B = 0); u^AB_ij = 0 for i = 0 or j = 0 (i.e. if A = 0 or B = 0). The notation reveals that, in order to model the four cell frequencies in the table, there is a constant term, u; two main effects terms that each depend exclusively on one variable, u^A_i and u^B_j; and an interaction term that describes the association between the two variables, u^AB_ij. Therefore, following the stated constraints, the model establishes that the logarithms of the four expected cell frequencies are equal to the following expressions:

log(m_00) = u
log(m_10) = u + u^A_1
log(m_01) = u + u^B_1
log(m_11) = u + u^A_1 + u^B_1 + u^AB_11

5.5.2 Interpretation of a log-linear model

Logistic regression models with categorical explanatory variables (also called logit models) can be considered as a particular case of log-linear models. To clarify this point, consider a contingency table with three dimensions for variables A, B, C, with numbers of levels I, J, 2 respectively. Assume that C is the response variable of the logit model. A logit model is expressed by

log(m_ij1 / m_ij0) = α + β^A_i + β^B_j + β^AB_ij

All the explanatory variables of a logit model are categorical, so the effect of each variable (e.g. variable A) is indicated only by a coefficient (e.g. β^A_i) rather than by the product of a coefficient and a quantitative variable. Besides that, the logit model has been expressed in terms of the expected frequencies, rather than probabilities, as in the last section.
This is only a notational change, obtained by multiplying the numerator and the denominator by n. This expression is useful to show that the logit model which has C as response variable is obtained as the difference between the log-linear expansions of log(m_ij1) and log(m_ij0). Indeed the log-linear expansion for a contingency table with three dimensions I × J × 2 has the more general form

log(m_ijk) = u + u^A_i + u^B_j + u^C_k + u^AB_ij + u^AC_ik + u^BC_jk + u^ABC_ijk

Substituting and taking the difference between the logarithms of the expected frequencies for C = 1 and C = 0:

log(m_ij1) − log(m_ij0) = u^C_1 + u^AC_i1 + u^BC_j1 + u^ABC_ij1

In other words, the u-terms that do not depend on the variable C cancel out. All the remaining terms depend on C. By eliminating the symbol C from the superscript, the value 1 from the subscript and relabelling the u-terms using α and β, we arrive at the desired expression for the logit model. Therefore a logit model can be obtained from a log-linear model. The difference is that a log-linear model contains not only the terms that describe the association between the explanatory variables and the response – here the pairs AC, BC – but also the terms that describe the association between the explanatory variables – here the pair AB. Logit models do not model the association between the explanatory variables. We now consider the relationship between log-linear models and odds ratios. The logarithm of the odds ratio between two variables is equal to the sum of the interaction u-terms that contain both variables. It follows that if the considered log-linear expansion contains no u-terms involving both of the variables A and B, say, then we obtain θ^AB = 1; that is, the two variables are independent.
To illustrate this concept, consider a 2 × 2 table and the odds ratio between the binary variables A and B:

θ = θ_1 / θ_2 = (π_1|1 / π_0|1) / (π_1|0 / π_0|0) = (π_11 / π_01) / (π_10 / π_00) = (π_11 π_00) / (π_01 π_10)

Multiplying numerator and denominator by n² and taking logarithms:

log(θ) = log(m_11) + log(m_00) − log(m_10) − log(m_01)

Substituting for each frequency the corresponding log-linear expansion, we obtain log(θ) = u^AB_11. Therefore the odds ratio between the variables A and B is θ = exp(u^AB_11). These relations, which are very useful for data interpretation, depend on the identifiability constraints we have adopted. For example, if we had used the default constraints of the SAS software, we would have obtained the relationship θ = exp((1/4) u^AB_11). We have shown the relationship between the odds ratio and the parameters of a log-linear model for a 2 × 2 contingency table. This result is valid for contingency tables of higher dimension, provided the variables are binary and, as usually happens in a descriptive context, the log-linear expansion does not contain interaction terms between more than two variables.

5.5.3 Graphical log-linear models

A key instrument in understanding log-linear models, and graphical models in general, is the concept of conditional independence for a set of random variables; this extends the notion of statistical independence between two variables, seen in Section 3.4. Consider three random variables X, Y, and Z. X and Y are conditionally independent given Z if the joint probability distribution of X and Y, conditional on Z, can be decomposed into the product of two factors: the conditional density of X given Z and the conditional density of Y given Z. In formal terms, X and Y are conditionally independent given Z if

f(x, y | Z = z) = f(x | Z = z) f(y | Z = z)

and we write X ⊥ Y | Z. An alternative way of expressing this concept is that the conditional distribution of Y given both X and Z does not depend on X.
So, for example, if X is a binary variable and Z is a discrete variable, then for every z and y we have

f(y | X = 1, Z = z) = f(y | X = 0, Z = z) = f(y | Z = z)

The notion of (marginal) independence between two random variables (Section 3.4) can be obtained as a special case of conditional independence. As seen for marginal independence, conditional independence can simplify the expression and interpretation of log-linear models. In particular, it can be extremely useful in visualising the associative structure among all variables at hand, using so-called independence graphs. Indeed a subset of log-linear models, called graphical log-linear models, can be completely characterised in terms of conditional independence relationships and therefore graphs. For these models, each graph corresponds to a set of conditional independence constraints and each of these constraints can correspond to a particular log-linear expansion. The study of the relationship between conditional independence statements, represented in graphs, and log-linear models has its origins in the work of Darroch, Lauritzen and Speed (1980). We explain this relationship through an example. For a systematic treatment see Whittaker (1990), Edwards (1995), or Lauritzen (1996). I believe that the introduction of graphical log-linear models helps to explain the problem of model choice for log-linear models. Consider a contingency table of three dimensions, each one corresponding to a binary variable, so the total number of cells in the contingency table is 2³ = 8. The simplest graphical log-linear model for a three-way contingency table assumes that the logarithm of the expected frequency of every cell is

log(m_jkl) = u + u^A_j + u^B_k + u^C_l

This model does not contain interaction terms between variables, therefore the three variables are mutually independent.
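As a quick numerical illustration of this mutual-independence model, with made-up counts for a 2 × 2 × 2 table, the fitted frequencies are products of the one-way marginal proportions:

```python
import numpy as np

# Hypothetical observed counts for a 2x2x2 table (axes: A, B, C)
n = np.array([[[20., 10.], [15., 5.]],
              [[25., 12.], [18., 9.]]])
N = n.sum()

# One-way margins
nA = n.sum(axis=(1, 2))   # totals for each level of A
nB = n.sum(axis=(0, 2))
nC = n.sum(axis=(0, 1))

# Fitted frequencies under mutual independence:
# m_jkl = N * (nA_j / N) * (nB_k / N) * (nC_l / N)
m_hat = np.einsum('j,k,l->jkl', nA, nB, nC) / N**2
```

By construction the fit reproduces the one-way margins exactly, and every odds ratio computed from m_hat equals 1.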
In fact, the model can be expressed in terms of cell probabilities as π_jkl = π_j++ π_+k+ π_++l, where the symbol + indicates that the joint probabilities have been summed over all the values of the corresponding index. Note that, for this model, the three odds ratios between the variables – (A, B), (A, C), (B, C) – are all equal to 1. To identify the model in a unique way it is possible to use a list of the terms, called generators, that correspond to the maximal terms of interaction in the model. These terms are called maximal in the sense that their presence implies the presence of interaction terms between subsets of their variables. At the same time, their existence in the model is not implied by any other term. For the previous model of mutual independence, the generators are (A, B, C); they are the main effect terms, as there are no other terms in the model. To represent conditional independence statements graphically, we can use conditional independence graphs. These are built by associating a node with each variable and by placing a link (technically, an edge) to connect a pair of variables whenever the corresponding random variables are dependent. For the case of mutual independence we have described, there are no edges and therefore we obtain the representation in Figure 5.6. Consider now a more complex log-linear model among the three variables, described by the following log-linear expansion:

log(m_jkl) = u + u^A_j + u^B_k + u^C_l + u^AB_jk + u^AC_jl

In this case, since the maximal terms of interaction are u^AB_jk and u^AC_jl, the generators of the model will be (AB, AC). Notice that the model can be reformulated in terms of cell probabilities as

π_jkl = π_jk+ π_j+l / π_j++

or equivalently as

π_jkl / π_j++ = (π_jk+ / π_j++)(π_j+l / π_j++)

Figure 5.6 Conditional independence graph for mutual independence.
which, in terms of conditional independence, states that

P(B = k, C = l | A = j) = P(B = k | A = j) P(C = l | A = j)

This indicates that, in the conditional distribution (given A), B and C are independent. In other words, B ⊥ C | A. Therefore the conditional independence graph of the model is as shown in Figure 5.7. It can be demonstrated that, in this case, the odds ratios between all variable pairs are different from 1, whereas the two odds ratios for the two-way table between B and C, conditional on A, are both equal to 1. We finally consider the most complex (saturated) log-linear model for the three variables:

log(m_jkl) = u + u^A_j + u^B_k + u^C_l + u^AB_jk + u^AC_jl + u^BC_kl + u^ABC_jkl

which has (ABC) as generator. This model does not establish any conditional independence constraints on cell probabilities. Correspondingly, all odds ratios, marginal and conditional, will be different from 1. The corresponding conditional independence graph will be completely connected. The previous model (AB, AC) can be considered as a particular case of the saturated model, obtained by setting u^BC_kl = 0 for all k and l, and u^ABC_jkl = 0 for all j, k, l. Equivalently, it is obtained by removing the edge between B and C in the completely connected graph, which corresponds to imposing the constraint that B and C are conditionally independent given A. Notice that the mutual independence model is a particular case of the saturated model obtained by setting u^BC_kl = u^AC_jl = u^AB_jk = u^ABC_jkl = 0 for all j, k, l, or by removing all three edges in the complete graph. Consequently, the differences between log-linear models can be expressed in terms of differences between the parameters or as differences between graphical structures. I think it is easier to interpret differences between graphical structures. All the models in this example are graphical log-linear models.
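The model (AB, AC) has a closed-form fit, and the conditional independence B ⊥ C | A can be verified numerically. A sketch with made-up counts for binary A, B, C:

```python
import numpy as np

# Hypothetical observed counts, axes ordered (A, B, C), all binary
n = np.array([[[30., 10.], [12., 6.]],
              [[8., 16.], [5., 20.]]])

# Closed-form fit of the model with generators (AB, AC):
# m_jkl = n_jk+ * n_j+l / n_j++
n_AB = n.sum(axis=2)          # n_jk+
n_AC = n.sum(axis=1)          # n_j+l
n_A = n.sum(axis=(1, 2))      # n_j++
m_hat = n_AB[:, :, None] * n_AC[:, None, :] / n_A[:, None, None]

# Conditional odds ratio of B and C at each level of A
theta = (m_hat[:, 1, 1] * m_hat[:, 0, 0]) / (m_hat[:, 1, 0] * m_hat[:, 0, 1])
```

Both entries of theta equal 1, reflecting the two conditional odds ratios of B and C given A being 1 under this model.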
In general, graphical log-linear models are definable as log-linear models that have as generators the cliques of the conditional independence graph. A clique is a maximal completely connected subset of nodes in a graph. For example, in Figure 5.7 the subsets AB and AC are cliques, and they are the generators of the model. On the other hand, the subsets formed by the isolated nodes A, B and C are not cliques.

Figure 5.7 Conditional independence graph for B ⊥ C|A.

To better understand the concept of a graphical log-linear model, consider a non-graphical model for the trivariate case. Take the model described by the generator (AB, AC, BC):

log(m_jkl) = u + u^A_j + u^B_k + u^C_l + u^AB_jk + u^AC_jl + u^BC_kl

Although this model differs from the saturated model by the absence of the three-way interaction term u^ABC_jkl, its conditional independence graph is the same, with one single clique ABC. Therefore, since the model generator is different from the set of cliques, the model is not graphical. To conclude, in this section we have obtained a remarkable equivalence relation between conditional independence statements, graphical representations and probabilistic models, with the probabilistic models represented in terms of cell probabilities, log-linear models or sets of odds ratios.

5.5.4 Log-linear model comparison

For log-linear models, including graphical log-linear models, we can apply the inferential theory derived for generalised linear models. We now focus on model comparison. This is because the use of conditional independence graphs permits us to interpret model comparison and choice between log-linear models in terms of comparisons between sets of conditional independence constraints. In data mining problems the number of log-linear models to compare increases rapidly with the number of considered variables. Therefore a valid approach may be to restrict the class of models.
In particular, a parsimonious and efficient way to analyse large contingency tables is to consider interaction terms in the log-linear expansion that involve at most two variables. The log-linear models in the resulting class are all graphical. Therefore we obtain an equivalence relationship between the absence of an edge between two nodes, say i and j, conditional independence between the corresponding variables, Xi and Xj (given the remaining ones), and nullity of the interaction parameter indexed by both of them. As we saw with generalised linear models, the most important tool for comparing models is the deviance. All three sampling schemes for log-linear models lead to an equivalent expression for the deviance. Consider, for simplicity, a log-linear model to analyse three categorical variables. The deviance of a model M is

G²(M) = 2 Σ_jkl n_jkl log(n_jkl / m̂_jkl) = 2 Σ_i o_i log(o_i / e_i)

where m̂_jkl = n p̂_jkl, the p̂_jkl are the maximum likelihood estimates of the cell probabilities, the o_i are the observed cell frequencies and the e_i indicate the cell frequencies estimated according to the model M. Notice the similarity with the deviance expression for the logistic regression model. What changes is essentially the way in which the cell probabilities are estimated. In the general case of a p-dimensional table, the definition is the same but the index set changes:

G²(M0) = 2 Σ_{i∈I} n_i log(n_i / m̂_0i)

where, for a cell i belonging to the index set I, n_i is the frequency of observations in the ith cell and m̂_0i are the expected frequencies for the considered model M0. For model comparison, two nested models M0 and M1 can be compared using the difference between their deviances:

D = G²_0 − G²_1 = 2 Σ_{i∈I} n_i log(n_i / m̂_0i) − 2 Σ_{i∈I} n_i log(n_i / m̂_1i) = 2 Σ_{i∈I} n_i log(m̂_1i / m̂_0i)

As in the general case, under H0, D has an asymptotic chi-squared distribution whose degrees of freedom are obtained by taking the difference in the number of parameters for models M0 and M1.
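As a sketch, the deviance of the mutual-independence model against the saturated model can be computed directly on a hypothetical 2 × 2 × 2 table (the counts are made up; for the saturated model the fitted values equal the observed counts, so its deviance is zero):

```python
import numpy as np
from scipy.stats import chi2

def G2(observed, expected):
    """Deviance: 2 * sum of o_i * log(o_i / e_i) over cells with o_i > 0."""
    o, e = observed.ravel(), expected.ravel()
    mask = o > 0
    return 2.0 * np.sum(o[mask] * np.log(o[mask] / e[mask]))

# Hypothetical observed counts (axes: three binary variables)
n = np.array([[[20., 10.], [15., 5.]],
              [[25., 12.], [18., 9.]]])
N = n.sum()

# M0: mutual independence, fitted from the one-way margins
m0 = np.einsum('j,k,l->jkl',
               n.sum(axis=(1, 2)), n.sum(axis=(0, 2)), n.sum(axis=(0, 1))) / N**2

D = G2(n, m0) - 0.0   # saturated model has deviance 0
df = 8 - 4            # 8 cells minus 4 parameters (u and three main effects)
p_value = chi2.sf(D, df)
```

A large p-value here would favour the simpler independence model M0 over the saturated one.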
The search for the best log-linear model can be carried out using a forward, backward or stepwise procedure. For graphical log-linear models we can also try adding or removing edges between variables rather than adding or removing interaction parameters. In the backward procedure we compare the deviances of models that differ by the presence of an edge, and at each step we eliminate the least significant edge; the procedure stops when no edge removal produces a p-value greater than the chosen significance level (e.g. 0.05). In the forward procedure we add the most significant edges one at a time, until no edge addition produces a p-value lower than the chosen significance level.

5.5.5 Application

We can use a log-linear model to determine the associative structure among variables in a credit risk evaluation problem. The considered sample is made up of 8263 small and medium-sized Italian enterprises. The considered variables are: A, a binary qualitative variable indicating whether the considered enterprise is deemed reliable (Good) or not (Bad); B, a qualitative variable with 4 levels that describes the age of the enterprise, measured from the year of its constitution; C, a qualitative variable with 3 levels that describes the legal status of the enterprise; D, a qualitative variable with 7 levels that describes the macroeconomic sector of activity of the enterprise; and E, a qualitative variable with 5 levels that describes the geographic area of residence of the enterprise. Therefore the data is classified in a contingency table of five dimensions and the total number of cells is 2 × 4 × 3 × 7 × 5 = 840. The objective of the analysis is to determine the associative structure present among the five variables. In the absence of a clear preliminary hypothesis on the associative structure, we will use a backward procedure for model comparison. Given the small number of variables, we can consider all log-linear models, including non-graphical models.
The first model to be fitted is the saturated model, which contains 840 parameters, equal to the number of cells. Here is the corresponding log-linear expansion, embodying the identifiability constraints described earlier:

log(m^ABCDE_ijklm) = u + u^A_i + u^B_j + u^C_k + u^D_l + u^E_m
  + u^AB_ij + u^AC_ik + u^AD_il + u^AE_im + u^BC_jk + u^BD_jl + u^BE_jm + u^CD_kl + u^CE_km + u^DE_lm
  + u^ABC_ijk + u^ABD_ijl + u^ABE_ijm + u^ACD_ikl + u^ACE_ikm + u^ADE_ilm + u^BCD_jkl + u^BCE_jkm + u^BDE_jlm + u^CDE_klm
  + u^ABCD_ijkl + u^ABCE_ijkm + u^ABDE_ijlm + u^ACDE_iklm + u^BCDE_jklm
  + u^ABCDE_ijklm

Notice that the saturated model contains interaction terms of different orders: for example, the constant and the main effects (first row), the terms of order 2 (second row) and the single term of order 5 (last row). The backward strategy starts by comparing the saturated model with a simpler model that omits the interaction term of order 5. At a significance level of 5% the simpler model is preferred, as the p-value for the deviance difference is 0.9946. We then look at interaction terms of order 4, removing them one at a time to find the simpler model in each comparison. We continue through the terms and down the orders until we can achieve no more simplification at our chosen significance level of 5%. The final model contains the constant term, the main effects and the interactions

AB, AC, BC, AD, AE, BE, ABC

that is, six interactions of order 2 and one interaction of order 3. These interaction terms can be described by the generators (AD, AE, BE, ABC). Figure 5.8 shows the conditional independence graph for the final model.

Figure 5.8 Conditional independence graph for the final selected model.

Notice that Figure 5.8 contains three cliques: ABC, ABE and AD. Since the cliques of the graph do not coincide with the generators of the log-linear model, the final model is not graphical. The log-linear model without the order 3 interaction term ABC would have the generators (AB, AC, BC, AD, AE, BE) and would be graphical. But on the basis of the deviance difference, we need to
include the order 3 interaction. The model could be converted into a logistic regression model for variable A with respect to all the others. Then the logistic regression model would also have to contain the explanatory variable B ∗ C, a multiplicative term that describes the joint effect of B and C on A.

5.6 Graphical models

Graphical models are models that can be specified directly through conditional independence relationships among the variables, represented in a graph. Although the use of graphs with statistical models is not a new idea, the work of Darroch, Lauritzen and Speed (1980) has combined the two concepts in a new and illuminating way. They showed that a subset of the log-linear models, the log-linear graphical models, can be easily interpreted in terms of conditional independence relationships. This finding has led to the development of a wide class of statistical models, known as graphical models, which give us considerable flexibility in specifying models to analyse databases of whatever dimension, containing both qualitative and quantitative variables, and admitting both symmetric and asymmetric relationships. Graphical models contain, as special cases, important classes of generalised linear models, such as the three seen in this book. For a detailed treatment see Whittaker (1990), Edwards (1995) or Lauritzen (1996). Here are some definitions we will need. A graph G = (V, E) is a structure consisting of a finite set V of vertices (nodes), which correspond to the variables present in the model, and a finite set E of edges between them. In general, the causal influence of one variable on another is indicated by a directed edge (shown using an arrow), while an undirected edge (shown using a line) represents a symmetric association. Figure 5.8 is an example of an undirected graph, containing only undirected edges.
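Such a graph can be stored simply as a vertex set and an edge set. A minimal sketch, in which the edge set is my reading of Figure 5.8 (the union of the cliques ABC, ABE and AD):

```python
from collections import deque

# Undirected graph G = (V, E); the edge set is an assumption read off
# the cliques ABC, ABE, AD reported for Figure 5.8 above
V = {"A", "B", "C", "D", "E"}
E = {("A", "B"), ("A", "C"), ("B", "C"), ("A", "E"), ("B", "E"), ("A", "D")}

adj = {v: set() for v in V}
for x, y in E:
    adj[x].add(y)
    adj[y].add(x)     # undirected: store the edge in both directions

def connected(graph):
    """True if there is a path between every pair of vertices (BFS)."""
    start = next(iter(graph))
    seen, queue = {start}, deque([start])
    while queue:
        for w in graph[queue.popleft()] - seen:
            seen.add(w)
            queue.append(w)
    return seen == set(graph)
```

For this edge set the graph is connected, A and C are adjacent, while C and D are not.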
Figure 5.9 is a directed graph for the same type of application, where we have introduced a new variable, X, which corresponds to the return on investments of the enterprises. We have made a distinction between vertices that represent categorical variables (empty circles) and vertices that represent continuous variables (filled circles).

Figure 5.9 Example of a directed graph.

Two vertices X and Y belonging to V are adjacent, written X ∼ Y, if they are connected by an undirected edge; that is, if both the pairs (X, Y) and (Y, X) belong to E. A node X is a parent of node Y, written X → Y, if they are connected by a directed edge from X to Y; that is, if (X, Y) ∈ E and (Y, X) ∉ E. A complete graph is a graph in which all pairs of vertices are connected by an edge. A sequence of vertices X0, ..., Xn such that Xi−1 ∼ Xi for i = 1, ..., n is called a path of length n. A graph is connected when there exists at least one path between each pair of vertices. These are only some of the properties we can define for graphical models, but they are sufficient to understand the probabilistic assumptions implied by a conditional independence graph. In general, a graphical model is a family of probability distributions that incorporates the rules of conditional independence described by a graph. The key to interpreting a graphical model is the relationship between the graph and the probability distribution of the variables. It is possible to distinguish three types of graph, related to the three main classes of graphical models:

• undirected graphs are used to model symmetric relations among the variables (Figure 5.8); they give rise to the symmetric graphical models;
• directed graphs are used to model asymmetric relations among the variables (Figure 5.9); they give rise to recursive graphical models, also known as probabilistic expert systems;
• chain graphs contain both undirected and directed edges, therefore they can model both symmetric and asymmetric relationships; they give rise to graphical chain models (Cox and Wermuth, 1996).

5.6.1 Symmetric graphical models

In symmetric graphical models, the probability distribution is Markovian with respect to the specified undirected graph. This is equivalent to imposing on the distribution a number of probabilistic constraints known as Markov properties. The constraints can be expressed in terms of conditional independence relationships. Here are two Markov properties and how to interpret them:

• For the pairwise Markov property, if two nodes are not adjacent in the fixed graph, the two corresponding random variables will be conditionally independent, given the others. On the other hand, if the specified probability distribution is such that X⊥Y | others, the edge between the nodes corresponding to X and Y has to be omitted from the graph.
• For the global Markov property, if two sets of variables, U and V, are graphically separated by a third set of variables, W, then it holds that U⊥V | W. For example, consider four discrete random variables, W, X, Y and Z, whose conditional independence relations are described by the graph in Figure 5.10, from which we have that W and Z are separated by X and Y, and that Y and Z are separated by W and X. A Markovian distribution with respect to the graph in Figure 5.10 has to satisfy the global Markov property and therefore it holds that W⊥Z|(X, Y) and Y⊥Z|(W, X).

Figure 5.10 Illustration of the global Markov property.

It is useful to distinguish three types of symmetric graphical models:

• Discrete graphical models coincide with log-linear graphical models and are used when all the available variables are categorical.
• Graphical Gaussian models are used when the joint distribution of all variables is multivariate Gaussian.
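Graphical separation, and hence the global Markov property, can be checked mechanically: U⊥V | W holds whenever removing the separator W disconnects U from V. Below is a minimal Python sketch; the helper name `separated` and the edge set used for Figure 5.10 (W–X, W–Y, X–Y, X–Z) are assumptions reconstructed from the independence statements in the text, not taken from the book.

```python
from collections import deque

def separated(edges, u_set, v_set, w_set):
    """True if every path from u_set to v_set passes through w_set,
    i.e. u_set and v_set are disconnected once w_set is removed."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    seen, queue = set(u_set), deque(u_set)
    while queue:
        node = queue.popleft()
        for nbr in adj.get(node, ()):
            if nbr in w_set or nbr in seen:
                continue
            if nbr in v_set:
                return False  # found a path avoiding the separator
            seen.add(nbr)
            queue.append(nbr)
    return True

# Assumed edge set for the graph of Figure 5.10.
edges = [("W", "X"), ("W", "Y"), ("X", "Y"), ("X", "Z")]
print(separated(edges, {"W"}, {"Z"}, {"X", "Y"}))  # True: W ⊥ Z | (X, Y)
print(separated(edges, {"Y"}, {"Z"}, {"W", "X"}))  # True: Y ⊥ Z | (W, X)
print(separated(edges, {"W"}, {"Z"}, set()))       # False: connected via X
```

With this edge set, Z touches the rest of the graph only through X, which is why both separations stated in the text hold.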
• Mixed graphical models are used for a mixture of categorical variables and multivariate Gaussian variables.

We have seen discrete graphical models in Section 5.5.3. A similar type of symmetric model, useful for descriptive data mining, can be introduced for continuous variables. An exhaustive description of these models can be found in Whittaker (1990), who has called them Gaussian graphical models even though they were previously known in the statistical literature as covariance selection models (Dempster, 1972). For these models, it is assumed that Y = (Y1, ..., Yq) is a vector of continuous variables with a multivariate normal distribution. Markovian properties allow us to show that two variables are conditionally independent, given all the others, if and only if the element corresponding to the two variables in the inverse of the variance–covariance matrix is null. This is equivalent to saying that the partial correlation coefficient between the two variables, given the others, is null. In terms of conditional independence graphs, given four variables X, Y, W, Z, if the elements kX,Z and kY,W of the inverse of the variance–covariance matrix were null, the edges between the nodes X and Z and the nodes Y and W would have to be absent. From a statistical viewpoint, a graphical Gaussian model and, equivalently, a graphical representation are selected by successively testing hypotheses of edge removal or addition. This is equivalent to testing whether the corresponding partial correlation coefficients are zero. Notice how the treatment of the continuous case is similar to the discrete case. This has allowed us to introduce a very general class of mixed symmetric graphical models. We now introduce them in a rather general way, including continuous and discrete graphical models as special cases. Let V = Δ ∪ Γ be the vertex set of a graph, partitioned into a set Γ of |Γ| continuous variables and a set Δ of |Δ| discrete variables.
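The equivalence between a null element of the inverse variance–covariance matrix and a null partial correlation can be verified numerically. The following NumPy sketch uses a made-up precision matrix K (an assumption for illustration only) for four variables (X, Y, W, Z), in which the (X, Z) entry is zero:

```python
import numpy as np

# Made-up precision (inverse variance-covariance) matrix for (X, Y, W, Z);
# the (X, Z) entry is zero, encoding X conditionally independent of Z.
K = np.array([[2.0, 0.5, 0.3, 0.0],
              [0.5, 2.0, 0.4, 0.2],
              [0.3, 0.4, 2.0, 0.6],
              [0.0, 0.2, 0.6, 2.0]])
Sigma = np.linalg.inv(K)  # the implied variance-covariance matrix

def partial_corr(K, i, j):
    """Partial correlation of variables i and j given all the others,
    computed directly from the precision matrix."""
    return -K[i, j] / np.sqrt(K[i, i] * K[j, j])

print(partial_corr(K, 0, 3))  # null partial correlation: drop the X-Z edge
print(Sigma[0, 3] != 0)       # True: marginally X and Z are still correlated
```

The last line illustrates the key distinction: the *marginal* covariance of X and Z is non-zero (they are connected through Y and W), yet the edge is absent because the *conditional* association vanishes.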
If to each vertex v is associated a random variable Xv, the whole graph is associated with a random vector XV = (Xv, v ∈ V). A mixed graphical model is defined by a conditional Gaussian distribution for the vector XV. Partition XV into a vector XΔ containing the categorical variables and a vector XΓ containing the continuous variables. Then XV follows a conditional Gaussian distribution if it satisfies these two conditions:

• p(i) = P(XΔ = i) > 0;
• p(XΓ | XΔ = i) = N|Γ|(ξ(i), Σ(i));

where the symbol N indicates a Gaussian distribution of dimension |Γ|, with mean vector ξ(i) = K(i)−1 h(i) and variance–covariance matrix Σ(i) = K(i)−1, positive definite. In words, a random vector is distributed as a conditional Gaussian if the distribution of the categorical variables is described by a set of positive cell probabilities (this could happen through the specification of a log-linear model) and the continuous variables are distributed, conditional on each joint level of the categorical variables, as a Gaussian distribution with a mean vector and a variance–covariance matrix that can, in general, depend on the levels of the categorical variables. From a probabilistic viewpoint, a symmetric graphical model is specified by a graph and a family of probability distributions which has Markov properties with respect to it. However, to use graphical models in real applications, it is necessary to completely specify the probability distribution, usually by estimating the unknown parameters on the basis of the data. This inferential task, usually accomplished by maximum likelihood estimation, is called quantitative learning. Furthermore, in data mining problems it is difficult to avoid uncertainty when specifying a graphical structure, so alternative graphical representations have to be compared and selected, again on the basis of the available data; this constitutes the so-called structural learning task, usually tackled by deviance-based statistical tests.
To demonstrate this approach, we can return to the European software industry application in Section 4.6 and try to describe the associative structure among all seven considered random variables. The graph in Figure 5.11 is based on hypotheses formulated through subject matter research by industrial economics experts; it shows conditional independence relationships between the available variables. One objective of the analysis is to verify whether the graph in Figure 5.11 can be simplified, maintaining a good fit to the data (structural learning). Another objective is to verify some research hypotheses on the sign of the association between some variables (quantitative learning). We begin by assuming a probability distribution of conditional Gaussian type and, given the reduced sample size (51 observations), a homogeneous model (Lauritzen, 1996). A homogeneous model means we assume the variance of the continuous variable does not depend on the level of the qualitative variables. So that we can measure explicitly the effect of the continuous variable Y on the qualitative variables, we have decided to maintain, in all considered models, a link between Y and the qualitative variables, even when it is not significant on the basis of the data. The symmetric model for the complete graph will therefore contain a total of 129 parameters.

Figure 5.11 Initial conditional independence graph.

It is opportune to start the selection from the initial research graph (Figure 5.11). Since the conditional Gaussian distribution has to be Markovian with respect to this graph, all the parameters containing the pairs {M, A}, {N, I}, {M, N}, {M, I}, {A, N}, {A, I} have to be 0, hence the total number of parameters in the model corresponding to Figure 5.11 is 29. Considering the low number of available observations, this model is clearly overparameterised.
A very important characteristic of graphical models is that they permit local calculations on each clique of the graph (Frydenberg and Lauritzen, 1989). For instance, as the above model can be decomposed into 4 cliques, it is possible to estimate the parameters separately, on each clique, using the 51 available observations to estimate the 17 parameters of each marginal model. In fact, on the basis of a backward selection procedure using a significance level of 5%, Giudici and Carota (1992) obtained the final structural model shown in Figure 5.12. From the figure we deduce that the only direct significant associations between qualitative variables are between the pairs {H, I}, {N, S} and {N, H}. These associations depend on the revenue Y but not on the remaining residual variables.

Figure 5.12 Final conditional independence graph.

Concerning quantitative learning, the same authors have used their final model to calculate the odds ratios between the qualitative variables, conditional on the level of Y. They obtained the following estimated conditional odds ratios, relative to the pairs {H, I}, {N, S} and {N, H}:

θ̂IH|R = exp(0.278 + 0.139R), therefore θ̂IH|R > 1 for R > 0.135 (all enterprises)
θ̂NH|R = exp(−2.829 + 0.356R), therefore θ̂NH|R > 1 for R > 2856 (one enterprise)
θ̂NS|R = exp(−0.827 − 0.263R), therefore θ̂NS|R > 1 for R < 23.21 (23 enterprises)

The signs of the association can be summarised as follows: the association between I and H is positive; the association between N and H is substantially negative; the association between N and S is positive only for enterprises having revenues less than the median. From an economic viewpoint, these associations have a simple interpretation. The relationship between I and H confirms that enterprises which adopt a strategy of incremental innovations tend to increase their contacts with enterprises in the hardware sector. The strategy of creating radically new products is based on an opposite view.
Looking at contacts exclusively within the software sector, small enterprises (having revenues less than the median) tend to fear their innovations could be stolen or imitated, and they tend not to make contacts with other small companies. Large companies (having revenues greater than the median) do not fear imitations and tend to increase their contacts with other companies.

5.6.2 Recursive graphical models

Recursive graphical models, also known as probabilistic expert systems, can be considered as an important and sophisticated tool for predictive data mining. Their fundamental assumption is that the variables can be partially ordered, so that every variable is logically preceded by a set of others. This precedence can be interpreted as a probabilistic dependency and, more strongly, as a causal dependency. Both interpretations exist in the field of probabilistic expert systems and this is reflected in the terminology: causal network if there is a causal interpretation, belief network if there is no causal interpretation. To specify any recursive model, we need to specify a directed graph that establishes the (causal) relationships among the variables. Once this graph is specified, a recursive model is obtained by using a probability distribution that is Markov with respect to the graph (e.g. Lauritzen, 1996). The Markov properties include the following factorisation property of the probability distribution:

f(x1, ..., xp) = ∏(i = 1, ..., p) f(xi | pa(xi))

where pa(xi) indicates the parent nodes of each of the p considered nodes. This specifies that the probability distribution of the p variables is factorised into a series of local terms, each of which describes the dependency of one of the considered variables, xi, on its set of relevant explanatory variables, pa(xi).

Figure 5.13 Example of a directed acyclic graph.
This is a constructive way to specify a directed graphical model: a (recursive) sequence of asymmetric models, each of which describes the predictor set of one variable. The conditional independence graphs are therefore directed, because the edges represent ordered pairs of vertices. They are also constrained to be acyclic: no sequence of connected vertices may form a loop. Figure 5.13 is an example of an acyclic directed conditional independence graph. It states, for instance, that V3⊥V2|V4, and it corresponds to a recursive model that can be specified, for instance, by the following factorisation:

p(V1, V2, V3, V4) = p(V1) p(V2|V1) p(V4|V1, V2) p(V3|V4)

When it comes to specifying the local probability distributions (e.g. the distributions of V1, V2, V3 and V4), we could specify a recursive model as a recursion of generalised linear models. For example, Figure 5.13 corresponds to a model defined by a linear regression of V3 on the explanatory variable V4, of V4 on the explanatory variables V1 and V2, and of V2 on the explanatory variable V1. From a predictive data mining perspective, the advantage of recursive models is that they are simple to specify, interpret and summarise, thanks to the factorisation of the model into many local models of reduced dimensions. On the other hand, they do involve more sophisticated statistical thinking, especially in the context of structural learning, and few mainstream statistical packages implement these types of model. Chapter 8 considers an application of directed graphical models using a recursive logistic model. Directed graphical models are typically used in artificial intelligence applications, where they are known as probabilistic expert systems or Bayesian networks (e.g. Heckerman, 1997). Another important special case of directed graphical models is the Markov chain model, which is particularly useful for modelling time series data.
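The factorisation for Figure 5.13 can be made concrete with binary variables. In the sketch below, all conditional probability tables are invented numbers used purely for illustration; the point is that the four local terms multiply to a proper joint distribution:

```python
from itertools import product

# Hypothetical conditional probability tables for binary V1, V2, V3, V4
# (all numbers are made up for illustration).
p_v1 = {0: 0.6, 1: 0.4}                                        # p(V1)
p_v2_given_v1 = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}     # p(V2|V1)
p_v4_given_v12 = {(0, 0): {0: 0.9, 1: 0.1}, (0, 1): {0: 0.5, 1: 0.5},
                  (1, 0): {0: 0.4, 1: 0.6}, (1, 1): {0: 0.1, 1: 0.9}}
p_v3_given_v4 = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}     # p(V3|V4)

def joint(v1, v2, v3, v4):
    """p(V1,V2,V3,V4) = p(V1) p(V2|V1) p(V4|V1,V2) p(V3|V4)."""
    return (p_v1[v1] * p_v2_given_v1[v1][v2]
            * p_v4_given_v12[(v1, v2)][v4] * p_v3_given_v4[v4][v3])

# The local terms define a proper joint distribution over the 16 cells:
total = sum(joint(*cfg) for cfg in product([0, 1], repeat=4))
print(round(total, 10))  # 1.0
```

Each local table can equally well come from a fitted generalised linear model (e.g. a logistic regression of V4 on V1 and V2), which is exactly the recursion described above.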
In particular, first-order Markov models are characterised by a short memory: they assume that, given the present state, the next state is independent of all the earlier ones, that is, Xi+1⊥(X1, ..., Xi−1) | Xi.

Table 7.6

Parameter  Estimate  Standard Error  Chi-Square  Pr > ChiSq
TIN-MEAT  −1.6186  0.2206  53.85  <.0001
MOZZAR  −0.0100  0.1320  0.01  0.9396
TIN-MEAT*MOZZAR  0.6607  0.0660  100.31  <.0001
TUNNY  −0.3920  0.0635  38.07  <.0001
TIN-MEAT*TUNNY  0.3994  0.0344  134.72  <.0001
MOZZAR*TUNNY  0.1483  0.0290  26.17  <.0001
COKE  −0.1740  0.0750  5.38  0.0203
TIN-MEAT*COKE  0.2215  0.0501  19.58  <.0001
MOZZAR*COKE  0.0769  0.0326  5.58  0.0182
TUNNY*COKE  0.0592  0.0117  25.63  <.0001
CRACKERS  −0.2079  0.1228  2.87  0.0904
TIN-MEAT*CRACKERS  0.4715  0.0768  37.71  <.0001
MOZZAR*CRACKERS  0.1389  0.0616  5.08  0.0242
TUNNY*CRACKERS  0.1504  0.0188  63.80  <.0001
COKE*CRACKERS  0.1068  0.0199  28.68  <.0001
PASTA  −0.2935  0.0516  32.35  <.0001
TIN-MEAT*PASTA  0.0294  0.0346  0.72  0.3957
MOZZAR*PASTA  0.00751  0.0206  0.13  0.7156
TUNNY*PASTA  0.0872  0.00796  120.20  <.0001
COKE*PASTA  0.0267  0.00805  11.01  0.0009
CRACKERS*PASTA  0.0219  0.0144  2.30  0.1291
JUICES  −0.3191  0.0807  15.62  <.0001
TIN-MEAT*JUICES  0.2942  0.0543  29.32  <.0001
MOZZAR*JUICES  0.1089  0.0347  9.84  0.0017
TUNNY*JUICES  0.0879  0.0126  48.95  <.0001
COKE*JUICES  0.1238  0.0119  107.57  <.0001
CRACKERS*JUICES  0.1683  0.0200  70.84  <.0001
PASTA*JUICES  0.0304  0.00901  11.41  0.0007
OIL  0.0318  0.1195  0.07  0.7902
TIN-MEAT*OIL  0.4508  0.0770  34.28  <.0001
MOZZAR*OIL  0.1343  0.0569  5.57  0.0183
TUNNY*OIL  0.1219  0.0180  45.90  <.0001
COKE*OIL  0.0466  0.0204  5.20  0.0226
CRACKERS*OIL  0.1644  0.0361  20.76  <.0001
PASTA*OIL  0.0792  0.0131  36.54  <.0001
JUICES*OIL  0.0630  0.0230  7.52  0.0061
TOMATO-J  −0.1715  0.0712  5.80  0.0160
TIN-MEAT*TOMATO-J  0.2314  0.0469  24.34  <.0001
MOZZAR*TOMATO-J  0.1121  0.0284  15.62  <.0001
TUNNY*TOMATO-J  0.0605  0.0112  29.23  <.0001
COKE*TOMATO-J  0.0958  0.0108  78.92  <.0001
CRACKERS*TOMATO-J  0.0589  0.0211  7.77  0.0053
PASTA*TOMATO-J  0.1887  0.00747  637.43  <.0001
JUICES*TOMATO-J  0.0831  0.0122  46.54  <.0001
OIL*TOMATO-J  0.1780  0.0163  119.22  <.0001
BRIOCHES  −0.2412  0.0620  15.14  0.0001
TIN-MEAT*BRIOCHES  0.1530  0.0414  13.64  0.0002
MOZZAR*BRIOCHES  0.0955  0.0254  14.17  0.0002
TUNNY*BRIOCHES  0.0774  0.00995  60.57  <.0001
COKE*BRIOCHES  0.0965  0.00966  99.71  <.0001
CRACKERS*BRIOCHES  0.1860  0.0156  141.60  <.0001
PASTA*BRIOCHES  0.0343  0.00689  24.81  <.0001
JUICES*BRIOCHES  0.2342  0.00962  592.76  <.0001
OIL*BRIOCHES  0.0251  0.0176  2.02  0.1552
TOMATO-J*BRIOCHES  0.0608  0.00967  39.54  <.0001
BEER  −0.0287  0.0742  0.15  0.6987
TIN-MEAT*BEER  0.2098  0.0462  20.62  <.0001
MOZZAR*BEER  0.0700  0.0333  4.42  0.0356
TUNNY*BEER  0.0864  0.0113  58.89  <.0001
COKE*BEER  0.2415  0.00965  626.64  <.0001
CRACKERS*BEER  0.0721  0.0210  11.76  0.0006
PASTA*BEER  0.00755  0.00802  0.89  0.3464
JUICES*BEER  0.1201  0.0119  102.14  <.0001
OIL*BEER  0.0805  0.0192  17.61  <.0001
TOMATO-J*BEER  0.0602  0.0111  29.56  <.0001
BRIOCHES*BEER  0.0621  0.00985  39.83  <.0001
FROZ-VEG  −0.2247  0.0704  10.18  0.0014
TIN-MEAT*FROZ-VEG  0.1938  0.0490  15.64  <.0001
MOZZAR*FROZ-VEG  0.1211  0.0276  19.25  <.0001
TUNNY*FROZ-VEG  0.0634  0.0114  31.14  <.0001
COKE*FROZ-VEG  0.0398  0.0116  11.75  0.0006
CRACKERS*FROZ-VEG  0.0630  0.0214  8.70  0.0032
PASTA*FROZ-VEG  0.0381  0.00773  24.30  <.0001
JUICES*FROZ-VEG  0.0496  0.0129  14.76  0.0001
OIL*FROZ-VEG  0.0720  0.0188  14.59  0.0001
TOMATO-J*FROZ-VEG  0.0847  0.0106  63.29  <.0001
BRIOCHES*FROZ-VEG  0.0406  0.00993  16.75  <.0001
BEER*FROZ-VEG  0.0224  0.0118  3.61  0.0575
RICE  −0.2743  0.0840  10.67  0.0011
TIN-MEAT*RICE  0.2987  0.0514  33.83  <.0001
MOZZAR*RICE  0.1887  0.0355  28.34  <.0001
TUNNY*RICE  0.1460  0.0131  124.67  <.0001
COKE*RICE  0.0626  0.0149  17.65  <.0001
CRACKERS*RICE  0.1909  0.0235  65.96  <.0001
PASTA*RICE  0.1481  0.00975  231.01  <.0001
JUICES*RICE  0.1024  0.0155  43.40  <.0001
OIL*RICE  0.1225  0.0237  26.75  <.0001
TOMATO-J*RICE  0.1090  0.0129  70.90  <.0001
BRIOCHES*RICE  0.0228  0.0128  3.15  0.0759
BEER*RICE  0.0362  0.0151  5.74  0.0166
FROZ-VEG*RICE  0.0949  0.0136  49.02  <.0001
F-FISH  −0.0494  0.1337  0.14  0.7119
TIN-MEAT*F-FISH  0.4792  0.0894  28.74  <.0001
MOZZAR*F-FISH  0.2417  0.0527  21.04  <.0001
TUNNY*F-FISH  0.1034  0.0224  21.40  <.0001
COKE*F-FISH  0.0504  0.0258  3.82  0.0507
CRACKERS*F-FISH  0.1047  0.0494  4.48  0.0342
PASTA*F-FISH  0.0536  0.0156  11.78  0.0006
JUICES*F-FISH  0.1032  0.0274  14.22  0.0002
OIL*F-FISH  0.1232  0.0419  8.66  0.0033
TOMATO-J*F-FISH  0.0750  0.0221  11.49  0.0007
BRIOCHES*F-FISH  0.0545  0.0207  6.92  0.0085
BEER*F-FISH  0.0735  0.0243  9.16  0.0025
FROZ-VEG*F-FISH  0.2954  0.0169  305.75  <.0001
RICE*F-FISH  0.1711  0.0262  42.64  <.0001
ICECREAM  −0.4074  0.1882  4.68  0.0304
TIN-MEAT*ICECREAM  0.6214  0.1579  15.49  <.0001
MOZZAR*ICECREAM  0.1597  0.0828  3.73  0.0536
TUNNY*ICECREAM  0.1106  0.0293  14.28  0.0002
COKE*ICECREAM  0.2095  0.0235  79.35  <.0001
CRACKERS*ICECREAM  0.2912  0.0417  48.72  <.0001
PASTA*ICECREAM  −0.00983  0.0200  0.24  0.6233
JUICES*ICECREAM  0.2335  0.0255  83.69  <.0001
OIL*ICECREAM  0.1632  0.0534  9.33  0.0023
TOMATO-J*ICECREAM  0.0961  0.0286  11.31  0.0008
BRIOCHES*ICECREAM  0.1393  0.0220  40.05  <.0001
BEER*ICECREAM  0.1133  0.0278  16.57  <.0001
FROZ-VEG*ICECREAM  0.2202  0.0240  84.07  <.0001
RICE*ICECREAM  0.1967  0.0347  32.13  <.0001
F-FISH*ICECREAM  0.1872  0.0560  11.18  0.0008

The results in Table 7.6 were obtained using the SAS procedure CATMOD. To compare Table 7.6 with Table 7.5, recall that the odds ratio between each pair of variables can be obtained by exponentiating a value which is about four times the estimated interaction parameter. Therefore the threshold odds ratio of 2, adopted in Table 7.5, corresponds to a threshold value of the interaction parameter equal to about 0.1732. We can thus select the interactions that pass this threshold and consider the corresponding pair as being (strongly) positively associated.
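The correspondence between an interaction parameter and an odds ratio, stated above as exponentiating about four times the estimate, can be checked in a couple of lines. This is a sketch assuming the effect coding described in the text (log odds ratio ≈ 4 × interaction estimate):

```python
import math

def odds_ratio_from_interaction(u):
    """Approximate odds ratio implied by an effect-coded interaction
    parameter u, under log(odds ratio) ≈ 4u as described in the text."""
    return math.exp(4 * u)

print(round(odds_ratio_from_interaction(0.1732), 2))  # ≈ 2.0, the threshold
print(odds_ratio_from_interaction(0.6607))            # TIN-MEAT*MOZZAR estimate
```

Scanning Table 7.6 for interaction estimates above 0.1732 therefore picks out exactly the pairs with an implied odds ratio above 2.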
From Table 7.6 it turns out that all interactions found in Table 7.5 remain strongly significant, except for (rice, pasta), (brioches, ice cream) and (crackers, juices), which have an estimated odds ratio slightly lower than 2. Furthermore, there are 14 more (strongly) positive associations: 9 of them concern tinned meat, associated with Coke, crackers, juices, oil, tomato, beer, frozen vegetables, frozen fish and ice cream; 3 of them concern ice cream, associated with frozen vegetables, rice and frozen fish; the final 2 are (mozzarella, rice) and (crackers, rice). The difference with Table 7.5 is that we have now taken into account conditional dependences between the variables; as all variables are positively associated, more interactions have been found significant. Table 7.6 also reveals that no significant negative interactions are found.

7.4.2 Association rules

The most common way to analyse market basket data is to use association rules, a local data mining method explained in Section 4.8. We begin with a simple setting. Consider the products ice cream and Coke. As order is not relevant, to study the association between the two products, the data set can be collapsed to the two-way contingency table in Table 7.4. This shows that the support for the rule 'if ice cream, then Coke' is

support(ice cream → Coke) = 170/46 727 = 0.0036

indicating low support for the rule. This means these two products are bought together only occasionally. The support corresponds to only one of the four joint frequencies in Table 7.4, the one corresponding to the occurrence of both buying events. A support of 0.0036 means that only 0.36% of the transactions considered have both ice cream and Coke in the basket. The support of an association rule is symmetric; the support of the rule 'if Coke, then ice cream' would be the same.
The confidence of a rule, even when calculated for an association where order does not matter, depends on the body and the head of the rule:

confidence(ice cream → Coke) = 170/769 = 0.22

which corresponds to the conditional frequency of Coke = 1 in the second row, and

confidence(Coke → ice cream) = 170/4949 = 0.034

which corresponds to the conditional frequency of ice cream = 1 in the second column. The confidence is really a particular conditional frequency. In the first case it indicates the proportion, among those that buy ice cream, of those that also buy Coke. In the second case it indicates the proportion, among those that buy Coke, of those that also buy ice cream. The lift is a normalised measure of interestingness; it is also symmetric:

lift(ice cream → Coke) = 0.22/0.11 = 2
lift(Coke → ice cream) = 0.034/0.017 = 2

This is always the case, as can be seen from the formula in Section 4.8. Section 4.8 goes on to derive an asymptotic confidence interval for the lift. Here the asymptotic confidence interval goes from 1.17 to 3.40, so the association can be considered significant. Notice that the odds ratio between the two products was calculated as 2.44, a rather similar value (and also with a significant confidence interval). The main difference is that the odds ratio depends explicitly on all four cell frequencies of a contingency table, whereas the lift is the ratio between the frequency of the level (A = 1, B = 1) and the product of the two marginal frequencies, (A = 1) and (B = 1), so it depends only implicitly on the frequencies of the complementary events (A = 0, B = 0). In any case the support of the considered rule is rather limited – ice cream and Coke are present in only 0.36% of all transactions – therefore conclusions based on it may not be of much practical value, even when supported by a high confidence and/or a high lift value. But this conclusion is relative; it depends on the support of other rules.
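All three measures can be reproduced directly from the four counts behind Table 7.4, as in this short Python sketch (the counts are those reported in the text):

```python
# Counts reported in the text: 46 727 transactions in total,
# 769 containing ice cream, 4 949 containing Coke, 170 containing both.
n_total, n_icecream, n_coke, n_both = 46727, 769, 4949, 170

support = n_both / n_total                 # symmetric in the two products
conf_ic_to_coke = n_both / n_icecream      # P(Coke = 1 | ice cream = 1)
conf_coke_to_ic = n_both / n_coke          # P(ice cream = 1 | Coke = 1)
# Lift: joint frequency over the product of the marginal frequencies;
# also symmetric, as the text notes.
lift = support / ((n_icecream / n_total) * (n_coke / n_total))

print(round(support, 4))           # 0.0036
print(round(conf_ic_to_coke, 2))   # 0.22
print(round(conf_coke_to_ic, 3))   # 0.034
print(round(lift, 2))              # 2.09, i.e. about 2
```

The computed lift of about 2.09 matches the rounded value of 2 quoted in the text for both directions of the rule.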
To discover this and obtain a more comprehensive picture of the interesting association rules, we now move to a full application of association rule modelling. The Apriori algorithm, with a threshold support of 0.05 × support(mode), where the mode is the rule with maximum support among all rules of a fixed order, leads to the selection of several relevant rules. Table 7.7 presents the order 2 association rules with the highest support out of the 190 possible rules. For each rule it shows the lift, the support and the confidence, as well as the transaction count, which is the absolute frequency of the rule. It turns out that the (ice cream, Coke) rule does not appear among the most frequent. Other rules have a higher support, and the rule with the highest support is milk → pasta, which appears in almost 50% of the transactions. This is followed by biscuits → pasta, milk → biscuits, water → pasta and milk → water, all occurring in about 39% of the transactions.

Table 7.7 Association rules with highest support.

As Table 7.7 contains the rules with the highest support, and as support is symmetric, it is obvious that reciprocal rules are always adjacent to each other. Table 7.8 presents the order 2 association rules with the highest confidence out of the 190 possible rules.

Table 7.8 Association rules with highest confidence.

From Table 7.8 we can see, for example, that rice → pasta has a confidence equal to 90.18%. This means that if a transaction contains rice, it will also contain pasta about 90% of the time. On the other hand, pasta → rice is not among the rules in Table 7.8; it has a confidence of 36.18%. This can be interpreted as saying that if a transaction contains pasta, it will also contain rice only 36.18% of the time.
Notice that the confidence of pasta → rice can be obtained by multiplying the confidence of rice → pasta by a factor of 0.401, which corresponds to the ratio between the number of transactions containing rice (1822) and the number of transactions containing pasta (4541), as obtained from Table 7.1. More generally, notice that in Table 7.8 the head of the rule is always either pasta or milk, as these are the most supported products. Finally, to obtain a relative measure of a rule's interestingness, we can also consider the lift. Table 7.9 reports the rules with the highest lift out of the 190 possible rules.

Table 7.9 Association rules with highest lift.

Notice that frozen fish → mozzarella and ice cream → crackers come first, both with a lift of about 2.36. And notice that Coke → ice cream is well ranked. Table 7.9 can be usefully compared with the odds ratios in Table 7.5, which were used to build the exploratory graph in Figure 7.1. It turns out that similar products appear in both tables, except for pasta, which appears only in the odds ratios table. However, the pairs are sometimes different. Isolated products in Figure 7.1 – milk, biscuits, water, coffee and yoghurt – have a high support but a low lift, therefore they do not appear in Table 7.9. The rules we obtain depend on the measure of interestingness that we choose. It is interesting to consider the rules jointly in terms of support, confidence and lift. Figure 7.2 is a graphical representation of the rules with the highest support (i.e. the 28 rules in Table 7.7). Support decreases with distance up the vertical axis; confidence decreases with distance along the horizontal axis. The larger the volume of the geometric figure, the greater the lift. Figure 7.2 is a valuable joint visualisation of the rules that need to be selected. For instance, we can see that pasta → milk has the highest confidence of all the rules that have pasta in their body, but it has a low value of lift.
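This reversal requires no recounting of transactions: both confidences share the same numerator, the count of transactions containing both products, so conf(B → A) = conf(A → B) × n(A)/n(B). A quick sketch using the counts quoted above:

```python
# Since conf(A -> B) = n(A and B) / n(A), it follows that
# conf(B -> A) = conf(A -> B) * n(A) / n(B): no recounting needed.
n_rice, n_pasta = 1822, 4541        # transaction counts, as given in the text
conf_rice_to_pasta = 90.18          # per cent, from Table 7.8

conf_pasta_to_rice = conf_rice_to_pasta * n_rice / n_pasta
print(round(conf_pasta_to_rice, 2))  # 36.18, matching the value in the text
```

The factor n_rice / n_pasta is the 0.401 mentioned above.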
Now consider associations between more than two products. Table 7.10 shows the associations with the highest support, for associations up to order 4. Notice that some order 2 associations in Table 7.7 are also present in Table 7.10. There are also some order 3 associations, typically for breakfast items and for combinations of isolated products such as pasta, milk and biscuits. Table 7.11 shows similar results, this time ordered by confidence.

Figure 7.2 Graphical representation of association rules, based on support, confidence and lift.

Notice that the confidence values are now rather high. For instance, if a transaction contains tomato sauce, oil and crackers, it will certainly contain pasta as well, as the confidence is equal to 100%. However, the support of this rule is only 1.41%. As in Table 7.8, milk and pasta are the only heads selected. Unlike Table 7.10, there are no order 2 associations. Next we try a methodology based on tree models (Section 4.8). We have chosen pasta, the most frequent product and the most frequent head of the rule in the associations. We have built a tree model having pasta as target variable and all the other products as predictors. Among the different paths leading to the terminal nodes, we consider those paths where all variables in the path have the value 1. These paths correspond to rules with high confidence. Using a CHAID tree (CART gives similar results), we obtain the following rules:

tunny & tomato sauce → pasta
tomato sauce & rice → pasta
rice & biscuits → pasta

Table 7.10 Higher-order association rules ordered in terms of support.
and their respective measures of interestingness:

lift 1.41, confidence 95.24%, support 14.84%
lift 1.44, confidence 96.80%, support 12.14%
lift 1.40, confidence 94.23%, support 18.43%

Notice that all three rules have a high confidence. This is to be expected, as a tree model tries to develop the best predictive rules for the target variable.

7.5 Model comparison

It is quite difficult to assess local models such as association rules, simply because a global measure of evaluation conflicts with the local nature of the model. Furthermore, as the idea of searching for local patterns and rules is very recent, there is little consensus in the data mining literature on how to measure their performance (Hand, Mannila and Smyth, 2001). A natural idea is to measure the utility of patterns in terms of how interesting or unexpected they are to the analyst. As it is quite difficult to model an analyst's opinion, we usually assume a situation of completely uninformed opinion. The measures of interestingness for a rule (Section 4.8) can then be used to assess performance.

Table 7.11 Higher-order association rules ordered in terms of confidence.

In this case study we have considered support, confidence and lift as the main measures for validating a set of association rules. But the needs of the user will govern which of these three is the best one for selecting a set of rules. The support can be used to assess the importance of a rule in terms of its frequency in the database; the confidence can be used to investigate possible dependences between variables; and the lift can be used to measure the distance from the situation of independence. Ultimately, a set of rules has to be assessed on its ability to meet the analysis objectives. Here the objectives are primarily to reorganise the layout of a sales outlet and to plan promotions so as to increase revenues.
Once the associations have been identified, it is possible to organise promotions within the outlet so that the products put on offer at the same time are products which are not associated. Correspondingly, by putting one product on promotion, we also increase the sales of the associated products. At the beginning of the chapter we saw that odds ratios and log-linear models can also be employed to determine a global association structure between the buying variables; in this case traditional statistical measures, such as G2, or AIC and BIC, can be employed to assess the overall quality of a model. Although they have a different purpose, classification trees can also be seen as a global model capable of producing an association structure. Although association rules are much easier to detect and interpret, I believe that good global modelling, as expressed by log-linear and tree models, allows more stable and coherent conclusions. With enough time and sufficient knowledge to implement a global model, I think this approach should be preferred.

7.6 Summary report

• Context: this case study concerns the understanding of associations between buying behaviours. A similar kind of analysis can be applied to problems in which the main objective is cross-selling, to increase the number of products that are bought in a given commercial unit (a supermarket, a bank, a travel agency or, more generally, a company offering more than one product or service). A related class of problems arises in promotional campaigns: it is desirable to put on promotion the fewest possible number of products but to derive benefits on the largest possible number of products. This is achieved by an efficient layout of the products, putting together those that are most associated with each other.
• Objectives: the aim of the analysis is to track the most important buying patterns, where a pattern means a group of products bought together.
The chosen measure of importance determines the results of the analysis. The most common measures refer to the occurrence probability of a certain sequence (support) or to the conditional probability of buying a certain product, having bought others (confidence). There are strong analogies with the class of problems that can be analysed using the methods in Chapter 8. The crucial difference is that here we are not interested in the order in which the products are bought.
• Organisation of the data: the data is extracted from a large database containing all commercial transactions in a supermarket in a given amount of time. The transactions are made by someone holding one of the chain's loyalty cards. Although the data is structured in a transactional database, it can be simplified into a data matrix format, with rows identifying clients and columns associated with binary variables describing whether or not each product has been bought. After some simplification the data matrix contains 46 727 rows and 20 columns.
• Exploratory data analysis: exploratory data analysis was performed by looking at all pairwise odds ratios between the 20 products, for a total of 190 association measures. The results can be visualised graphically and already give important suggestions for the analysis objectives.
• Model specification: we compared association rules, originally devised for market basket analysis problems, with more structured log-linear models, the most important symmetric statistical models for analysing contingency table data.
• Model comparison: it is rather difficult to compare association rules, which are local, with log-linear models, which are global. The most effective measure of comparison has to be the practical utility of a rule; this can be measured in terms of cross-selling or by using the efficacy of a promotional campaign.
• Model interpretation: association rules seem to be easier to understand than the results from log-linear models, but it depends on how the results are presented. Results from a log-linear model can be expressed graphically, using dependences and odds ratios, and these measures can be understood. The advantage of log-linear models is that they are based on inferential statements and can therefore provide confidence intervals for an association statement, or a threshold able to 'filter' relevant rules from the many possible rules in a coherent way.

CHAPTER 8

Web clickstream analysis

8.1 Objectives of the analysis

This case study considers how visitor behaviour on a website can be predicted by analysing existing data on the order in which the site's webpages are visited. Whenever a user links to a website, the server keeps track of all their actions in a log file. What is captured is the click flow, or clickstream, of the mouse and the keys the user employs to navigate the website. Usually every click of the mouse corresponds to the viewing of a webpage, therefore we can define the clickstream as the sequence of webpages requested. A user session describes the succession of webpages seen by one user during one period logged on to the web. This may contain pages from more than one site. A server session, or session, is the set of pages for one particular site during the user session. Collectively these pages are often known as a visit. The objective of the analysis is to show how web clickstream data can be used to understand the most likely navigation paths in a website, with the aim of predicting, possibly online, which pages a visitor will view, given the path they have taken so far. This can be very useful in finding the probability that a visitor will view a certain page, perhaps a buying page in an e-commerce site. It can also find the probability of entering (or exiting) the website from any particular page.
Note that since most pages are now dynamically generated, the idea of viewing a particular page may need to be replaced with the idea of viewing a particular class of page, or type of page; a class could be defined by meta information in the header.

8.2 Description of the data

The data set comes from the log file for an e-commerce site. The source of the data cannot be specified, but it is the website of a company that sells hardware and software products; it will be known as a webshop. The accesses to the website were registered in a log file for a period of about two years, 30 September 1997 to 30 June 1999. The log file was then processed to produce a data set called sequences. This data set contains the userid (c_value), a variable with the date and the instant at which the visitor linked to a specific page (c_time) and the webpage seen (c_caller).

Applied Data Mining. Paolo Giudici. 2003 John Wiley & Sons, Ltd. ISBNs: 0-470-84679-8 (Paper); 0-470-84678-X (Cloth)

Table 8.1 The considered data set.

c_value          c_time            c_caller
70ee683a6df...   14OCT97:11:09:01  home
70ee683a6df...   14OCT97:11:09:08  catalog
70ee683a6df...   14OCT97:11:09:14  program
70ee683a6df...   14OCT97:11:09:23  product
70ee683a6df...   14OCT97:11:09:24  program

Table 8.1 reports a small extract of the available data set, corresponding to one visit. It shows that the visitor corresponding to the identifier (cookie) 70ee683a6df... entered the site on 14 October 1997 at 11:09:01 and visited, in sequence, the pages home, catalog, program, product, program, leaving the website at 11:09:24. The whole data set contains 250 711 observations, each corresponding to a click, that describe the navigation paths of 22 527 visitors among the 36 pages which compose the site of the webshop. The visitors are taken as unique; that is, no visitor appears with more than one session. But a page can occur more than once in the same session. This data set is an example of a transaction data set.
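The step from a flat click log like Table 8.1 to one page sequence per visitor can be sketched in a few lines of Python; the rows below mimic Table 8.1, and the shortened cookie value is an illustrative stand-in:

```python
from collections import defaultdict

# Hypothetical extract shaped like Table 8.1: (c_value, c_time, c_caller) rows,
# already sorted by time within each visitor.
clicks = [
    ("70ee683a6df", "14OCT97:11:09:01", "home"),
    ("70ee683a6df", "14OCT97:11:09:08", "catalog"),
    ("70ee683a6df", "14OCT97:11:09:14", "program"),
    ("70ee683a6df", "14OCT97:11:09:23", "product"),
    ("70ee683a6df", "14OCT97:11:09:24", "program"),
]

# Group the clickstream into one page sequence per visitor (equivalently, per
# session, since each visitor has a single session in this data set).
sessions = defaultdict(list)
for c_value, c_time, c_caller in clicks:
    sessions[c_value].append(c_caller)

print(sessions["70ee683a6df"])
# ['home', 'catalog', 'program', 'product', 'program']
```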
However, unlike the market basket data in Chapter 7, the order in which the pages are seen is important, and it may be the very objective of the analysis to understand the patterns that produce it. The data set can be used directly in a form like Table 8.1, with as many rows as the total number of clicks, to determine association and sequence rules. Alternatively, we can use a derived data set called visitors. It is organised by sessions and contains variables that can characterise each of these sessions. The variables include important quantitative variables, such as the total duration of the server session (length), the total number of clicks made in a session (clicks), and the time at which the session starts (start, setting 0 as midnight of the preceding day). More importantly for our analysis, this data set contains binary variables that describe whether each page is visited at least once (level 1) or not (level 0). Table 8.2 shows an extract from visitors that corresponds to the session in Table 8.1.

Table 8.2 The derived data set.

c_value          c_time            length  clicks  time      home  catalog  addcart  program  product
70ee683a6df...   14OCT97:11:09:01  24      5       11:09:01  1     1        0        1        1

The rows in Table 8.1 correspond to clicks, but the rows in Table 8.2 correspond to sessions (or, equivalently, visitors, as they are unique). There are as many rows as the total number of visits to the website. In particular, looking at the last five columns, we obtain a binary data matrix that expresses which pages, among the 36 considered, have been visited at least once in each session. Other binary variables, derived from the original 36, can be inserted in visitors; for instance, it is of interest for e-commerce to insert a variable purchase that indicates whether the session has led to at least one commercial transaction.
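The derivation of the binary page indicators in Table 8.2 from a session's page sequence can be sketched as follows (the page list is shortened to the five columns shown in the table; the full data set would use all 36 pages):

```python
# Sketch: derive a visitors-style row (binary page indicators plus clicks)
# from one session's page sequence. Pages and session follow Tables 8.1-8.2.
pages = ["home", "catalog", "addcart", "program", "product"]

session = ["home", "catalog", "program", "product", "program"]

row = {page: int(page in session) for page in pages}
row["clicks"] = len(session)

print(row)
# {'home': 1, 'catalog': 1, 'addcart': 0, 'program': 1, 'product': 1, 'clicks': 5}
```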
However, in this case study we will be mainly concerned with understanding the navigation patterns, so we will mainly consider the 36 binary variables describing the available webpages. To give an idea of the types of page, here are some of the most common ones:

• Home: the homepage of the website.
• Login: where a user has to enter their name and other personal information, during the first registration, to access certain services and products reserved for customers.
• Logpost: prompts a message that informs whether the login has been successful or whether it has failed.
• Logout: where the user can leave the personal characterisation given in the login page.
• Register: to be recognised later on, the visitor has to obtain a userid and password.
• Regpost: shows the partial results of the registration, asking for missing information.
• Results: once the registration is accomplished, this page summarises the information given.
• Regform1: here the visitor has to insert data that enables them to buy a product; the data could be a personal identification number.
• Regform2: here the visitor has to subscribe to a contract in which they accept the conditions for online commerce.
• Help: answers questions that may arise during navigation through the website.
• Fdback: a page that allows the user to go back to the previous page they have visited.
• Fdpost: a page that allows the user to go back to a previously viewed page in determined areas of the site.
• News: presents the most up-to-date products.
• Shelf: contains a list of the programs that can be downloaded from the website.
• Program: gives detailed information about the software programs that can be bought.
• Promo: demonstrates the features of a program.
• Download: allows the user to download software programs of interest.
• Catalog: contains a complete list of products on sale in the website.
• Product: shows detailed information on each product that can be purchased.
• P_info: sets out the payment terms for purchasing products on the website.
• Addcart: where the virtual basket can be filled with items to be purchased.
• Cart: shows the current status of the basket, i.e. the items it contains.
• Mdfycart: allows the user to modify the current content of the basket, perhaps by removing an item.
• Charge: indicates the required payment to buy the items contained in the basket.
• Pay_req: displays the final amount to pay for the products in the basket.
• Pay_res: here the visitor agrees to pay, and payment data is inserted (e.g. credit card number).
• Freeze: where the requested payment can be suspended, perhaps to add new products to the basket.

The data set in this chapter has also been analysed by Blanc and Giudici (2002) and Di Scala and La Rocca (2002).

8.3 Exploratory data analysis

Our main aim is to discover the most frequent sequence rules among the 36 binary variables describing whether any single page has been visited. To obtain valid conclusions, the considered data has to be homogeneous. To assess whether visitors (and, correspondingly, sequences) is homogeneous, we decide to do an exploratory analysis on it. The available quantitative variables, which are ancillary to the analysis, can be used in the exploratory phase. We now examine the univariate distributions of clicks, length and start. Table 8.3 shows summary measures for clicks. It turns out that the mean number of clicks is 11. However, the mean is affected by the presence of extreme observations to the right of the distribution, as the mode and the median consist, respectively, of 5 and 8 clicks. In most of the cases more than 5 clicks (1% quantile) and fewer than 47 clicks (99% quantile) are seen. The maximum number of clicks is 192; this appears to be an anomalous observation. To investigate this further, we consider the boxplot in Figure 8.1.
The distribution is heavily skewed to the right; the skewness coefficient in Table 8.3 is 4.7.

Table 8.3 Summary measures for clicks.

Mean            11.13 clicks
Mode            5 clicks
Minimum         1 click
Quantile 1%     5 clicks
First quartile  6 clicks
Median          8 clicks
Third quartile  13 clicks
Quantile 99%    47 clicks
Maximum         192 clicks
Skewness        4.7
Kurtosis        40.8

Figure 8.1 Boxplot of the variable clicks.

Table 8.4 shows summary statistics for length. Notice that this distribution behaves like the previous one: the mean is considerably higher than the mode and the median. This reflects the fact that there may be anomalous observations (sessions). Indeed, as the 95% quantile is equal to 47.6 minutes, we have that 95% of the sessions terminate within 47.6 minutes and 99% within 1 hour and 48 minutes. Notice that the minimum time is 0 seconds; this must be an error in the log file or in the transcription of the data from the log file, so this observation can be safely eliminated as an outlier.

Table 8.4 Summary measures for length.

Mean            12.33 minutes
Mode            2.08 minutes
Minimum         0 seconds
Quantile 1%     45 seconds
First quartile  2.65 minutes
Median          5.12 minutes
Third quartile  11.4 minutes
Quantile 99%    108 minutes
Maximum         65.5 hours
Skewness        58.6
Kurtosis        5666

Table 8.5 shows summary statistics for start, which indicates the time when the website is first entered. The distribution appears skewed to the left. The mean and median starting time of connection are around 14:30, the beginning of the afternoon. Figure 8.2 shows the boxplot of the distribution, measuring time in seconds.

Table 8.5 Summary measures for start.

Mean            14:30
Mode            9:49
Minimum         2:30
First quartile  11:11
Median          14:43
Third quartile  18:17
Quantile 99%    23:40
Maximum         23:59
Skewness        -0.49
Kurtosis        0.19

Figure 8.2 Boxplot of the variable start.
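The summary measures in Tables 8.3 and 8.4 are standard descriptive statistics; a small Python sketch on an invented right-skewed sample shows how mode, median and a moment-based skewness coefficient behave:

```python
import statistics

# Hypothetical right-skewed sample of clicks per session, to illustrate the
# summary measures in Table 8.3 (not the real data).
clicks = [3, 5, 5, 5, 6, 7, 8, 8, 10, 13, 15, 21, 34, 47, 96]

mean = statistics.mean(clicks)
median = statistics.median(clicks)
mode = statistics.mode(clicks)
stdev = statistics.pstdev(clicks)

# Moment-based skewness: E[(X - mean)^3] / sd^3; positive for a right skew.
skew = sum((x - mean) ** 3 for x in clicks) / len(clicks) / stdev ** 3

print(mode, median)    # 5 8
print(round(skew, 2))  # positive, confirming the right skew
```

As in the real data, a mode and median well below the mean are the footprint of a long right tail.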
It shows a distribution that is indeed skewed to the left, but the skewness is reflected in a large group of observations in the left tail (night visitors), not in an asymmetric aspect of the central box (containing the central 50% of the observations). On the basis of the previous results we decide to remove the outliers. From now on, for clarity, we will express start in hours and length in minutes. We eliminate all observations above the 99th percentile of the distributions for clicks and length, which behave in a similar way. But we do not remove the possible outliers for start. This is because of the nature of start and the appearance of the observed distribution. Furthermore, these extreme users may well be the most valuable customers. After removing the outliers, visitors contains 22 152 observations, instead of the initial 22 527. We now consider a descriptive analysis of the 36 binary variables at hand, on the cleaned data set. The fourth column of Table 8.6 contains the relative frequency of visit (support) for each of the 28 most frequent pages. Furthermore, for e-commerce purposes, we have indicated the frequency distribution of the same variables, conditional on the values of purchase. The overall proportion of purchasers is 7.21%, therefore the fourth column can be obtained as a weighted average of the second and third, with weights equal to 92.79% and 7.21% respectively.

Table 8.6 Frequency distribution of the visits, and of the purchase.
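The outlier-removal rule (drop sessions above the 99th percentile of clicks and length, keep all values of start) can be sketched as follows; the five sessions and the percentile helper are illustrative only:

```python
# Sketch of the outlier-removal step: drop sessions whose clicks or length
# exceed the 99th percentile of their distributions (start is left untouched).
def percentile(values, p):
    """Linear-interpolation percentile, for p in [0, 100]."""
    xs = sorted(values)
    k = (len(xs) - 1) * p / 100
    lo, hi = int(k), min(int(k) + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (k - lo)

# Hypothetical stand-in for the visitors data set (length in minutes).
sessions = [
    {"clicks": c, "length": l}
    for c, l in [(5, 2.1), (8, 5.0), (13, 11.4), (21, 30.0), (192, 900.0)]
]

max_clicks = percentile([s["clicks"] for s in sessions], 99)
max_length = percentile([s["length"] for s in sessions], 99)

cleaned = [s for s in sessions if s["clicks"] <= max_clicks
           and s["length"] <= max_length]

print(len(sessions), len(cleaned))  # the extreme session is removed
```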
Page        Purchase = 0   Purchase = 1   Frequency of visit
ADDCART     26.33%         99.56%         31.61%
AGB         5.12%          8.33%          5.35%
CART        15.67%         55.73%         18.56%
CATALOG     42.68%         68.13%         44.52%
CHARGE      0.06%          1.63%          0.17%
DOWNLOAD    3.96%          65.44%         8.40%
FDBACK      2.47%          1.88%          2.42%
FDPOST      0.78%          0.81%          0.79%
FREEZE      19.00%         99.94%         24.84%
HELP        8.49%          8.58%          8.50%
HOME        49.32%         40.89%         48.71%
LOGIN       40.18%         65.87%         42.03%
LOGOUT      1.90%          4.01%          2.05%
LOGPOST     30.42%         73.64%         33.53%
MDFYCART    3.46%          10.33%         3.96%
NEWS        9.60%          4.76%          9.25%
P_INFO      49.52%         32.50%         48.29%
PAY_REQ     10.45%         97.43%         16.73%
PAY_RES     5.05%          90.61%         11.22%
PRODUCT     83.69%         96.49%         84.62%
PROGRAM     76.81%         75.89%         76.74%
PROMO       0.27%          0.88%          0.32%
REGFORM1    1.24%          2.07%          1.30%
REGFORM2    1.28%          1.31%          1.29%
REGISTER    26.17%         35.94%         26.88%
REGPOST     16.79%         32.25%         17.91%
RESULT      4.15%          4.45%          4.18%
SHELF       15.90%         74.64%         20.13%

Table 8.7 Conditional means of the quantitative variables with respect to purchase.

Purchase   N. obs.   Variable   Mean
0          20555     clicks     10
                     length     9
                     start      14
1          1597      clicks     17
                     length     21
                     start      14

The most visited pages are, in decreasing order, product (84.62%), program (76.74%), home (48.71%), p_info (48.29%) and catalog (44.52%). Although we are not interested here in studying the dependence of purchase on the other variables, we can draw interesting preliminary conclusions from Table 8.6. For instance, notice that, even if 31.61% of visitors add at least one product to the virtual basket (addcart = 1), the application of Bayes' rule (Section 5.1) shows that the percentage of them who actually purchase something is only equal to (99.56% × 7.21%)/31.61% = 22.71%. Further conclusions about purchase can be drawn by considering the mean values of the quantitative variables, conditional on the values of purchase. They are described in Table 8.7.
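The Bayes' rule computation above can be checked numerically:

```python
# Checking the Bayes' rule computation in the text:
# P(purchase | addcart) = P(addcart | purchase) * P(purchase) / P(addcart),
# with the probabilities taken from Table 8.6.
p_purchase = 0.0721                # overall proportion of purchasers
p_addcart_given_purchase = 0.9956  # addcart frequency among purchasers
p_addcart = 0.3161                 # overall addcart frequency

p_purchase_given_addcart = p_addcart_given_purchase * p_purchase / p_addcart
print(round(p_purchase_given_addcart, 4))  # 0.2271, i.e. about 22.7%
```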
From Table 8.7 notice that the time of entering the website (start) is about the same, but the purchasers make, on average, more clicks (17 against 10) and stay longer in the site (21 minutes against 9 minutes). This could occur because purchasing might take a long time and require many clicks to confirm the order. Given the heterogeneous nature of the navigators, we decide to perform a cluster analysis, in order to find homogeneous clusters of behaviours. Our primary goal is not cluster analysis per se, so cluster analysis can be seen as preliminary to the local models we are seeking. The clustering variables we consider are the three quantitative variables start, length and clicks, plus the binary variable purchase, all of them instrumental to our objective of understanding navigation patterns. We begin with a hierarchical method to find the number of groups and then try a non-hierarchical method to allocate observations into the determined number of groups. We use the Euclidean distance as our distance function and Ward's method (Section 4.2) as our hierarchical method, after some comparative experiments. To allocate observations we choose the K-means non-hierarchical method. From the hierarchical method, we obtain the number of clusters as 4. In fact, a further reduction in the number of clusters leads to a noticeable decrease in R2 and an increase in SPRSQ. This can be seen in Figure 8.3, which plots R2 and SPRSQ versus the number of groups in the hierarchical agglomerative algorithm. We then apply the K-means algorithm to the data, with the number of clusters set to 4. Table 8.8 summarises the results; it shows the size of each cluster and the mean values of the four variables used in the classification. For purchase the mean represents the proportion of actual purchases.

Figure 8.3 Behaviour of R2 (increasing) and SPRSQ (decreasing).

Table 8.8 Results of the cluster analysis.

Cluster   N. obs.   Variable   Cluster mean   Overall mean
1         8802      clicks     8              10
                    length     6 min          10 min
                    start      18 h           14 h
                    purchase   0.034          0.072
2         2859      clicks     22
                    length     17 min
                    start      15 h
                    purchase   0.241
3         1240      clicks     18
                    length     59 min
                    start      13 h
                    purchase   0.194
4         9251      clicks     8
                    length     6 min
                    start      10 h
                    purchase   0.039

Notice that there are two bigger groups, 1 and 4. From this final cluster allocation we have R2 = 59%, which indicates a good performance given the complexity of the data. To better understand the cluster configurations, the percentages of visits to the 36 webpages can be calculated separately for each cluster. There is not enough space to publish all the results, so here is a summary description of the mean profile of each cluster, obtained by interpreting the percentages and the conditional means in Table 8.8:

• Cluster 1: they connect in the late afternoon, at about 18; they visit about 8 pages (2 fewer than the overall mean); the length of the visit is about 6 minutes (4 less than the overall mean), and they purchase less than the mean (about 3.4% against 7.21%). They are particularly interested in the pages product (82% viewed) and program (75.74% viewed); the other pages are viewed in a proportion well below the overall mean.
• Cluster 2: they link up at about 15; they visit many pages (22, 12 more than the overall mean); they stay connected for a long time (17 minutes) and purchase much more than the others (24% against the overall 7.21%). Consequently, they are very interested in all pages related to purchase: product (95% viewed), addcart (63% viewed) and freeze (55% viewed). They are also frequent visitors, as the register page is visited less than the average (47%).
• Cluster 3: they link up at about 13; they visit many pages (18); they stay connected for a very long time (about 1 hour) and purchase more than the others (19% against the overall 7.21%).
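The allocation step (K-means, with the number of clusters fixed beforehand by the hierarchical method) can be sketched in a few lines; this toy implementation and the two-dimensional points are purely illustrative, not the book's SAS procedure:

```python
import random

# Minimal K-means sketch of the allocation step described above. The book
# first chooses k = 4 with Ward's hierarchical method; here k is taken as
# given and the points are hypothetical standardised session features.
def kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[i])))
            groups[j].append(p)
        # Recompute each centroid as the mean of its group.
        centroids = [
            [sum(c) / len(g) for c in zip(*g)] if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids, groups

# Two well-separated blobs, so k = 2 recovers them exactly.
points = [[0.1, 0.2], [0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
centroids, groups = kmeans(points, k=2)
print(sorted(len(g) for g in groups))  # [3, 3]
```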
They seem to be very similar to the visitors in cluster 2 with respect to the visited pages, but they seem to be less frequent visitors and they stay connected for longer.
• Cluster 4: they link up at about 10, four hours earlier than the mean; they visit fewer pages (8); they stay connected for a shorter time than the overall population (6 minutes) and purchase little (only 3.9%). They are similar to the visitors in cluster 1, as they often visit product (83%) and program (75%). The difference between the two clusters seems to be in the starting time.

The results obtained from cluster analysis confirm the heterogeneity of behaviours. To find navigation patterns, we decide to concentrate on only one cluster and we choose cluster 3. This choice is obviously subjective but it has two important features. First, the visitors in this cluster stay connected for a long time and visit many pages. This will help us explore the navigation sequences between the webpages. Second, this cluster has a high probability of purchase; it seems important to consider the typical navigation pattern for a group of high purchasers. We shall therefore consider a reduced data set, corresponding to cluster 3, with 1240 sessions and 21 889 clicks.

8.4 Model building

8.4.1 Sequence rules

To meet the objectives in Section 8.1, and for the type of data available, we begin by applying the associative local models in Section 4.8. Since the transactions considered here are visit sessions, the order in which pages are visited is important, so we shall consider the application of sequence rules, which are association rules ordered by a variable. In our case this variable is start. Sequence rules are to be calculated on the sequence data set corresponding to cluster 3, which looks like the data in Table 8.1.
Furthermore, to obtain useful results, we have appended two extra pages: at the beginning of each session there is a start_session page and at the end of each session there is an end_session page. The total number of pages is therefore 38. To choose a sequence model, we have applied the Apriori algorithm implemented in SAS Enterprise Miner, setting a support threshold equal to 0.05 × support(mode), where support(mode) indicates the support of the most seen page among the 38 considered. As a result of the algorithm, we can obtain the sequence rules with the highest interestingness, as measured by the support and confidence. Consider a sequence A → B. Suppose for simplicity that it is an indirect sequence of order 2 and indicate with N(A→B) the number of visits in which the sequence appears at least once. Let N be the total number of server sessions. Based on Section 4.8, the support for A → B is obtained by dividing the number of server sessions which satisfy the rule by the total number of server sessions. Therefore the support is a relative frequency that indicates the percentage of users for which the two pages have been visited in succession. The confidence of A → B is obtained by taking the number of server sessions which satisfy the rule and dividing it by the number of sessions containing page A. Therefore the confidence expresses the probability that, during a server session in which page A has been viewed, page B has subsequently been viewed.

Table 8.9 The most frequent indirect sequences of order 2.

Table 8.9 reports the most likely sequences of order 2, for the sequence data set of cluster 3. In other words, we search for a pattern of sequences characterised by maximum support among all sequences of order 2. Notice that most of the sequences involve start_session and/or end_session; this is obvious, as each session will contain both of them and we are considering indirect rules.
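These definitions of support and confidence for an indirect order-2 sequence rule translate directly into code; the sessions below are invented:

```python
# Sketch of support and confidence for an (indirect) order-2 sequence rule
# A -> B: A and B visited in that order, possibly with other pages in between.
sessions = [
    ["start_session", "home", "program", "product", "end_session"],
    ["start_session", "program", "catalog", "product", "end_session"],
    ["start_session", "product", "program", "end_session"],
    ["start_session", "home", "end_session"],
]

def satisfies(session, a, b):
    """True if page b occurs somewhere after an occurrence of page a."""
    return any(a == x and b in session[i + 1:] for i, x in enumerate(session))

def seq_support(a, b):
    return sum(satisfies(s, a, b) for s in sessions) / len(sessions)

def seq_confidence(a, b):
    n_a = sum(a in s for s in sessions)
    return sum(satisfies(s, a, b) for s in sessions) / n_a

print(seq_support("program", "product"))            # 0.5
print(round(seq_confidence("program", "product"), 3))  # 0.667
```

Note that the third session contains both pages but in the wrong order, so it contributes to the denominator of the confidence but not to the numerator.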
Among the other sequences, program → product is the sequence with the highest frequency, followed by product → product; product → product would not be detected by an association rule that is not sequential. Table 8.9 establishes that the estimated probability is 64.68% that the product page is seen at least twice in a session. We now search for the most supported indirect sequences among those of any order. Table 8.10 reports the results from the Apriori algorithm.

Table 8.10 The most frequent indirect sequences of any order.

Besides those already examined, notice that the most frequent sequences are sequences to do with visiting the most seen pages in the cluster, namely start_session → product → end_session and start_session → program → end_session. We now search for the most likely direct sequences. Although direct sequences of order 2 can be easily obtained from the software, this is not the case for higher-order sequences. Table 8.11 reports the most likely direct sequences of order 2.

Table 8.11 The most frequent direct sequences of order 2.

Notice that program → product is the most likely direct sequence, but its support is now about 0.60, against 0.73 for indirect sequences (Table 8.10). This means that other pages are viewed between program and product on 13% of occasions. Table 8.11 also allows us to draw sensible conclusions about which pages are the entrance pages: home, program, product, logpost. These are listed among those having the highest support of occurring with start_session. But there are no sequences containing end_session as head of a rule; this means that the site is typically entered from a few pages (the four that are listed) but it is exited from a wide variety of pages. Sequence rules can also be deduced by using a classification tree that has as its target variable one of the considered pages, which acts as a supervisor.
8.4.2 Link analysis

We now consider how we can take the results from the sequence rules and use link analysis to build up a global model. In the Enterprise Miner implementation, link analysis takes as its input those sequences having a prespecified support. For comparison we will consider all indirect sequences of any order up to a maximum of 10. As in Section 8.4.1, we will use the default support threshold of 0.05 × support(mode). Link analysis considers each of the obtained sequences as a row observation in a data set called link. It then counts how many of the observations include a certain sequence. This is called the count of a sequence and is the fundamental measure for link analysis. The main output of a link analysis is a graph of nodes and links. Figure 8.4 shows the graph for our webshop data.

Figure 8.4 Link analysis graph.

In the graph each page is represented by a node, and links are drawn between the nodes. A link is drawn between two nodes if the count of the corresponding sequence of order 2 is non-null (i.e. the sequence is contained at least once in the link data set). The graph therefore tells us which nodes are connected and which are not. Usually the thickness of a link is directly related to the size of the count. For instance, the link between home and program is quite thick, as the corresponding sequence appears often in the link data set. Links are directed; for instance, to orientate an edge between two nodes A and B, the two counts of A → B and B → A in the link data set are compared, and the higher count determines the orientation. Both orientations will be present when there is substantial parity. For instance, shelf precedes download in the graph, as the count of shelf → download is higher than that of download → shelf. The size of the nodes typically depends on a so-called centrality measure. This concept comes from ideas in social networks.
A first-order centrality measure (C1) basically means that the importance of a node depends on the number of connections it has. On the other hand, a second-order centrality measure (C2) means that the importance of a node depends on the number of connections that the nodes connected to it have. In both cases each connection and link can be weighted according to its count in the link data set. We will describe the size of a node using an unweighted first-order centrality measure. Therefore in Figure 8.4 the pages end_session, product, program and shelf turn out to be the largest nodes, as they are connected to many others. Table 8.12 gives the values of the variables used to plot Figure 8.4. The position of a node may also depend on the counts. The counts between each pair of pages are put in a proximity matrix. Multidimensional scaling is then used to reproduce these proximities with a bidimensional Euclidean distance, and correspondingly to derive two (X, Y) points. The higher the count, the closer the points on a Cartesian graph. For instance, product and program have a large count and their points are close together.

Table 8.12 Coordinates from the link analysis.

8.4.3 Probabilistic expert systems

Probabilistic expert systems build up global statistical models by means of subsequent local factorisations. Although there are strong similarities with sequence rules, the difference is substantial. In the application considered here, sequence rules state that the visit to page B depends on the visit to page A, and establish that this is true if the support (and possibly the confidence) of the rule A → B is higher than a prefixed threshold. In a probabilistic expert system the binary random variable B depends on the occurrence of the binary random variable A if P(B | A, others) ≠ P(B | others), where 'others' are the other variables considered.
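The two centrality measures can be sketched on a toy version of the link graph (the edge list is invented, loosely echoing Figure 8.4):

```python
# Sketch of unweighted centrality on a toy link graph (edges as ordered pairs).
# C1: number of connections of a node; C2: total connections of its neighbours.
edges = [("home", "program"), ("program", "product"), ("shelf", "download"),
         ("product", "end_session"), ("program", "end_session")]

def neighbours(node):
    return {b for a, b in edges if a == node} | {a for a, b in edges if b == node}

def c1(node):
    """First-order centrality: how many nodes this node is linked to."""
    return len(neighbours(node))

def c2(node):
    """Second-order centrality: sum of the neighbours' own connection counts."""
    return sum(c1(n) for n in neighbours(node))

print(c1("program"))  # 3
print(c2("program"))  # 1 + 2 + 2 = 5
```

A weighted version would replace each unit contribution with the count of the corresponding sequence in the link data set.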
Therefore probabilistic expert systems are global models for the dependence between variables, whereas association rules are local models for the dependence between the occurrence of events. In our present context the pages start_session and end_session are not random variables, hence they will not appear in the model, which will contain at most 36 variables. A further difference is that discrete probabilistic expert systems are actually calculated using contingency tables, in a similar way to odds ratios (Chapter 7), therefore they cannot easily take account of temporal order. To compare probabilistic expert systems and association rules, we will fit a probabilistic expert system to the available data set, with 1240 sessions. This methodology is not yet fully implemented in SAS, so we have to build it using a sequence of logistic regressions. The problem with this approach is that no a priori ordering of the variables is given, so each attempted logistic regression has been based on all the chosen variables, except the one adopted as response. Alternatively, we could have used a software package that supports probabilistic expert systems, such as Hugin (www.hugin.com). The graphical model in Figure 8.5 has been obtained on the basis of 17 logistic regression models, one for each of the 17 pages with highest support. The relevant page is taken as the target variable and the other 16 binary variables are used as explanatory variables. The model is built from the logistic regression results by supposing that if a significantly positive odds ratio occurs, this corresponds to the existence of a link from the relevant explanatory variable to the target variable. The link is represented on the graph by an arrow from the explanatory variable to the target variable. Because we have no a priori logical ordering of the variables, we have not obtained a single model, but a graph that can be compatible with different orderings.
Figure 8.5 can be compared with the results obtained from the indirect and direct sequence rules. In general, sequences can contain start_session and end_session, but the expert systems cannot contain them, as they are not random variables. Consequently, most of the selected sequences are not in the graph. Another main difference is that the expert systems cannot contain rules of the type A → A, as they are based on an unordered data set (visitors); this is especially noticeable with indirect rules. Here most of the direct rules chosen correspond to a link in the expert systems. Indeed, probabilistic expert systems can be deemed closer to direct rules.

Figure 8.5 Directed graph built by recursive logistic regression (nodes: addcart, cart, freeze, login, logpost, regpost, register, shelf, download, pay_res, pay_req, program, help, catalog, home, product, p_info).

The model in Figure 8.5 determines the page variables that most influence webpage selection; since this model does not consider the order of the visits, these pages can precede or follow the visit to the target page. In the graph there are double arrows, so the model is not a true probabilistic expert system; that would require us to specify the order of the variables. For instance, we could use the order specified by the selected indirect or direct rules. Even though this is a global model, it can still give precise local statements. For instance, the odds ratios for the variable product with all 16 other variables can be estimated from the local logistic regression of product on all the others. These odds ratios are reported in Table 8.13. It turns out that product is significantly positively associated with p_info, pay_req, addcart and program, and these are precisely the nodes to which it is linked in Figure 8.5.
Blanc and Tarantola (2002) give a different analysis of this data set, comparing probabilistic expert systems and dependency networks, a class of more elaborate recursive graphical models (Heckerman et al., 2000).

8.4.4 Markov chains

Probabilistic expert systems are global models for analysing the dependence between variables; they are different from sequence rules, which model dependence between events. Markov chain models (Section 5.7) can be used as global models for sequences of events. Here we consider discrete Markov chains. The idea is to introduce dependence between time-specific variables. In each session, to each time point i, corresponding to the ith click, there corresponds a discrete random variable with as many levels as the number of pages; these are called states of the chain. The observed ith page in the session is the observed realisation of the Markov chain, at time i, for that session. Time can go from i = 1 to i = T, and T can be any finite number. A session can stop well before T; in this case the last page viewed is an absorbing state (end session for our data). A Markov chain model establishes a probabilistic dependence between what is seen before time i and what will be seen at time i. In particular, a first-order Markov chain, which is the model we consider here, establishes that what is seen at time i depends only on what is seen at time i − 1.

Table 8.13 Estimated odds ratios of the page product.

Effect      Point Estimate   95% Wald Confidence Limits
PROGRAM          9.461           5.105      17.534
HOME             0.593           0.321       1.096
CATALOG          1.358           0.724       2.547
LOGIN            2.066           0.790       5.403
ADDCART         17.889           5.305      60.323
LOGPOST          0.494           0.190       1.284
P_INFO          36.115          10.821     120.535
FREEZE           2.136           0.357      12.769
PAY_REQ         21.256           0.489     924.314
REGISTER         0.315           0.140       0.708
SHELF            0.384           0.161       0.916
CART             0.375           0.168       0.836
PAY_RES          0.202           0.007       5.906
REGPOST          1.663           0.645       4.288
DOWNLOAD         0.353           0.125       0.997
HELP             1.613           0.543       4.794
This short-memory dependence can be assessed by a transition matrix that establishes the probability of going from any one page to any other page in a single step. For 36 pages there are 36 × 36 probabilities of this kind. The conditional probabilities in the transition matrix can be estimated from the available conditional frequencies. By making the assumption that the transition matrix is constant in time (homogeneity of the Markov chain), we can use the frequencies of any two adjacent pairs of time-ordered clicks to estimate the conditional probabilities. Note the analogy of Markov chains with direct sequences. Conditional probabilities in a first-order Markov model correspond to the confidence of order 2 direct sequence rules, therefore a first-order Markov chain is a model for direct sequences of order 2. Furthermore, it can be shown that a second-order Markov model is a model for direct sequences of order 3, and so on. The difference is that the Markov chain model is global not local. This is mainly reflected in the fact that Markov chains consider all pages, not just those with a high support. The Markov model is a probabilistic model, hence it allows inferential results to be obtained. To build a Markov chain model (first-order and homogeneous), we have reorganised the data in a new data set called page, in which there are two variables: page now and page tomorrow. page now indicates what is viewed by the visitor at a certain click and page tomorrow indicates what is viewed immediately afterwards. Table 8.14 is an extract from this data set. Each row in Table 8.14 corresponds to a pair of pages that are seen one after the other. For a given visitor, the second term in a row becomes the first term in the next row. For a new visitor, we start with a new pair of pages and apply the same rule. The transition matrix can be calculated from this data set.

Table 8.14 The data set page for the Markov chain model.
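Estimating a homogeneous first-order transition matrix from the page now / page tomorrow pairs can be sketched as follows. The session data here are made up for illustration, and the matrix is stored as a nested dictionary of non-zero entries rather than a full 37 × 37 array.

```python
from collections import Counter, defaultdict

def transition_matrix(sessions):
    """Estimate first-order homogeneous Markov transition probabilities
    from click sessions, adding start_session / end_session states."""
    counts = defaultdict(Counter)
    for pages in sessions:
        path = ["start_session"] + list(pages) + ["end_session"]
        for now, nxt in zip(path, path[1:]):          # adjacent ordered pairs
            counts[now][nxt] += 1
    probs = {}
    for now, row in counts.items():
        total = sum(row.values())
        probs[now] = {nxt: c / total for nxt, c in row.items()}  # rows sum to 1
    return probs

# hypothetical sessions, not the book's 1240-session data set
sessions = [["home", "program", "product"],
            ["home", "catalog", "program"],
            ["product", "p_info", "product"]]
P = transition_matrix(sessions)
print(P["start_session"])   # entrance-page probabilities (row start_session)
```

Because each row is normalised by its own total, every estimated row sums to 1, mirroring the homogeneity assumption described above.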
It is a table with 37 rows (the 36 pages and start session) and 37 columns (the 36 pages and end session). The rows represent page now (they thus exclude end session, which cannot appear first) and the columns represent page tomorrow (they exclude start session, which cannot appear second). Therefore a transition matrix contains a total of 37 × 37 = 1369 estimated probabilities. There is not enough space to publish the whole transition matrix. First of all, we can evaluate where a visitor is most likely to enter the site. To obtain this we have to consider the transition probabilities of the row start session. Figure 8.6 is a graph of the probabilities, excluding those estimated to have a null probability. The most frequent entrance page is home (48.81%), followed by program (17.02%), product (14.52%), logpost (9.19%) and catalog (6.77%). This is consistent with the nature of the visitors belonging to this cluster. Unlike association rule confidences, the transition probabilities are normalised: each row of the matrix sums to 1. We can then consider the most likely exit pages. To obtain this we have to consider the transition probabilities of the column end session. Figure 8.7 is a graph of the probabilities, excluding those estimated to have a null probability. The most likely exit page is product (20.81%), followed by logpost (12.10%), download (10.56%), home (9.19%) and p info (6.94%).

Figure 8.6 Representation of the transition probabilities from the page start_session.

Figure 8.7 Representation of the transition probabilities to the page end_session.

From the transition matrix we can establish a path that connects nodes through the most likely transitions. One possibility is to proceed forward from start session. From start session we can connect to the page with which start session has the highest probability of transition; this is home. Next follows program, then product and then p info. From p info the most likely transition is back to product, so the path ends. This information can be represented graphically as in Figure 8.8, which includes estimates of the conditional probabilities of transition.

Figure 8.8 The forward path of the most likely transitions: start_session → home (45.81%) → program (17.80%) → product (70.18%) → p_info (26.73%).

We can compare this with the path that would be obtained using the confidence index in the sequence rule algorithm. Initially, from start session the highest confidence is reached for home (45.81%, as obtained with Markov chains), followed by program (20.39%, more than with Markov chains) and product (78.09%, as against 70.18%). The next most probable link is now with addcart (28.79%) rather than p info. The differences with Markov chains are due to the fact that the direct sequences consider only pages which pass a certain support. Following similar logic, we can also construct a backward path as in Figure 8.9, which is built starting from end session. The page with the highest transition probability to end session is product (20.81%). Then the highest transition to product is program (42.41%), and so on. Here the path is complete, as it reaches start session. The paths in Figures 8.8 and 8.9 are not the most likely ones; these could be calculated by comparing all Markov chains of all orders, a formidable computational task. Alternatively, sequence rules can be used (Section 8.4.1); they are local, easy to calculate and easy to implement, even for large data sets. But because they are local, they are not normalised, therefore they do not sum to 1. We conclude this Markov chain analysis by describing the estimated transitions for some typical pages:

• addcart: the highest transition is to freeze (57.01%), followed by modifycart (8.24%), addcart (7.49%) and login (7.38%). On the other hand, addcart is mostly reached by product (78.82%).
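The forward-path construction (always follow the most likely transition, stopping at the exit state or when the path would loop back on itself) can be sketched like this. The transition probabilities below are hypothetical round numbers loosely echoing those in the text, not the estimated matrix.

```python
def forward_path(P, start="start_session", stop="end_session", max_len=10):
    """Follow the most likely transition from each page, stopping at the
    exit state, on a revisit (loop), or after max_len steps."""
    path = [start]
    while path[-1] != stop and len(path) < max_len:
        nxt = max(P[path[-1]], key=P[path[-1]].get)   # most likely transition
        if nxt in path:                               # would loop back: stop
            break
        path.append(nxt)
    return path

# illustrative (made-up) transition rows, each listing only non-zero entries
P = {"start_session": {"home": 0.46, "program": 0.17, "product": 0.15},
     "home": {"program": 0.18, "catalog": 0.15},
     "program": {"product": 0.70, "home": 0.10},
     "product": {"p_info": 0.27, "addcart": 0.20},
     "p_info": {"product": 0.55, "home": 0.10}}
print(forward_path(P))
# -> ['start_session', 'home', 'program', 'product', 'p_info']
# the walk stops at p_info, whose best transition loops back to product
```

The backward path works the same way on the incoming probabilities, starting from the exit state.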
Figure 8.9 The backward path of the most likely transitions: start_session → home → catalog → program → product → end_session (the figure reports the probabilities 20.81%, 42.14%, 38.41%, 29.22% and 22.28% along the path).

• shelf: once the programs in shelf are seen, most visits go to download (39.35%), back to shelf (16.06%) or to cart (8.09%).
• download: once a program is downloaded, visitors return to the same page (46.88%), or leave the site (end session is 20.47%), and on 13.44% of occasions they go to shelf.
• catalog: after catalog, the most seen page is program (78.73%), while all others are visited on less than 5% of occasions. On the other hand, catalog is reached from home (22.28%), logpost (13.60%) and product (10.87%).
• pay res: having consented to pay, a visitor goes to shelf with a probability of 33.77%, to cart (14.83%) or to end session (11.70%). On the other hand, pay res is reached from pay req on 81.55% of occasions.

8.5 Model comparison

As discussed in Chapter 6, it is quite difficult to evaluate local models. Here the situation is complicated by the fact that in Section 8.4 we compared local models with global models. For global models, such as probabilistic expert systems and Markov chains, statistical evaluation can proceed as in Sections 6.2, 6.4 and 6.5, or in terms of computationally intensive methods. But the real problem is how to compare them with sequence rules. In the absence of other considerations, a simple and natural scoring function for a sequence rule is its support, which gives the proportion of the population to which the rule applies. Alternatively, if the objective is to predict visiting behaviour, in terms of conditional probabilities of moving from one page to another, the confidence or the lift may be used. Among the considered models, direct sequences and Markov chains are comparable with each other and can be more easily interpreted, as they lead to statements having a unique meaning (these are direct sequence rules to which a conditional probability is attached).
Indirect sequences can be seen as more exploratory; they lead to the link analysis representation, which is a very helpful global picture of the relationships between pages. On the other hand, probabilistic expert systems do model problems of a different nature, in which the modelled dependency is between random variables rather than event occurrences. We now compare briefly the results obtained with direct sequence rules and Markov chain models. The set of rules selected in Table 8.11 can be scored in terms of support, as we have already done. To measure how usefully the body of a rule predicts the head of the rule, we can calculate its lift. We do this by taking each confidence in the table and dividing it by the support of the corresponding head (which consists of a single page); these supports are given in Table 8.15. Dividing the confidences in Table 8.11 by the corresponding supports in Table 8.15, we find that 24 of them have a lift greater than one; some have a high lift value and are therefore very informative. This is the case for the sequences program → product (lift = 4.35) and catalog → program (lift = 7.64). The lift of 7.64 is saying that the probability of visiting program is about 7.64 times greater when catalog is seen beforehand than it would be by chance. On the other hand, the rules product → product, start session → product and home → product are not informative, as their lift values are close to 1 (indeed slightly lower). Similar conclusions can be drawn from the Markov chain model; the lift of each transition can be calculated by dividing the transition probabilities by the initial state probabilities, corresponding to the probabilities which are (partly) reproduced in Table 8.15. It can thus be deduced, for example, that the lift of program → product is 0.7018/0.1793 = 3.91, and the lift of catalog → program is 0.7873/0.1077 = 7.31.
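The lift arithmetic quoted above is simply a confidence (or transition probability) divided by the marginal support of the head page:

```python
def lift(prob, head_support):
    """Lift: conditional probability of reaching the head divided by the
    head's marginal support; values near 1 mean the rule is uninformative."""
    return prob / head_support

# transition probabilities and supports quoted in the text (Table 8.15)
print(round(lift(0.7018, 0.1793), 2))  # program -> product: 3.91
print(round(lift(0.7873, 0.1077), 2))  # catalog -> program: 7.31
```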
Notice that both lifts are slightly lower than for the sequence rules; this is because sequence rules only consider as transition candidates those pages whose joint support is higher than a certain threshold, whereas Markov chains consider them all.

Table 8.15 Support of the most frequent pages in cluster 3.

Page            Frequency   Percent
product              4366    17.93%
program              2622    10.77%
home                 1944     7.99%
p_info               1715     7.04%
catalog              1279     5.25%
end_session          1240     5.09%
start_session        1240     5.09%
login                1135     4.66%
freeze               1120     4.60%
addcart               935     3.84%
pay_req               879     3.61%
logpost               857     3.52%
shelf                 803     3.30%
cart                  644     2.65%
download              640     2.63%
register              631     2.59%
pay_res               607     2.49%
regpost               504     2.07%

Ultimately, though, a set of rules has to be assessed on its ability to meet the analysis objectives. In this case study the informative value of start session → end session is null, but in Table 8.10 it has the largest support and confidence (100%). On the other hand, the informative value of the rules that go from start session to other pages, and from other pages to end session, can be extremely important when designing the website.

8.6 Summary report

• Context: this case study concerns the identification of navigation patterns in a website. A similar kind of analysis can be applied to problems in which there is discrete time-ordered data identifying behavioural patterns; for instance, in consecutive commercial transactions (online or otherwise), or in studying people's careers. The peculiarity of web behaviour is that the average number of transactions per individual, in a given period, is higher than in other contexts.
• Objectives: the aim of the analysis is to track the most important patterns of visits, where a pattern means a time-ordered sequence of pages, possibly repeated. The chosen measure of importance determines the results of the analysis.
The most common measures refer to the occurrence probability of a certain viewing sequence (support) or to the conditional probability of viewing a certain page, having viewed others in the past (confidence).
• Organisation of the data: data is extracted from the log file that registers access to a website. It is structured in a transaction database. It may be further simplified into a data matrix format, but then the temporal order will be lost.
• Exploratory data analysis: this phase of the analysis is necessary to draw valid conclusions. On the basis of three continuous variables related to visitor characteristics – the number of clicks in the session, the time of the connection and the length of the connection – we removed outlying observations, which are bound to arise in this context. Furthermore, a preliminary cluster analysis identified four distinct behaviours. This allows us to identify patterns for homogeneous groups of visitors. We concentrated on one cluster.
• Model specification: the data mining analysis needed here is an example of a local model or pattern. We therefore compared sequence rules, based on the Apriori algorithm, with link analysis. Furthermore, we considered more traditional statistical techniques, based on the whole data set but using local computations – probabilistic expert systems and Markov chain models.
• Model comparison: it is rather difficult to assess local models; consequently, it is also hard to compare them with global statistical models. Another difficulty when comparing local models with global models is that they are based on different assumptions. Sequence rules can be compared directly with Markov chains. It turns out that the most probable patterns identified by the two procedures are rather similar. The choice between the two methods therefore depends on the scope of the rules themselves.
• Model interpretation: the rules can be easily interpreted, but there may be a problem if the analysis produces a very large number of them. Statistical models, such as Markov chains, make it considerably easier to select the most relevant rules, as they can be evaluated in a more coherent way.

CHAPTER 9 Profiling website visitors

9.1 Objectives of the analysis

The aim of this case study is to analyse web access data to classify the visitors into homogeneous groups on the basis of their behaviour. This will lead us to identify typical visiting profiles. In other words, we are trying to match each visitor to a specific cluster, depending on their surfing habits on that particular site. This will give us a behavioural segmentation of the users that we can use in future marketing decisions. We can also monitor the evolution of the kind of ‘customer’ who comes to the site by looking at how the distribution of users in the different behavioural segments evolves over time. Then we can see how our business decisions affect the different segments, whether they increase visits by a particular segment or whether they decrease them. Examples of business decisions are changes to the webpages and the use of publicity. The data set in this chapter has also been analysed by Cadez et al. (2000) and Giudici and Castelo (2001); Giudici and Castelo take a Bayesian viewpoint.

9.2 Description of the data

The data set contains data on the pages of the website www.microsoft.com visited by 32 711 anonymous visitors. For each visitor, the pages of the site visited in the first week of February 1998 are recorded. Visitors are identified with an identification number (from 10 001 to 42 711) and no personal information is given. The total number of visited pages is 296. The pages are identified by a number that corresponds to a title and a corresponding address.
For example, number 1057 refers to the page ‘MS PowerPoint News’ of the PowerPoint group of pages. The numeric codes associated with the pages are integers that go from 1000 up to 1295. To give a better idea of the data set, here are its first few lines:

C, "10908", 10908
V, 1108
V, 1017
C, "10909", 10909
V, 1113
V, 1009
V, 1034
C, "10910", 10910
V, 1026
V, 1017

Applied Data Mining. Paolo Giudici © 2003 John Wiley & Sons, Ltd. ISBNs: 0-470-84679-8 (Paper); 0-470-84678-X (Cloth)

Each visitor is indicated by a line (beginning with the letter C) that identifies them with a code, given both as a quoted string and as a number. The visitor's line is then followed by one or more lines (beginning with the letter V) that show which pages are visited. The pages that are not visited do not appear. Unlike in Chapter 8, this time it is convenient to work with a derived data matrix, organised by visitors. This matrix will describe, for each visitor, how many times each page has been viewed. We will therefore have one categorical variable for each page. And also unlike Chapter 8, we count how many times a page is viewed, not just whether the page has been viewed at least once. As a result, we now have a discrete quantitative random variable, not a binary variable. There is a problem with setting up the data matrix. The number of variables in the data set corresponds to the number of webpages viewed, and at 296 it is too large. It is likely that many combinations of these variables will never arise, or will arise very rarely, so there will not be much statistical information. To perform a valid cluster analysis of visitors into groups, we need to clean and summarise the original file so we obtain a less complex data matrix. To do this, we group the webpages into 13 homogeneous categories, reflecting their logical meaning in the Microsoft website. The number of discrete variables is then reduced from 296 to 13.
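Parsing the C/V format shown above into a visitor-to-pages mapping can be sketched as follows (plain double quotes are assumed in place of the typographic quotes that appear in print):

```python
def parse_log(lines):
    """Parse the C/V attribute file: a 'C' line starts a visitor record,
    the following 'V' lines list the pages that visitor viewed."""
    visits, current = {}, None
    for line in lines:
        parts = [p.strip().strip('"\'') for p in line.split(",")]
        if parts[0] == "C":
            current = int(parts[-1])      # numeric visitor code
            visits[current] = []
        elif parts[0] == "V":
            visits[current].append(int(parts[1]))
    return visits

# the first lines of the file, as quoted in the text
raw = ['C, "10908", 10908', 'V, 1108', 'V, 1017',
       'C, "10909", 10909', 'V, 1113', 'V, 1009']
print(parse_log(raw))
# -> {10908: [1108, 1017], 10909: [1113, 1009]}
```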
Each variable corresponds to one of the 13 groups:

• Initial: this includes all the general access pages and all the pages dedicated to research.
• Support: this includes all the pages related to requests for help and support.
• Entertainment: this includes all the pages that refer to entertainment, games and cultural software.
• Office: this has all the pages which refer to the Office software.
• Windows: this groups together all the pages related to the Windows operating system.
• Othersoft: this refers to all the pages relating to software other than Office.
• Download: this includes all the pages regarding software downloading or updating.
• Otherint: this has all the pages dedicated to services through the internet for IT professionals; these pages are different from the download pages.
• Development: this has all the pages dedicated to professional developers (e.g. Java).
• Hardware: this includes the pages relating to Microsoft hardware.
• Business: this has pages dedicated to businesses.
• Info: this includes all the pages which give information about new products and services.
• Area: this has all the pages which refer to local access, depending on the specific language.

Using this grouping, we can derive a visitor data matrix with 32 711 rows and 13 columns. Table 9.1 shows part of it. Corresponding to each page group there is a discrete variable that shows the number of times each person has visited that specific group of pages. It does not include information on the order in which the pages are visited. We have already commented on this point in Chapter 8.

Table 9.1 Extract from the visitors data matrix.
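Deriving the visitor data matrix from such a grouping can be sketched as below. The page-code-to-category assignments here are invented for illustration (the real mapping covers all 296 codes), and the visits come from the file extract quoted earlier.

```python
from collections import Counter

# hypothetical assignment of page codes to the 13 categories
page_group = {1008: "Download", 1009: "Initial", 1017: "Initial",
              1026: "Support", 1034: "Windows", 1108: "Initial",
              1113: "Office"}

def visitor_matrix(visits, groups):
    """Count, per visitor, how many viewed pages fall in each category;
    the temporal order of the clicks is deliberately discarded."""
    return {visitor: Counter(groups[p] for p in pages)
            for visitor, pages in visits.items()}

visits = {10908: [1108, 1017], 10909: [1113, 1009, 1034], 10910: [1026, 1017]}
m = visitor_matrix(visits, page_group)
print(m[10908]["Initial"])   # -> 2 (both pages of visitor 10908 are Initial)
```

Each Counter row corresponds to one row of the 32 711 × 13 matrix; absent categories implicitly count as zero.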
To achieve our initial aims, we carry out the analysis in three stages: (1) an exploratory phase looks at preliminary considerations; (2) an analysis phase determines the behavioural classes of users by applying cluster analysis and Kohonen maps; (3) a comparison phase evaluates the performance of the two descriptive models in terms of grouping and predictive ability.

Figure 9.1 Frequency of visits to the webpages of the website being analysed.

9.3 Exploratory analysis

The exploratory analysis reveals a high level of dispersion with respect to the pages visited. Figure 9.1 shows the absolute frequency distribution of visits to each page. Notice that the pages coded with the lowest numbers (from 1000 onwards) have the highest frequencies, and there appears to be a decreasing pattern as higher-numbered pages are considered. From a preliminary examination of the data it turns out that the most frequently visited page is page number 1008, ‘Free Downloads’. We have also discovered that each visitor looks, on average, at 4 pages, as this is the sample mean. However, the mode is only 2, reflecting a positively skewed distribution of the variable ‘number of visited pages’. Given the size of the available data, the number of pages to be considered (296) is too large for a valid categorical data analysis. If all variables were considered, there would be too many parameters to be estimated, with a consequent loss in statistical efficiency of the procedure. It becomes necessary to have a preliminary transformation of the data, and in particular to group the pages together, as we discussed in Section 9.2.

9.4 Model building

The first part of the analysis aims to identify the different behavioural segments within the sample of users. We use two different descriptive data mining techniques: cluster analysis and the unsupervised networks known as Kohonen maps.
Both techniques allow us to partition the data to identify homogeneous groups or types possessing internal cohesion that differentiates them from the other groups. We use two techniques so we can compare their efficiency, but also to check that they produce consistent results. We will use the results of the cluster analysis to help us determine the optimal number of clusters for the Kohonen map implementation.

9.4.1 Cluster analysis

Chapter 4 explained the main techniques of hierarchical cluster analysis as well as the non-hierarchical K-means method. A variation of the K-means method is used in the SAS procedure fastclus that we will employ. The basic idea is to introduce seeds, or centroids, to which statistical units may be attracted, forming a cluster. It is important to specify the maximum number of clusters, say G, in advance. As discussed in Section 4.2, hierarchical and non-hierarchical methods of cluster analysis do have some disadvantages. Hierarchical cluster analysis does not need to know the number of clusters in advance but it may require too much computing power. For moderately large quantities of data, as in this case study, the calculations may take a long time. Non-hierarchical methods are fast, but they require us to choose the number of clusters in advance. To avoid these disadvantages and to try to exploit the potential of both methods, we can adopt two possible approaches. We can extract from the data a sample of limited size, then carry out a hierarchical cluster analysis to determine G, the optimal number of clusters. Once we have a value for G we take the G means of the clusters as our seeds; then we continue with a non-hierarchical analysis on the whole data set, using a number of clusters equal to G and allocating each observation to one of them.
Alternatively, we can work on the whole data set, carrying out a non-hierarchical analysis with a large value of G. We then consider a new data set, made up from the G group means, each endowed with two measurements, one indicating the cluster size and one the dispersion within the cluster. A hierarchical analysis is then carried out on this data set to see whether any groups can be merged. It is essential to indicate the frequency and the dispersion of each cluster, otherwise the analysis will not take account of clusters having different sizes and variabilities. The Clustering node of Enterprise Miner implements a mixture of both approaches in a three-stage procedure. Initially a non-hierarchical clustering procedure is run on the whole data set, having chosen a large value of G. Seeds are initially set as the first G available observations. Then an iterative procedure is run; at each step of the procedure, temporary clusters are formed, allocating each observation to the cluster with the seed nearest to it. Each time an observation is allocated to a cluster, the seed is substituted with the mean of the cluster, called the centroid. The process is repeated until convergence is achieved, namely, until there are no substantial changes in the cluster seeds. At the end of the procedure, a total of G clusters are available, with corresponding cluster centroids. This is the input of the second stage. In the second stage a hierarchical clustering method is run on a sample of the data to find the optimal number of clusters. As the number of clusters cannot be greater than G, the procedure is agglomerative, starting at G and working downwards. The previous cluster means are used as seeds, and a non-hierarchical procedure is run to allocate the observations to the clusters. A peculiar aspect of this stage is that the optimal number of clusters is chosen with respect to a test statistic, a function of the R2 index known as the cubic clustering criterion (CCC).
As discussed in Section 4.2, it may be too restrictive to assume a Gaussian distribution for the observations to be clustered. But in order to derive a statistical test, we need to make some kind of assumption. Suppose that we want to verify the significance of a number of clusters equal to G. A rather general assumption is that, under the null hypothesis H0, the observations are distributed uniformly over a hypercube with dimension equal to the number of variables, each cube representing a cluster, adjacent to the others. Under the alternative hypothesis, clusters are distributed as a mixture of multivariate Gaussian distributions, centred at the mean of each cluster, and with equal variances. The cubic clustering criterion is a function of the ratio between the observed R2 and the expected R2 under the null hypothesis. From empirical Monte Carlo studies, it turns out that a value of CCC greater than 2 represents sufficient evidence against the null hypothesis and, therefore, for the validity of the chosen G clusters. Although it is approximate, the criterion tends to be conservative – it may have a bias towards a low number of clusters. Once the optimal number of clusters has been chosen, the algorithm proceeds with a non-hierarchical clustering to allocate the observations into the G chosen groups, whose initial seeds are the centroids obtained in the previous step. In this way, we obtain a final configuration of the observations. The procedure is similar to the first non-hierarchical stage, and can be summarised as follows.

Repeat the following two steps until convergence:
(1) scan the data and assign each observation to the nearest seed (nearest using the Euclidean distance);
(2) replace each seed with the mean of the observations assigned to its cluster.

Here we choose G = 40 for the first non-hierarchical stage.
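A simplified version of the three-stage procedure can be sketched with a plain k-means pass followed by Ward clustering of the centroids. Unlike the Enterprise Miner implementation, this sketch ignores the cluster size and dispersion weights and fixes the number of final groups instead of choosing it with the CCC; the synthetic data are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def kmeans(X, G, n_iter=30, seed=0):
    """Plain k-means (stage 1): G seeds drawn from the data, then iterative
    reassignment and centroid updates."""
    rng = np.random.default_rng(seed)
    seeds = X[rng.choice(len(X), size=G, replace=False)]
    for _ in range(n_iter):
        lab = ((X[:, None, :] - seeds[None, :, :]) ** 2).sum(-1).argmin(1)
        seeds = np.array([X[lab == g].mean(0) if np.any(lab == g) else seeds[g]
                          for g in range(G)])
    return seeds, lab

# synthetic visitor-like data: three well-separated behaviours
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.5, (150, 4)) for m in (0.0, 3.0, 6.0)])

# stage 1: non-hierarchical pass with a deliberately large G
centroids, lab = kmeans(X, G=12)

# stage 2: Ward hierarchical clustering on the centroids to merge groups
# (here we simply ask for 3 final groups instead of applying the CCC)
merged = fcluster(linkage(centroids, method="ward"), t=3, criterion="maxclust")

# stage 3: final allocation of every observation via its stage-1 cluster
labels = merged[lab]
print(len(set(labels.tolist())))
```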
The hierarchical stage is then carried out on a sample of 2000 observations from the available data, obtained by sampling randomly with replacement. The distance function is the Euclidean distance, and Ward's method is used to recompute the distances as the clusters are formed. To choose the number of clusters, we set the CCC threshold at 3 (rather conservative) and the maximum number of clusters at 40. To obtain valid cluster means for use as seeds in the third stage, we impose a minimum of 100 observations in each cluster. Table 9.2 is the dendrogram for the agglomerative clustering procedure, expressed in tabular form. Each of the 40 observations in fact corresponds to a cluster, as formed in the previous non-hierarchical stage. It is quite difficult to choose the optimal number of clusters on the basis of R2 (RSQ) or the semipartial R2 (SPRSQ); a threshold criterion is definitely required. Figure 9.2 shows the behaviour of CCC as the number of clusters increases from 2 to 40. Notice that CCC reaches a threshold value of 3 when there are 10 groups. And in Table 9.2, when moving from 10 groups to 9, there is a considerable increase in SPRSQ from 0.0163 to 0.0165 and a corresponding reduction in RSQ from 0.440 to 0.424. A final non-hierarchical procedure was therefore run on the whole data set using 10 clusters. The results of the final procedure are shown in Table 9.3. For each of the 10 clusters, it gives the total number of observations in the cluster (the frequency) and a measure of internal cohesion (the root mean square standard deviation within the cluster). It also gives the maximum distance from the cluster seed as a further measure of internal cohesion, and the distance from the cluster seed to the nearest other cluster seed. We have R2 = 0.45 for the final configuration, which can be treated as a summary evaluation measure of the model. To better interpret the cluster configurations, Table 9.4 gives the means of each cluster.
Notice that clusters 1 and 3 have similar centroids, expressed by a similar mean number of visits to each page. On the other hand, clusters 4 and 9 appear to have rather different behaviours, concentrated mainly on one page (respectively, development and initial). Before interpreting the final profiles in Table 9.4, we will have a look at Kohonen maps.

Table 9.2 Dendrogram of the first-stage hierarchical cluster analysis.

NCL  Clusters Joined   FREQ   SPRSQ    RSQ
 39   OB15   OB18         2   0.0012   .626
 38   OB17   OB34         2   0.0013   .625
 37   OB16   OB40         4   0.0016   .623
 36   OB5    CL39         3   0.0017   .622
 35   OB9    OB14         5   0.0019   .620
 34   OB19   OB35         6   0.0020   .618
 33   OB7    CL37        10   0.0020   .616
 32   OB20   OB33         2   0.0021   .614
 31   OB28   OB39         8   0.0025   .611
 30   OB13   CL34        22   0.0029   .608
 29   OB24   CL31         9   0.0031   .605
 28   OB11   OB23        29   0.0033   .602
 27   OB10   OB21        11   0.0033   .599
 26   CL35   CL38         7   0.0034   .595
 25   CL26   OB31        18   0.0040   .591
 24   CL36   CL29        12   0.0041   .587
 23   OB2    OB12        51   0.0046   .583
 22   CL33   CL32        12   0.0064   .576
 21   OB29   OB32        57   0.0068   .569
 20   CL25   CL30        40   0.0069   .562
 19   OB4    OB8         80   0.0081   .554
 18   OB30   OB36       204   0.0082   .546
 17   CL24   CL27        23   0.0084   .538
 16   OB3    OB27       159   0.0091   .529
 15   OB6    OB37       174   0.0130   .516
 14   CL28   CL21        86   0.0130   .503
 13   CL23   CL20        91   0.0146   .488
 12   CL22   CL14        98   0.0154   .473
 11   OB1    OB22       971   0.0158   .457
 10   OB25   OB38       156   0.0163   .440
  9   CL11   CL10      1127   0.0165   .424
  8   CL19   CL17       103   0.0172   .407
  7   CL9    OB26      1171   0.0239   .383
  6   CL13   CL16       250   0.0263   .357
  5   CL7    CL18      1375   0.0269   .330
  4   CL6    CL15       424   0.0380   .292
  3   CL8    CL12       201   0.0653   .226
  2   CL4    CL3        625   0.0834   .143
  1   CL5    CL2       2000   0.1429   .000

Figure 9.2 Choice of the optimum number of clusters according to CCC.

Table 9.3 Results from the final K-means cluster configuration.

9.4.2 Kohonen maps

Kohonen maps require us to specify the number of rows and the number of columns in the grid space characterising the map. Large maps are usually the best choice, as long as each cluster has a significant number of observations.
The learning time increases significantly with the size of the map. The number of rows and the number of columns are usually established by conducting several trials until a satisfactory result is obtained. We will use the results of the cluster analysis to help us. This makes sense in terms of optimality, and it will also be useful when we compare the two techniques later on. Having identified 10 as the optimal number of clusters, we will consider a 5 × 2 map. The Kohonen mapping algorithm in SAS Enterprise Miner essentially replaces the third step of the clustering algorithm with the following iterative procedure. Repeat the following two steps until convergence: (1) scan the data and assign each observation to the nearest seed (nearest in terms of the Euclidean distance); (2) replace each seed with a weighted mean of the cluster means for the clusters that lie in the grid neighbourhood of the seed's cluster, where the weights correspond to the frequencies of each cluster. In this way the cluster configuration is such that any two clusters which are close to each other in the map grid will have centroids close to each other. The initial choice of the seeds can be made in different ways; we will choose them randomly. Alternatively, we could have used the centroids obtained from the second stage of the K-means clustering procedure. Table 9.5 shows the summary statistics for the final 5 × 2 configuration. For each of the 10 map clusters, it gives the total number of observations in the cluster (the frequency) and the root mean square standard deviation within the cluster, as a measure of internal cohesion. It also gives the distance from the cluster seed to the nearest cluster seed. Table 9.5 should be compared with Table 9.3, which gives similar information for the K-means procedure.

Table 9.4 The K-means cluster means.

Table 9.5 Results from the final Kohonen map configuration.
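The two-step procedure just described can be sketched in a few lines. This is a simplified illustration on simulated data with a fixed grid neighbourhood (the cell itself plus its adjacent cells); the SAS Enterprise Miner implementation differs in details such as shrinking the neighbourhood over time:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))   # toy visitor-by-page matrix (simulated)

rows, cols = 5, 2               # the 5 x 2 grid used in the text
grid = np.array([(r, c) for r in range(rows) for c in range(cols)])
# random initial seeds, as chosen in the text
seeds = X[rng.choice(len(X), rows * cols, replace=False)].copy()

for _ in range(20):             # iterate steps (1) and (2) until stable
    # step (1): assign each observation to the nearest seed (Euclidean)
    d = ((X[:, None, :] - seeds[None, :, :]) ** 2).sum(axis=2)
    labels = d.argmin(axis=1)
    freq = np.array([(labels == k).sum() for k in range(len(seeds))])
    means = np.array([X[labels == k].mean(axis=0) if freq[k] else seeds[k]
                      for k in range(len(seeds))])
    # step (2): replace each seed by a frequency-weighted mean of the
    # cluster means whose grid cells are neighbours of the seed's cell
    new_seeds = seeds.copy()
    for k in range(len(seeds)):
        nb = np.abs(grid - grid[k]).sum(axis=1) <= 1
        w = freq[nb]
        if w.sum():
            new_seeds[k] = (w[:, None] * means[nb]).sum(axis=0) / w.sum()
    seeds = new_seeds
```

Because each seed is smoothed towards its grid neighbours, clusters adjacent on the map end up with nearby centroids, which is the property exploited in the comparison below.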
Notice that the groups in Table 9.5 are more homogeneous in their numbers of observations. In Table 9.3 there is one large cluster (cluster 3), but now most clusters have a similar size. We also have R2 = 0.51, which is 0.06 higher than we obtained using the K-means procedure. To better interpret the cluster configurations, Table 9.6 gives the centroids of each cluster in the final Kohonen configuration; it confirms the findings in Table 9.4. For instance, clusters 1 and 6 seem to describe visitors who visit the initial pages a high number of times, whereas cluster 5 describes visitors who often click on the development pages. These behaviours correspond, respectively, to clusters 4, 9 and 4, 5 in Table 9.4. Notice there is a degree of overlapping: two kinds of behaviour appear in the same cluster, number 4.

Table 9.6 The Kohonen map cluster means.

9.5 Model comparison

An important consideration concerns the economic value of the results. Broadly speaking, a visitor profile is better if the cluster profiles are more distinct and if their separation reflects a truly distinct behaviour rather than being due to randomness. We have already commented that the Kohonen map seems to do this better. It achieves it by exploiting the dependence between adjacent clusters. A second consideration is that the statistical evaluation of the results should be based mainly on R2, or measures derived from it, because this is a descriptive analysis. We have already seen that the overall R2 is larger with the Kohonen networks. It is interesting to examine, for each variable (page), the ratio of the between-group sum of squares to the total sum of squares that leads to R2. This gives a measure of the goodness of the cluster representation that is specific to each variable. By examining all such R2 we can get an overall picture of which aspects of the observations are used most in the clustering process.
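The variable-specific R2 just described is simple to compute: for each column, take the between-group sum of squares over the total sum of squares; summing numerators and denominators across columns gives the overall R2. A minimal sketch on simulated data:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 4))           # toy visitor-by-page matrix (simulated)
labels = rng.integers(0, 3, size=300)   # a cluster allocation of the 300 rows

def decompose(X, labels):
    """Between-group and total sums of squares, column by column."""
    grand = X.mean(axis=0)
    total = ((X - grand) ** 2).sum(axis=0)
    between = np.zeros(X.shape[1])
    for k in np.unique(labels):
        Xk = X[labels == k]
        between += len(Xk) * (Xk.mean(axis=0) - grand) ** 2
    return between, total

between, total = decompose(X, labels)
r2_per_variable = between / total          # one R2 per page variable (Table 9.7)
overall_r2 = between.sum() / total.sum()   # the overall R2 of the configuration
```

With a purely random allocation, as here, all the ratios are close to zero; a good clustering drives the ratios towards 1 for the variables that discriminate between the groups.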
Table 9.7 presents all the variable-specific R2 and the overall R2 for the K-means procedure and the Kohonen procedure. For both procedures the group pages that have a high R2, and are therefore most influential in determining the final results, are initial, support, office and development. There are also pages that are influential only for the K-means procedure: othersoft and hardware. And there are pages that are influential only for the Kohonen procedure: windows, download and area. The choice between the two procedures therefore depends on which pages to choose as discriminant for the behaviour. In the absence of other considerations, the choice should favour the procedure which leads to the highest overall R2, here the Kohonen map.

Table 9.7 Comparison of the variable-specific R2.

Page            R2 for K-means   R2 for Kohonen map
initial              0.60             0.65
support              0.68             0.66
entertainment        0.01             0.02
office               0.44             0.70
windows              0.18             0.56
othersoft            0.47             0.07
download             0.04             0.43
otherint             0.21             0.15
development          0.79             0.64
hardware             0.49             0.03
business             0.02             0.02
infor                0.05             0.05
area                 0.01             0.56
Overall              0.45             0.51

Further considerations arise when the results are used to make predictions. For instance, suppose that once the grouping has been accomplished, we receive some new observations to be classified into clusters. One reasonable way to proceed is to assign them to one of the clusters previously determined, according to a discriminant rule. Alternatively, the analysis could be redone on the old and new observations together. The first choice is more valid from a decision-making viewpoint. In that case, clustering methods can be compared in terms of predictive performance, perhaps by using cross-validation techniques. We begin by creating two data sets. We take the original data set and append one column of values, corresponding to the categorical variable that assigns each observation to a cluster (namely, the allocation variable).
The allocation variable obviously differs between the K-means analysis and the Kohonen analysis, hence we create two different data sets. To compare the two clustering procedures, we then split the two data sets into a training sample with 75% of the observations and a validation sample with the remaining 25%. A classification tree procedure is then run on both training data sets, with the allocation variable as the target response variable. We will take entropy as the impurity measure. We can then compare the predictive performance of the tree model on the two data sets, in terms of misclassification errors. This is possible because we know, from the cluster analysis, the actual allocation of each observation in the validation sample. Note that the two data sets differ only in the target variable, and the tree model applied is the same. The difference in the misclassification rates therefore measures the performance of the two clustering methods in terms of predictive ability. For each cluster and for both methods, Table 9.8 shows the proportion of observations classified correctly; we can obtain the misclassification rates by subtracting these values from 100%. Once again the Kohonen map performs better on this data set, as in 8 of the 10 cases it leads to a lower misclassification error.

Table 9.8 Comparison of the predictive performances between the K-means method and the Kohonen map network.

Actual    Predicted   Correctly classified   Correctly classified
cluster   cluster     by K-means (%)         by Kohonen map (%)
  1          1               92                     86
  2          2               98                     99
  3          3               99                     97
  4          4               73                     96
  5          5               82                     99
  6          6               57                     69
  7          7               92                     98
  8          8               93                     97
  9          9               80                     86
 10         10               58                     92

Therefore, from all these viewpoints, Kohonen maps are to be preferred for this problem. We now proceed with a more detailed interpretation of their results. Table 9.6 characterises each cluster in terms of the most clicked pages. But this interpretation can be improved by producing a graphical representation of the same information.
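The per-cluster accuracies reported in Table 9.8 are straightforward to compute once the tree's predictions on the validation sample are available. A minimal sketch, on simulated stand-ins for the actual allocations and the tree predictions (hypothetical values, not the case-study data):

```python
import numpy as np

rng = np.random.default_rng(4)
# toy validation sample: actual cluster allocations and tree predictions
actual = rng.integers(1, 11, size=500)
predicted = np.where(rng.random(500) < 0.9, actual, 1)   # roughly 90% correct

# percentage correctly classified, cluster by cluster, as in Table 9.8
for k in range(1, 11):
    mask = actual == k
    pct = 100.0 * (predicted[mask] == actual[mask]).mean()
    print(k, round(pct, 1))
```

Subtracting each percentage from 100 gives the per-cluster misclassification rate used in the comparison.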
More precisely, for each cluster we compare graphically the overall mean of each page variable with that found in the different visitor segments. For our study we use the mean number of visits made to the 13 areas for each behavioural profile. By comparing the 13 mean values with those for the whole population (as if there were just one cluster), we can suggest an interpretation of the profile. To make interpretation easier, we will make the comparison with respect to the normalised mean (the mean divided by the maximum value over the 13 group pages), which varies between 0 and 1. For each behavioural segment, we shall use a graph that has the normalised mean on the horizontal axis and the 13 webpages on the vertical axis, plotted in decreasing order of their normalised mean. There is insufficient space to show all the clusters, but we do illustrate a few typical ones. First of all we have clusters that represent monothematic behaviours, people who visit mostly one specific area of the website. This behaviour occurs for the visitors in cluster 7, and Figure 9.3 shows that it describes visitors who mostly go to help. In other words, this group of users visits the site to ask for help about how to use the different products. The visitors in clusters 10, 9, 8 and 4 also exhibit monothematic behaviours; they correspond to visitors who go mainly to area, windows, download and office, respectively.

Figure 9.3 Interpretation of cluster 7.

A second type of behaviour is polythematic behaviour, illustrated by cluster 6. Figure 9.4 shows that these visitors exploit all areas of the site. They are curious visitors who surf the pages reserved for hardware, the pages reserved for businesses and the pages which give information about new products and services. They are also interested in areas connected to technical help, entertainment, downloads and Office products.

Figure 9.4 Interpretation of cluster 6.
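The normalised means plotted in these figures are computed by dividing each cluster-specific mean by the largest of the group means, so that the most visited area scores 1. A minimal sketch with three toy areas standing in for the 13:

```python
import numpy as np

pages = ["help", "office", "download"]      # stand-ins for the 13 page areas
cluster_means = np.array([6.0, 1.0, 2.0])   # mean visits within one cluster (toy)

# normalise by the largest group mean, so the values lie in [0, 1]
norm = cluster_means / cluster_means.max()
for i in np.argsort(-norm):                 # decreasing order, as in the figures
    print(f"{pages[i]:10s}{norm[i]:.2f}")
```

Here the output would rank help first with a normalised mean of 1.00, the pattern that identifies a monothematic profile such as cluster 7.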
A similar behaviour is seen in cluster 1, which is perhaps less business-oriented than cluster 6. A third type of behaviour is intermediate between the previous two, but it can also be interpreted as profiling specific categories of visitor. An example is cluster 5. Figure 9.5 shows that this cluster can represent software developers, namely IT professionals in the field of programming and the development of software solutions. They often visit pages on development, specialised software, hardware and downloads. And they are probably involved in the care and development of their company's computing power, hence the high number of visits to the business area.

Figure 9.5 Interpretation of cluster 5.

Other intermediate behaviours are exhibited by clusters 2 and 3. Cluster 2 can represent workers who do their duty, people who use the site mostly for business and initial. In other words, it is the profile of people who use the site for work reasons without being distracted by other things. Cluster 3 can represent workers in pause, people who mainly visit entertainment; they also go to pages reserved for businesses and other interests pages. In other words, it is the profile of people visiting the site for work reasons but who are also taking a break.

9.6 Summary report

• Context: this case study concerns customer profiling on the basis of web behaviour. The context is very broad, as the analysis refers to any type of problem involved with classifying people, companies or any other statistical units into homogeneous groups.
• Objectives: the aim of the analysis is to classify customers into an unknown number of homogeneous classes that will then be interpreted on the basis of their statistical characteristics, such as mean values of the variables employed. The classification is unsupervised: there is no target variable, and all the available information should be used to form homogeneous groups, or clusters.
• Organisation of the data: in this case study the data was derived from the log file that registers access to a website. Consequently, there is a data matrix that records for each visitor the number of times they have viewed a collection of pages. For computational tractability, the pages were grouped into 13 web areas, homogeneous in terms of their content. Therefore the data matrix considered in the analysis contains 32 711 rows (visitors) and 13 columns (one counting variable for each area).
• Exploratory data analysis: this phase of the analysis revealed a high level of dispersion with respect to the pages visited. Each visitor looks, on average, at 4 pages, and this confirms the validity of grouping the 104 visited pages into 13 areas.
• Model specification: the analysis objectives suggested a descriptive model that would group observations into homogeneous classes. Given the size of the data set, we considered non-hierarchical cluster analysis models based on the K-means algorithm and Kohonen maps. To compare the two approaches fairly, we considered a 5 × 2 Kohonen map, which corresponds to the same number of clusters (10) obtained with the K-means algorithm.
• Model comparison: models were first compared by splitting the total variability into within-group variability and between-group variability, leading to the calculation of the overall R2 and the R2 for specific area variables. The result of the comparison favours Kohonen maps, which also have the advantage that the groups obtained tend to be more distinct than the groups from K-means clustering. We then compared the models in terms of their predictive ability. We did this by using the clustering variable as a 'target' variable, fitting a classification tree and following a cross-validation approach. Yet again the results favour Kohonen maps.
• Model interpretation: the interpretation of the results should be based on the obtained cluster profiles.
For the Kohonen map, which performed better, we interpreted each cluster profile by looking at the comparison between the overall mean and the cluster-specific mean of each of the 13 variables. This allowed us to identify each cluster profile. Expert knowledge is needed to elucidate the business meaning of each profile.

CHAPTER 10 Customer relationship management

10.1 Objectives of the analysis

This case study looks at statistical aspects of customer relationship management (CRM). In this context a company's primary objective is to encourage customer loyalty, in order to obtain as much value as possible from its customers. The necessity of having loyal customers motivates companies to get to know them well. One way to do this is through valid management and processing of the customer database. Data mining methods represent a valid approach to extracting precious information from such a database, and the information is then used to manage relations with existing and prospective clients. Companies increasingly personalise their services to suit each client. The data in this case study is from a company that foresaw this potential and has long been using statistically driven CRM to help manage its sales network. I cannot mention its name, but it sells mail-order merchandise in Italy. The objective is to study the buying behaviour of the company's clients and, in particular, to understand from the outset which factors might create an occasional buyer or a loyal shopper. This may indicate at an early stage which customers will be really profitable and where to concentrate any marketing efforts. In data mining terms, we are concerned with a problem of predictive classification. Another common data mining problem, churn analysis, can be seen as somewhat similar to loyalty analysis, as churners can be identified as disloyal clients. So this case study can also help with churn models.
10.2 Description of the data

Although schematic and concise, the analysis here is a complete representation of the data mining process, from the construction of the data matrix to the communication of the final results. It can therefore give an idea of what is involved in setting up the whole data mining process (Section 1.2).

Applied Data Mining. Paolo Giudici, 2003 John Wiley & Sons, Ltd. ISBNs: 0-470-84679-8 (Paper); 0-470-84678-X (Cloth)

Information on the reference population, the company's current customers, is distributed across three distinct databases, containing the list of customers and their characteristics; the list of orders collected by the local agencies, later transmitted to the company; and the list of buying orders transmitted by the agencies to the company. All three databases contain variables on the customers, mainly socio-demographic and behavioural variables. These variables refer to the ways in which the first commercial contact was established (e.g. the number of products bought and the method of payment). To achieve the goals of the analysis, it is useful to analyse a homogeneous cohort of consumers, that is, to analyse the behaviour over time of people whose first contact with the company occurred at roughly the same time, the entry period. This eliminates possible bias effects due to structural changes in the economy or in the structure of the company. We first consider all clients that entered the customer database between 1992 and 1996; the total number is very large, equal to 210 085. It would be very expensive and time-consuming to analyse the whole data set, so we will take a stratified sample and analyse that. We will take the same number of clients from each time slot over the whole entry period; this sample contains a total of 2470 customers. Generally it is not necessary to sample the data; the main reason for doing it here is the low quality of the available data.
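The stratified sampling scheme just described — draw the same number of clients from each entry-period stratum — can be sketched as follows. The record layout here is a hypothetical stand-in, since the company's data is not available:

```python
import random

random.seed(0)
# toy customer records: (customer_id, entry_year) — hypothetical stand-ins
customers = [(i, 1992 + i % 5) for i in range(1000)]

def stratified_sample(records, per_stratum):
    """Draw the same number of clients from each entry-year stratum."""
    by_year = {}
    for rec in records:
        by_year.setdefault(rec[1], []).append(rec)
    sample = []
    for year in sorted(by_year):
        sample.extend(random.sample(by_year[year], per_stratum))
    return sample

sample = stratified_sample(customers, 10)   # 5 strata x 10 = 50 clients
```

Equal allocation per stratum, as used in the text, guarantees that every entry cohort is represented with the same weight in the analysis sample.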
Finally, as the data is spread across three databases, we need to construct a marketing database (datamart) that organises all the information we require; this is not a simple procedure. The input data sets contain data collected for operational purposes, and the records differ in their type and structure. We need to obtain a coherent and informative data set by eliminating a great deal of information. The end result is a data matrix with one row for each customer and one column for each customer characteristic (statistical variable). After a long process of database management, we obtain the variables in Table 10.1.

Table 10.1 The available customer variables.

• Marketing status
• Dimension of the shop
• Whether the client is active
• Age
• Whether the client is in a debt position
• Area of residence
• Total number of orders
• Sex
• Date of first order
• Whether first payment is with instalments
• Date of last order
• First amount spent
• Total amount ordered
• Number of products at first order
• Total amount paid
• Current balance
• Whether payments have been delayed
• Time lag between first and second order
• Amount of current instalment
• Residual number of instalments

10.3 Exploratory data analysis

Before starting the actual data analysis, we need to identify a response variable; we need to define the explanatory variables and suggest possible transformations. The main objective is to classify the customers in two categories: those that place only one order, and those that make further orders. This binary variable, indicated by Y, can be deduced from the 'total number of orders' variable in Table 10.1. We shall set Y = 0 when the number of orders is equal to one and Y = 1 if the number of orders is greater than one. In the stated objectives, the two levels of the response variable correspond to consumers deemed disloyal (Y = 0) and loyal (Y = 1).
Therefore a customer is deemed loyal if they place at least two orders with the company. Table 10.2 shows the distribution of this response variable for our sample. We could have chosen other variables to represent Y, such as the total expenditure or the number of items bought. Notice that the sample is well divided between the two categories. There are also 19 observations for which the value of the response variable is missing; we will ignore these observations. Consider now the choice of explanatory variables. We want variables that will help us with predictive classification. Intuitively it seems important to consider variables that concern the first order, and that describe how the first contact with the company was established, as well as the available socio-demographic variables on the customers (age, sex, area of residence) and the dimension of the corresponding agency (see later). These are the variables on the right-hand side of Table 10.1. A few data items are missing for the explanatory variables, but as there are only a few, we can substitute a location measure of the distribution of the valid data: the mean for continuous variables, the median for ordinal variables and the mode for nominal variables. Table 10.3 shows the conditional distribution of the response variable given the socio-demographic variables. We can draw the following conclusions:

• Sex: at first sight, it does not seem very influential on the response variable, as there is no substantial difference in the distribution for males and females.
• Area of residence: see how the conditional probability of Y = 1 decreases when the area changes from North to Centre and from Centre to South (of Italy). This seems to be a predictive variable. South includes the Italian islands, and sometimes we will call it 'south and islands'.
• Age: the odds for Y = 1 are close to 1 for the oldest group, but noticeably lower than 1 for the other two classes, and especially for the youngest group. This variable may be a relevant predictor, as the probability of Y = 1 increases noticeably with age.
• Dimension of the agency: this represents the only information we can use to reconstruct the location of the agency, an important variable that we do not have. Dimension of the agency subdivides the agencies into three classes on the basis of the number of clients served: if the number is less than 15, the agency is considered small; if the number is between 15 and 30, it is considered medium; if it is greater than 30 (up to a maximum of 60), it is considered large. In Table 10.3 notice how the conditional probability of Y = 1 gets lower for large agencies. Unlike the previous variables, the effect on the response variable is not monotone with respect to the order of the explanatory variable, as medium agencies appear to show the highest conditional probability.

Table 10.2 Distribution of the response variable.

Modality   Absolute frequency   Relative frequency (%)
Y = 0            1457                  59.71
Y = 1            1013                  40.29

Table 10.3 Conditional distribution of the response variable on the socio-demographic explanatory variables.

Sex         Y = 0     Y = 1
Female     61.04%    38.96%
Male       57.88%    42.12%

Area        Y = 0     Y = 1
North      55.40%    44.60%
Centre     58.22%    41.78%
South      62.73%    37.27%

Age         Y = 0     Y = 1
15–35      68.80%    31.20%
36–50      53.44%    46.56%
51–89      60.42%    39.58%

Dimension   Y = 0     Y = 1
Small      60.39%    39.61%
Medium     56.95%    43.05%
Large      62.11%    37.89%

Table 10.4 Contingency table classifying the response variable and the instalment variable. Each cell reports the frequency, the overall percentage, the row percentage and the column percentage.

Y          Instalment = 0            Instalment = 1            Total
0          1239                      218                       1457
           50.16 / 85.04 / 68.04     8.83 / 14.96 / 33.59      58.99
1          582                       431                       1013
           23.56 / 57.45 / 31.96     17.45 / 42.55 / 66.41     41.01
Total      1821                      649                       2470
           73.72                     26.28                     100.00
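The location-based imputation mentioned above — mean for continuous, median for ordinal, mode for nominal variables — can be sketched in a few lines (a minimal illustration; the values are toy data):

```python
from statistics import mean, median, mode

def impute(values, kind):
    """Replace missing items (None) by a location measure of the valid data:
    mean (continuous), median (ordinal) or mode (nominal) variables."""
    valid = [v for v in values if v is not None]
    fill = {"mean": mean, "median": median, "mode": mode}[kind](valid)
    return [fill if v is None else v for v in values]

print(impute([1.0, None, 3.0], "mean"))        # -> [1.0, 2.0, 3.0]
print(impute(["N", None, "N", "S"], "mode"))   # -> ['N', 'N', 'N', 'S']
```

This simple scheme is reasonable here only because very few items are missing; with substantial missingness it would distort the variable distributions.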
Besides socio-demographic variables, we also have behavioural variables that refer to the customer's first order:

• Instalment: a binary variable that indicates whether the first purchase is paid for in instalments (level 1) or not (level 0). It indicates the length of the relationship between the customer and the company: if a person pays in instalments, the contact with the company will tend to be longer. Table 10.4 shows the contingency table that cross-classifies the response variable with the modalities of instalment. Notice the positive association of this variable with Y, as the odds ratio is about 4.20.
• First amount spent and number of products at first order (numb): these two quantitative variables seem particularly informative about the behaviour of the client at the first commercial contact. The first amount spent is expressed in Italian lire (€1 = L1936.27). Figure 10.1 shows the boxplots for these two variables. If the two resulting boxplots differ markedly in position (e.g. in terms of the median), the corresponding variable can be deemed relevant. The amount spent seems to be relevant, but the number of products bought does not. And there appear to be outliers in the right tail of the distribution. We proceed by removing observations above the 99th percentile of the two variables.

Figure 10.1 Conditional distribution of (a) the amount spent and (b) the number of products with respect to the levels of Y.

We will not transform the two quantitative variables; we will leave them as they are. But to help with interpretation, we will binarise the qualitative variables age, area and dimension of the agency, all of which have three levels. This gives a total of nine binary variables. In fitting linear models, to avoid perfect collinearity, it is necessary to remove either the intercept or one binary variable for each of the three variables. We will keep the intercept.
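The odds ratio quoted for the instalment variable can be checked directly from the cell counts of Table 10.4, as the cross-product ratio of the 2 × 2 table:

```python
# cell counts from Table 10.4 (rows: Y = 0, 1; columns: instalment = 0, 1)
table = [[1239, 218],
         [582, 431]]

# odds ratio: cross-product ratio of the 2 x 2 contingency table
odds_ratio = (table[0][0] * table[1][1]) / (table[0][1] * table[1][0])
print(round(odds_ratio, 2))   # -> 4.21, the "about 4.20" quoted in the text
```

An odds ratio this far above 1 indicates a strong positive marginal association between paying in instalments and reordering.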
Table 10.5 shows an extract of the available data matrix, summarising the variables we will use in the analysis.

Table 10.5 The considered data matrix.

10.4 Model building

We will employ different types of model, and some of them we will not be able to compare using statistical techniques. In Section 10.5 we will therefore rely on cross-validation.

10.4.1 Logistic regression models

Having selected our explanatory variables, we need to find the ones that can effectively predict the response variable. The first model we consider is the logistic regression model. In order to choose a model, we follow a stepwise procedure, based on the G2 deviance difference, with a level of significance equal to 0.05. Table 10.6 presents the results obtained from the stepwise procedure, namely the selected model along with the corresponding parameter estimates and estimated odds ratios. Only three of the seven available variables significantly affect Y: the binary instalment variable, with a strong positive association measured by an odds ratio of about 5; the age variable or, more precisely, the youngest-class variable, with a negative effect determined by an odds ratio of about 0.580; and the numb variable, with a mild positive association expressed by an odds ratio of about 1.356. As numb is discrete, the effect should be interpreted by saying that a unitary increase in the number of products determines an increase in the odds of Y = 1 by a factor of about 1.356. For the age variable, there is no significant distinction between the adult class (36–50) and the mature class (51–89); what matters is whether or not the customer is young (15–35). The model as a whole is significant, with a log-likelihood score equal to 254.928, leading to rejection of the null model; this test has 3 degrees of freedom. In the next section we will assess the predictive ability of the logistic

Table 10.6 The selected logistic regression model.
                     Estimate   Std error   Wald chi-square   Pr > chi-square   Odds ratio
Intercept            -1.3028     0.1248        108.93            <.0001            –
age 15–35            -0.5440     0.1367         15.84            <.0001           0.580
instalment            1.6107     0.1371        137.98            <.0001           5.006
number of products    0.3043     0.0465         42.78            <.0001           1.356

regression model, which has the advantage of producing transparent results interpretable in a linear form. The logistic discriminant rule in this case study allows us to distinguish a priori customers that are profitable (Y = 1) from those that are less profitable, so we can devise different ways of targeting customers. On the basis of the estimated model in Table 10.6, we can see how the discriminant rule works. For each new customer that has placed a first order, we need to know three things: whether they are young (variable A), whether they pay in instalments (B) and how many products they order (C). Let ta, tb, tc be the estimated parameters of the three variables, whose values are given in Table 10.6, and let t be the estimated intercept. A customer will be profitable if the estimated probability of ordering more than once is greater than 0.5, and this corresponds to checking whether the inequality t + ta A + tb B + tc C > 0 is true. For instance, if a customer is not young, pays in instalments and buys one product, they are profitable, as (−1.3028) + (1.6107 × 1) + (0.3043 × 1) = 0.6122 > 0. If a customer is not young, does not pay in instalments and buys one product, they are likely to be less profitable, as −1.3028 + (1.6107 × 0) + 0.3043 = −0.9985. The estimated probability of reordering is about 0.648 in the first example and 0.269 in the second. The logistic regression model can therefore provide a simple scoring mechanism for each customer that can be used for decision making.

10.4.2 Radial basis function networks

For a neural network model, we will choose a radial basis function (RBF) network with one hidden node. This is because there may be a neighbourhood structure in the input variable space and we might be able to explain it.
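Before moving on, the scoring arithmetic of the logistic rule just described can be reproduced in a few lines, using the Table 10.6 estimates (a minimal sketch of the scoring mechanism, not the fitting procedure):

```python
from math import exp

# estimated coefficients from Table 10.6
t = -1.3028    # intercept
ta = -0.5440   # age 15-35 (A)
tb = 1.6107    # instalment (B)
tc = 0.3043    # number of products (C)

def score(young, instalment, n_products):
    """Linear predictor and estimated probability of reordering (Y = 1)."""
    eta = t + ta * young + tb * instalment + tc * n_products
    return eta, 1.0 / (1.0 + exp(-eta))

eta1, p1 = score(0, 1, 1)   # not young, pays in instalments, one product
eta2, p2 = score(0, 0, 1)   # not young, no instalments, one product
print(round(eta1, 4), round(p1, 3))   # -> 0.6122 0.648
print(round(eta2, 4), round(p2, 3))   # -> -0.9985 0.269
```

The customer is classified as profitable exactly when the linear predictor is positive, i.e. when the estimated probability exceeds 0.5.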
We have considered 13 explanatory variables: all those present in Table 10.5 apart from the response variable Y. As a combination function for the input variables, we take a Gaussian radial basis function with equal widths and equal heights. The activation function for the hidden node is the identity function and the activation function for the output node is the softmax function, so we obtain normalised output values corresponding to the estimated probabilities of Y = 1. The parameters of the network are learned by minimising the misclassification rate in the validation data set. Figure 10.2 shows how the misclassification rate evolves with successive iterations. The process can be stopped when the misclassification rate stabilises, and Figure 10.2 indicates that seven iterations are sufficient.

Figure 10.2 Evolution of the misclassification rate for the RBF network.

A neural network of this kind produces a list of fitted weights, with no assessment of their significance. We will not publish the list here; in any case, the largest weights correspond to the variables selected using the logistic regression model. It is not advisable to use the fitted weights of a neural network to build a discriminant rule, because the fitted weights are not corrected for sample variability, so they may depend heavily on the sample at hand. We will suspend comment on the RBF discriminant rule for the time being.

10.4.3 Classification tree models

We begin by comparing two CART tree models, based on the entropy and the Gini impurity measures. The better model is based on the Gini impurity. The results from the better tree are based on a pruning algorithm that leads to an optimal number of terminal nodes; it does this by minimising the misclassification rate. Figure 10.3 shows the behaviour of the classification accuracy (the complement of the misclassification rate) as the number of terminal nodes (leaves) increases.
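The two impurity measures being compared here are both functions of the class proportions at a node, and can be sketched directly (toy labels for illustration):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy impurity of the labels at a node."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

node = [0] * 6 + [1] * 4            # a node with a 60/40 class split (toy)
print(round(entropy(node), 3))      # -> 0.971
print(round(gini(node), 3))         # -> 0.48
```

Both measures are zero for a pure node and maximal for a balanced split; CART grows the tree by choosing, at each node, the split that most reduces the chosen impurity.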
Notice that the optimal configuration of the decision tree is obtained when the number of leaves equals 11. The corresponding tree can be described in terms of 11 association rules, pointing towards the leaves, that take the 1465 customers in the training data set and split them into 11 target groups, each with a different estimated probability of ordering again (Y = 1). Table 10.7 gives the list of these rules.

Figure 10.3 Evolution of the classification accuracy for the classification tree as the number of leaves increases.

In Table 10.7 each rule is stated according to the path that leads from the root node to the terminal node. But the list of conditions that express a rule is written in reverse order, so that nodes farther from the leaf come closer to it in the rule. Here is the association rule with the highest support; about 48.3% of the customers follow this rule:

IF (375000 ≤ FIRST AMOUNT SPENT < 2659000) AND (INSTALMENT = 0), THEN (Y = 0)

This rule corresponds to a leaf obtained by splitting all observations on instalment, and then by the inequality 375000 ≤ FIRST AMOUNT SPENT < 2659000. Customers that obey the conditions of this rule are estimated to be not profitable, as the estimated probability of Y = 1 is only 18.6%. This explains why the head of the rule is Y = 0. In general, the head of the rule follows the classical discriminant rule: if the fitted probability is less than 50% then Y = 0; otherwise Y = 1.

Table 10.7 The rules for the classification tree.
IF 2659000 <= FIRST-AMOUNT-SPENT AND INSTALMENT = 0
THEN (N: 226)  Y=1: 56.2%  Y=0: 43.8%

IF FIRST-AMOUNT-SPENT < 515000 AND INSTALMENT = 1
THEN (N: 55)  Y=1: 89.1%  Y=0: 10.9%

IF 375000 <= FIRST-AMOUNT-SPENT < 2659000 AND INSTALMENT = 0
THEN (N: 709)  Y=1: 18.6%  Y=0: 81.4%

IF NORTH = 0 AND NUMBER-OF-PRODUCTS < 2.5 AND 515000 <= FIRST-AMOUNT-SPENT AND INSTALMENT = 1
THEN (N: 99)  Y=1: 47.5%  Y=0: 52.5%

IF NORTH = 1 AND NUMBER-OF-PRODUCTS < 2.5 AND 515000 <= FIRST-AMOUNT-SPENT AND INSTALMENT = 1
THEN (N: 42)  Y=1: 73.8%  Y=0: 26.2%

IF 2.5 <= NUMBER-OF-PRODUCTS < 5.5 AND 515000 <= FIRST-AMOUNT-SPENT AND INSTALMENT = 1
THEN (N: 178)  Y=1: 78.7%  Y=0: 21.3%

IF 5.5 <= NUMBER-OF-PRODUCTS AND 515000 <= FIRST-AMOUNT-SPENT AND INSTALMENT = 1
THEN (N: 3)  Y=1: 0.0%  Y=0: 100.0%

IF FIRST-AMOUNT-SPENT < 105000 AND NORTH = 1 AND INSTALMENT = 0
THEN (N: 7)  Y=1: 0.0%  Y=0: 100.0%

IF 105000 <= FIRST-AMOUNT-SPENT < 375000 AND NORTH = 1 AND INSTALMENT = 0
THEN (N: 59)  Y=1: 72.9%  Y=0: 27.1%

IF AGE36-50 = 1 AND NORTH = 0 AND FIRST-AMOUNT-SPENT < 375000 AND INSTALMENT = 0
THEN (N: 47)  Y=1: 25.5%  Y=0: 74.5%

IF AGE36-50 = 0 AND NORTH = 0 AND FIRST-AMOUNT-SPENT < 375000 AND INSTALMENT = 0
THEN (N: 40)  Y=1: 52.5%  Y=0: 47.5%

The classification tree thus provides an immediate discriminant rule, based on partitions of the explanatory variables. To allocate a new customer, we begin at the root and take the path corresponding to the characteristics of the customer; then we see whether the terminal leaf gives a probability higher than 50% to Y = 1. The difference from the logistic regression model is that the discriminant rule is a hierarchical logical statement (based on partitions of the data) rather than an additive score (based on all the data).
The variables that appear relevant for classification are instalment, number of products, and age, which correspond to the significant variables in the logistic regression model. The tree also ascribes relevance to the first amount spent and the geographic area. For completeness, we also implement a CHAID tree model, with a significance level of 0.01, in order to obtain a more parsimonious classification tree. The resulting model has a higher misclassification rate than our Gini-based CART model, but it is parsimonious. Figure 10.4 shows the CHAID tree. The two discriminant variables are instalment and first amount spent. These two variables are also the ones first chosen in the CART tree of Table 10.7; the difference is that the CHAID tree does not go into as much depth. On the other hand, comparing the results with logistic regression, the instalment variable appears in both, but age and number of products are now replaced by first amount spent.

Figure 10.4 The chosen CHAID tree.

10.4.4 Nearest-neighbour models

In a nearest-neighbour model, the main parameter to choose is the width K; this establishes the size of the neighbourhood of the explanatory variables that will be used to predict Y. We begin with a very large value, K = 732, which corresponds to half the total number of observations in the training data set. Then we try a very low value, K = 10. It turns out that K = 10 is the better choice in terms of misclassification rate: for K = 732 it is 0.41; for K = 100 it is 0.328; and for K = 10 it is 0.316. The misclassification rate increases again for values of K below 10, so we choose K = 10. For nearest-neighbour models there is no analytic form to comment upon, as the method is memory-based, namely, it recalls the neighbouring data only when a prediction is in order. Section 10.5 evaluates its predictive performance.
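The nearest-neighbour idea, and the choice of K by validation misclassification rate, can be sketched as follows. The data here is a toy one-dimensional sample, not the case-study data, and the function names are illustrative:

```python
from collections import Counter

def knn_predict(x, train, k):
    """Majority vote among the k nearest training points (1-D toy sketch)."""
    neighbours = sorted(train, key=lambda pt: abs(pt[0] - x))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

def misclassification_rate(valid, train, k):
    """Fraction of validation points whose predicted label differs from the truth."""
    errors = sum(1 for x, y in valid if knn_predict(x, train, k) != y)
    return errors / len(valid)

# Toy data (hypothetical): class 1 clustered near 0, class 0 near 10
train = [(0.1, 1), (0.3, 1), (0.2, 1), (9.8, 0), (10.1, 0), (9.9, 0)]
valid = [(0.0, 1), (10.0, 0), (0.4, 1), (9.7, 0)]

# Pick the candidate K with the lowest validation misclassification rate
best_k = min([1, 3, 5], key=lambda k: misclassification_rate(valid, train, k))
```

With well-separated groups, as here, every candidate K classifies the validation points correctly; on real data the rates differ and the minimum selects K, as in the text's choice of K = 10.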
A nearest-neighbour model works well when the observations are well separated in the space of the explanatory variables and when the corresponding groups are fairly pure. In an ideal situation, observations should be partitioned into non-overlapping regions, possibly of small size, and each region should contain observations with a similar value of the response variable (here 0 or 1).

10.5 Model comparison

We first compare models in terms of the confusion matrices obtained on the validation data set. For all models we have chosen a cut-off threshold of 50%, and the errors are derived on that basis. Table 10.8 shows the confusion matrix for the final logistic regression model. In this table and the following ones, frequencies are expressed as percentages. Table 10.8 shows that the model predicts as non-profitable (Y = 0 predicted) customers that in fact are profitable (Y = 1 observed) in 22.92% of the cases; this is the type I error. On the other hand, it predicts as profitable (Y = 1 predicted) those that are not (Y = 0 observed) in 10.91% of the cases; this is the type II error. Whether the logistic regression model leads to a valid discriminant rule depends on marketing evaluations of the relative costs of the two errors. Usually, if a customer is targeted as profitable, a direct marketing campaign is dedicated to them by mail, telephone calls, etc. If a customer is not targeted as profitable, they will not be part of the campaign. Therefore the cost of the type I error depends on the probability of losing customers that are not targeted, although they would be profitable; the cost of the type II error is the money allocated by the company to follow customers that are probably not worthy of attention. From Table 10.8, the logistic regression model leads to a higher type I error, and should be chosen if type II error is deemed more costly than type I error. Table 10.9 shows the confusion matrix for the chosen CART tree model.
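The confusion-matrix cells and the two error probabilities at the 50% cut-off can be computed from fitted probabilities as follows; the function name and the small data sample are illustrative:

```python
def confusion_rates(probs, observed, cutoff=0.5):
    """Confusion-matrix cells as percentages of all cases (sketch).

    Following the convention in the text: type I error = observed Y=1
    predicted as 0; type II error = observed Y=0 predicted as 1.
    """
    n = len(observed)
    cells = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for p, y in zip(probs, observed):
        pred = 1 if p >= cutoff else 0
        cells[(y, pred)] += 1          # keyed by (observed, predicted)
    pct = {k: 100.0 * v / n for k, v in cells.items()}
    return {"type_I": pct[(1, 0)], "type_II": pct[(0, 1)],
            "total_error": pct[(1, 0)] + pct[(0, 1)]}

# Illustrative probabilities and observed labels (not the book's data)
rates = confusion_rates([0.9, 0.2, 0.6, 0.4], [1, 0, 0, 1])
```

The total misclassification rate is just the sum of the two off-diagonal percentages, which is how the figures such as 33.83% and 29.74% in this section are obtained.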
Table 10.8 Confusion matrix for the logistic regression model.

             Predicted 0   Predicted 1
Observed 0      48.02         10.91
Observed 1      22.92         18.14

Notice that the total misclassification rate for the classification tree is slightly lower than for the logistic regression model, 29.74% against 33.83%. Furthermore, the probabilities of the two types of error are rather balanced. The tree model should therefore be chosen in the absence of information on the costs of the two errors, or when the costs are roughly equivalent.

Table 10.9 Confusion matrix for the CART classification tree.

             Predicted 0   Predicted 1
Observed 0      43.52         15.42
Observed 1      14.32         26.74

Table 10.10 shows the confusion matrix for the RBF network. Notice that the total misclassification rate is about 32.47%, lower than for logistic regression but higher than for CART. The probabilities of the two errors are unbalanced, as with logistic regression. Overall we can draw the same conclusions as for logistic regression. However, the slight improvement does not justify the increased model complexity and difficulty of interpretation compared with logistic regression. Table 10.11 shows the confusion matrix for the nearest-neighbour model. It turns out that the nearest-neighbour model has the same total misclassification rate as the tree model, 29.74%, and is therefore as good overall. But the probabilities for type I and type II errors are slightly more unbalanced, and the nearest-neighbour model has the lowest type I error probability among all the considered models. Therefore, if type I error costs are higher than type II error costs, a nearest-neighbour model should be chosen. If the relative error costs are not a consideration, both the CART tree and the nearest-neighbour model can be chosen, as they minimise the misclassification error rate over the validation set.

Table 10.10 Confusion matrix for the RBF neural network.
             Predicted 0   Predicted 1
Observed 0      47.34         11.60
Observed 1      20.87         20.19

Table 10.11 Confusion matrix for the nearest-neighbour model.

             Predicted 0   Predicted 1
Observed 0      41.34         17.60
Observed 1      12.14         28.92

So far we have drawn our conclusions using the validation data set. But since data mining models are often built using results on the validation data set, it may be of interest to compare models on a third data set, named the test data set. To do this, the available data should be partitioned into three data sets instead of two: a training data set (60% of the data), a validation data set (20% of the data) and a test data set (20% of the data). Then the predictive power of the models can be compared on the test data set, to obtain a more neutral evaluation. When there are only two data sets, not three, the second data set (for validation) is sometimes indirectly used to build a model (e.g. to prune a tree, to choose the number of hidden nodes in a neural network, or to choose the number of neighbours in a nearest-neighbour method); consequently, the outcome of the validation may be too optimistic. Splitting the data set into three implies a loss of information, as the test data set is never used for fitting, and the number of observations in the training data set is reduced. The extra sampling process introduces a new source of variability and could increase the instability of the results. We consider a threefold partitioning accomplished in a stratified way, to maintain the proportions of Y = 1 and Y = 0 in each of the three data sets. We have placed 60% of the data in the training data set, 20% in the validation data set and 20% in the test data set. Table 10.12 shows the misclassification rates for the models on all three partitions: training, validation and test. On the test set, the tree model has the lowest error, followed by the nearest-neighbour model (called MBR, for memory-based reasoning), then the RBF network and finally the logistic regression model.
The same ranking of the models is obtained on the training data set, but on the validation data set there is a tie between MBR and the tree model. The CHAID tree in Figure 10.4 leads to a misclassification error of 0.3237 on the test set, higher than the CART tree and MBR.

Table 10.12 Summary comparison of misclassification errors.

Now would be the time to approach marketing experts for a cost function using the costs of the type I and type II errors. But even without this information, we can still go a bit further in our model selection. Up to now we have used a cut-off threshold of 50%, but this need not be the only choice. In particular, the costs of the errors may lead us to change the cut-off. For instance, if type II error is deemed more costly, a higher cut-off can be chosen, so as to predict fewer events Y = 1 and decrease type II errors; but this will increase type I errors. Conversely, if type I error is deemed more costly, a lower cut-off can be chosen. In the absence of cost considerations, the models should be compared using ROC curves. Figure 10.5 shows the ROC curves of the four models being compared. The vertical axis is the sensitivity, equal to 1 minus the probability of type I error; the horizontal axis plots 1 minus the specificity, equal to the probability of type II error. Notice that the ROC curves for all four models are rather similar, apart from a gap in the central part of the curve, where the tree model and the nearest-neighbour model (called user) are better. This is the area where the 50% cut-off error probabilities fall.

Figure 10.5 ROC curves for the considered models. The curve called user is the MBR model.

Table 10.13 Comparison of Gini indexes of performance.

              Logistic regression    RBF      Tree    Nearest neighbour
Gini index          0.4375          0.4230   0.4445        0.5673
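The ROC points and a Gini index of the kind reported in Table 10.13 can be computed from fitted probabilities over a grid of cut-offs. In the sketch below the Gini index is taken as 2·AUC − 1, one common convention for twice the area between the ROC curve and the 45 degree line; the data is illustrative, not the case-study sample:

```python
def roc_points(probs, labels, cutoffs):
    """(1 - specificity, sensitivity) at each cut-off (sketch)."""
    pos = sum(labels)
    neg = len(labels) - pos
    pts = []
    for c in cutoffs:
        tp = sum(1 for p, y in zip(probs, labels) if p >= c and y == 1)
        fp = sum(1 for p, y in zip(probs, labels) if p >= c and y == 0)
        pts.append((fp / neg, tp / pos))
    return pts

def gini_index(probs, labels, cutoffs):
    """Gini index of performance, here computed as 2*AUC - 1 with the
    AUC approximated by the trapezoid rule over the ROC points."""
    pts = sorted(roc_points(probs, labels, cutoffs) + [(0.0, 0.0), (1.0, 1.0)])
    auc = sum((x2 - x1) * (y1 + y2) / 2.0
              for (x1, y1), (x2, y2) in zip(pts, pts[1:]))
    return 2.0 * auc - 1.0

# Nine equally spaced cut-offs from 10% onwards, as in the text
cutoffs = [i / 10 for i in range(1, 10)]
g = gini_index([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0], cutoffs)
```

With perfectly separating probabilities, as in this toy example, the index reaches its maximum of 1; the values in Table 10.13 lie well below that.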
Conversely, in the upper right part of the graph, the neural network model and the logistic regression model are better, as they lead to a higher sensitivity (lower type I error). All the curves are similar for high values of the cut-off, corresponding to low values of sensitivity and 1 − specificity. To decide between the curves, we need more information about the costs. But without this information we can calculate a summary measure of performance for the models, which corresponds to the area between the ROC curve and the 45° line; it is called the Gini index of performance (Section 6.6). We calculate the Gini index for all four models on the test data set, and for nine equally spaced cut-off points (from 10% onwards). The values are given in Table 10.13. The higher the Gini index, the better the considered model. Therefore the nearest-neighbour model is the best model, followed by the tree model, the logistic regression model and the RBF model. To conclude, the nearest-neighbour model should be chosen in the absence of cost considerations (and cut-off considerations) or when type I error is more costly. We should make this choice if we are not interested in having an explicit discriminant rule that can decide whether or not a customer is profitable; otherwise we should consider a classification tree model. If type II error is more costly, a logistic regression model would also be fine.

10.6 Summary report

• Context: this case study concerns analytic customer relationship management (CRM). In this framework the main objective is to encourage customer loyalty, to obtain as much value as possible from customers. CRM can be applied in a fairly broad context; in general, it can be applied to any situation where a company's customer database is used to classify clients into classes or segments that identify different targets for future actions of the company.
A common application is to identify loyal customers; another common application is churn analysis, which identifies clients who abandon a company, perhaps for competitors. The next case study looks at the related problem of scoring, where each client is assigned a score before classification. The scoring aspect was not emphasised here, as the main interest was in classifying the customers.

• Objectives: the aim of the analysis is to classify customers into homogeneous classes that identify different target profiles. Customers are divided into more valuable customers (more than one order in the considered period) and less valuable customers (only one order in the considered period).

• Organisation of the data: data is usually contained in a so-called customer table, or marketing database (data mart). We considered a common situation, especially for small and medium-sized companies, in which there is no data warehouse and therefore a customer database has to be built up from operational databases. This process is rather difficult and typically eliminates much information due to inconsistencies or missing data. To obtain a clean and statistically representative data mart (e.g. with regard to the time of entry into the database), we had to take a sample whose size appears very small when compared with the original customer database. This shows just how important it is to consider how the data will be analysed when building up a data warehouse or a system of operational databases. At the end of the process, besides the binary response variable indicating the value of a customer, we obtained seven explanatory variables, five discrete and two continuous.

• Exploratory data analysis: this was conducted by analysing the bivariate distributions involving the binary response variable and the seven candidate predictors taken one at a time.
• Model specification: the analysis objective suggested a predictive classification model that allocates customers to the classes more valuable and less valuable. We considered four types of model: logistic regression, classification trees, RBF networks and nearest-neighbour models.

• Model comparison: the models were compared using a computational cross-validation approach. We compared classification errors of the models on a validation data set and then on a more independent test data set. We used the standard discriminant threshold rule of 50%. Then we compared ROC curves, which are drawn by varying the discriminant threshold. To give a summary performance measure, we calculated the Gini index for each ROC curve. The result is that, in the absence of considerations on error costs or when type I error is deemed more costly, the classification tree performs best with a 50% threshold and the nearest-neighbour model performs best with a varying threshold. If type II error is more costly, a logistic regression model would also be fine.

• Model interpretation: on the basis of model comparison, it seems that classification trees and nearest-neighbour models are the best tools for the considered predictive task. However, there is an interpretational difference that should be taken into account. Classification trees produce rules that are easily interpretable in logical terms, but nearest-neighbour models do not deliver explicit rules. Nearest-neighbour models are non-parametric tools, hence they have the advantage of not requiring strong modelling assumptions, but they are harder to interpret. The choice between the two tools depends on who will be using the results. If the results will go to trained statisticians or IT experts, then it probably does not matter which model is chosen, but trees are probably better for business experts who would like a more user-friendly picture of what is going on.
CHAPTER 11

Credit scoring

11.1 Objectives of the analysis

This case study applies data mining methods to the problem of credit scoring for consumer credit. It looks at how to evaluate the credit reliability of individuals who ask for credit when buying goods or services. The various credit operators (e.g. banks, investment companies and credit card societies) receive thousands of applications every day, so they need a system to help them grant or refuse requests. Recent studies have proposed decision support systems, or scoring models. They are fast, objective and inexpensive, which makes them extremely efficient when compared with traditional scores based on experience. This is especially true for consumer credit, where the monetary value of each loan is quite small. We will take customer data from an important bank in southern Germany and use it to construct a scoring model for consumer credit.

Automatic systems are widely used for evaluating creditworthiness, especially by banks and investment companies. There are many reasons for this development, including the regulatory policy recently established by the Basel Committee of central banks (www.bis.org). These regulations include a detailed prescription of how credit risk should be calculated by financial institutions, and they apply to different types of credit, including consumer credit and business credit. Consumer credit means lending to individuals and households for the acquisition of goods or services. The exercise of this activity is reserved by law to banks, specialised intermediary backers and, in some forms, to the sellers of goods or services.

The term 'credit scoring' describes the statistical methods used to classify possible creditors into two classes of risk: good and bad. Statistical models for credit scoring, often known as scorecard models, use explanatory variables obtained from information about the applicant to estimate the probability of a loan's non-repayment.
A credit request is granted or refused by comparing the estimated probability with a suitable threshold chosen by the management. The statistical methods most often used to develop scorecards are neural networks, logistic regression and classification trees. The literature on credit scoring and credit scorecard models is quite vast; see for instance Hand and Henley (1996).

Applied Data Mining. Paolo Giudici. 2003 John Wiley & Sons, Ltd. ISBNs: 0-470-84679-8 (Paper); 0-470-84678-X (Cloth)

The data to construct a scorecard is generally obtained from a sample of applicants to whom credit has already been granted, and for whom it is known whether or not they turned out to be reliable. To calculate the score for a specific credit request, the customer's data is compared with the scorecard, in order to classify the new applicant into one of the observed behavioural patterns and determine a predictive score. Often a scorecard model is able to assign a score to every measurable characteristic of the applicant; these scores are then summed to produce an overall score.

11.2 Description of the data

The data set contains 1000 observations on 1000 credit applicants to an important bank in southern Germany; see Fahrmeir and Hamerle (1994) for a more detailed description of the data. We consider 21 variables; one of them is the binary variable Y, credit reliability (Y = 0 for the reliable applicants, Y = 1 for the non-reliable), which we treat as the response or target variable. The other 20 variables are treated as explanatory variables. They can be grouped in the following way. Table 11.1 shows the data matrix.
• Socio-demographic variables
  • sex and marital status
  • age
  • residence: number of years resident in the present home
• Personal and financial variables
  • account: whether the applicant holds a bank account
  • bank book: whether the applicant holds a bank book
  • previous rep: history of past repayments
  • debts: amount of previous debts
  • concurrent: whether other fundings have been requested
  • employment: type of employment
  • working years: number of working years
  • foreign: whether the applicant is a foreign worker
  • family: number of people the applicant is responsible for
• Variables specific to the loan
  • loan: amount of the loan
  • purpose: purpose of the loan
  • deadline: deadline of the loan
  • monthly interests
  • others: whether other concurrent debtors are specified
• Indicators of wealth
  • house: whether the applicant owns a house
  • effects: whether the applicant has other personal guarantees
  • telephone: whether a telephone is available

Table 11.1 The structure of the data matrix.

Applicant    Y    X1    X2    ...           X20
1            1    1     18    ... 1049 ...   1
...
1000         0    1     30    ... 6350 ...   1

Only 3 of the 20 explanatory variables are continuous: deadline, loan and age. The other 17 are discrete, but only 2 of these are binary: telephone and foreign. The other 15 discrete variables have different numbers of levels; purpose has 11. The data is stratified into 300 customers selected as non-reliable (Y = 1, loans not repaid) and 700 as reliable (Y = 0, loans repaid). Therefore the percentages of good and bad customers are fixed in advance. This kind of stratification affects the results obtained from the statistical models; they will not be the same as the results from a simple random sample. The data set also has an inherent bias, as it contains only those individuals actually given a loan. There are others who did not get a loan, and we do not know whether or not they would have been at risk. Although these considerations do not alter the validity of the analysis, we should remember them when we come to interpretation.
Even though we will lose information, to simplify the analysis we will modify the original data set to obtain exclusively binary variables. Binarisation allows us to investigate the odds ratios. For the quantitative variables we mainly use the median: we create two levels, one corresponding to values higher than the median and one to values lower than the median. For example, deadline had values in the interval 0–72 months, but we have modified it as in Table 11.2.

Table 11.2 Classification of the deadline variable.

Previous classes         New class    Interpretation
Deadline > 18 months     1            Long-term
Deadline < 18 months     0            Short-term

For all the other variables, we give the value 0 to the category that is less reliable and the value 1 to the category that is more reliable. Take previous rep as an example: we give the value 1 to the category corresponding to impeccable previous repayments and the value 0 to the category corresponding to late previous repayments. Some discrete variables have to be reclassified. For instance, account is subdivided into two new binary variables: good-account and bad-account. Table 11.3 shows the new and old classifications. The variable sex and marital status is divided into two distinct binary variables: sex and marital status. Table 11.4 summarises this representation.

Table 11.3 Classification of the account variable.

bad-account   good-account   account (original)       Interpretation
1             0              2: negative balance      Bad
0             1              4: balance > DM 200      Good
0             0              3: balance in [0, 200]   Neutral
0             0              1: no account            Neutral

Table 11.4 Classification of the sex and marital status variables.
sex   marital status   sex and marital status (original)
0     0                1: man, bachelor, divorced or separated
1     0                2: woman, bachelor, divorced or separated
0     1                3: man, married or widowed
1     1                4: woman, married or widowed

11.3 Exploratory data analysis

We begin with a univariate analysis to investigate the strength of the links between each explanatory variable and the response variable. This will indicate the efficiency of each explanatory variable in identifying the non-reliable clients (Y = 1). The explanatory variables that are more associated with the response variable should be better able to determine a client's reliability. Although it neglects interactions between the variables, univariate analysis often proves very useful; it is an important preliminary step in setting up a multivariate model.

To investigate the association between the response variable and each of the 22 explanatory variables, we construct the odds ratios. Here we put Y = 1 first and Y = 0 second to help with interpretation. The resulting odds ratio is the reciprocal of the one we would obtain using the conventional order (Section 3.4). Now the higher the odds ratio, the more negative the association of the explanatory variable with the response variable, and the higher its positive association with credit reliability. In other words, the results indicate the efficiency of each individual variable as an indicator of creditworthiness. Table 11.5 shows the odds ratios and the corresponding 95% confidence intervals; the last column shows the p-value of Pearson's chi-squared statistic. The 22 explanatory variables are tabulated in decreasing order of the odds ratio. The first eight variables in the table have a negative association with the response variable; in fact, the odds ratio takes a value higher than 1, and 1 does not fall in the confidence interval.
The last five variables have a positive association with the response variable, since the odds ratio takes values in the interval [0, 1) and 1 does not fall in the confidence interval.

Table 11.5 Univariate odds ratios with the response variable.

Variable            Odds ratio   95% Confidence interval   Association   Chi-square p-value
good-account        5.459        (3.857; 7.725)            (−)           1.41E-24
previous rep        3.958        (2.529; 6.193)            (−)           1.21E-09
bank book           2.758        (1.957; 3.888)            (−)           3.05E-09
deadline            1.842        (1.402; 2.421)            (−)           1.22E-05
working years       1.781        (1.311; 2.421)            (−)           2.47E-04
purpose             1.679        (1.269; 2.220)            (−)           2.85E-04
age                 1.676        (1.274; 2.206)            (−)           2.48E-04
marital status      1.532        (1.160; 2.022)            (−)           3.17E-03
monthly interests   1.342        (1.008; 1.787)            (−?)          0.045
loan                1.241        (0.946; 1.627)            NO            0.129
debts               1.233        (0.928; 1.639)            NO            0.153
telephone           1.177        (0.892; 1.554)            NO            0.261
residence           1.031        (0.785; 1.354)            NO            0.835
family              1.018        (0.700; 1.481)            NO            1.000
others              0.994        (0.624; 1.583)            NO            1.000
employment          0.904        (0.651; 1.257)            NO            0.563
sex                 0.769        (0.584; 1.011)            NO            0.067
effects             0.642        (0.489; 0.842)            (+)           1.49E-03
bad-account         0.568        (0.423; 0.763)            (+)           1.88E-04
concurrent          0.550        (0.395; 0.765)            (+)           4.06E-04
house               0.531        (0.398; 0.710)            (+)           1.99E-05
foreign             0.273        (0.096; 0.778)            (+)           9.42E-03

The variable monthly interests exhibits a probable negative association: its odds ratio is greater than 1, but the value 1 lies only slightly outside the confidence interval. We will leave it in for now and use the multivariate analysis to make a firmer decision. The remaining explanatory variables show no significant association with the response variable, since the confidence interval contains the value 1. These conclusions are all confirmed by the p-values of the chi-squared statistics in the last column of the table.
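Odds ratios and confidence intervals of the kind reported in Table 11.5 can be computed from a 2×2 contingency table. The sketch below uses the standard Woolf (log-odds) approximation for the 95% interval; the counts are illustrative, not the book's data:

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and approximate 95% CI from a 2x2 table (Woolf method).

    a, b = counts for X=1 with outcome 1 and 0;
    c, d = counts for X=0 with outcome 1 and 0.
    """
    or_hat = (a * d) / (b * c)
    # Standard error of log(OR): sqrt of the sum of reciprocal cell counts
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lo = math.exp(math.log(or_hat) - z * se_log)
    hi = math.exp(math.log(or_hat) + z * se_log)
    return or_hat, (lo, hi)

# Illustrative counts (hypothetical): 30/70 for X=1, 120/80 for X=0
or_hat, (lo, hi) = odds_ratio_ci(30, 70, 120, 80)
significant = not (lo <= 1.0 <= hi)   # association when 1 lies outside the CI
```

The significance screen in Table 11.5 is exactly this check: a variable is flagged (−) or (+) when the value 1 falls outside its interval.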
• For the first eight variables and the last five variables the p-value is less than 0.05; this means that the null hypothesis is rejected and the existence of an association is accepted.
• For monthly interests the p-value is slightly under 0.05; this means the association with the response variable is borderline significant.
• The remaining variables have a p-value greater than 0.05; this means the null hypothesis is not rejected.

Table 11.6 shows how we derive the odds ratios and allows us to draw the following conclusions.

Table 11.6 Interpretation of the odds ratios.

Variable            Odds for X = 1, θ1   Odds for X = 0, θ2   Odds ratio   Association
good-account        0.594                3.243                5.459        (−)
previous rep        0.291                1.152                3.958        (−)
bank book           0.777                2.143                2.758        (−)
deadline            0.730                1.344                1.842        (−)
working years       0.650                1.157                1.781        (−)
purpose             0.720                1.209                1.679        (−)
age                 0.788                1.322                1.676        (−)
marital status      0.767                1.175                1.532        (−)
monthly interests   0.901                1.210                1.342        (−?)
loan                0.901                1.116                1.241        NO
debts               0.928                1.114                1.233        NO
telephone           0.937                1.104                1.177        NO
residence           0.983                1.041                1.031        NO
family              0.997                1.016                1.018        NO
others              1.000                0.996                0.994        NO
employment          1.081                0.978                0.904        NO
sex                 1.115                0.857                0.769        NO
effects             1.253                0.804                0.642        (+)
bad-account         1.178                0.669                0.568        (+)
concurrent          1.129                0.620                0.550        (+)
house               1.217                0.646                0.531        (+)
foreign             3.541                0.966                0.273        (+)

• Applicants who hold a good current account (balance of more than DM 200) with a creditor bank are more reliable. In fact, in going from customers who have a medium account or a negative balance (good-account = 0) to customers who have a good account (good-account = 1), the odds of repayment increase from 0.594 to 3.243. Therefore there exists a negative association between unreliability and the possession of a good current account; the exact measure of this association is given by the odds ratio.
In the case of good-account, when the account balance is greater than DM 200, the odds of repayment are about 5.46 times the odds of repayment for clients who have a medium account or a negative balance.

• German workers are more reliable than foreign workers. In going from clients who are German workers (foreign = 0) to clients who are foreign workers (foreign = 1), the odds of repayment are reduced from 3.541 to 0.966. This means that a positive association exists between being a foreign worker and being non-reliable. The exact measure of this association is expressed by the odds ratio: the odds of repayment for foreign workers are 0.273 times those for German workers. In other words, the odds of repayment for German workers are around 3.6 times (1/0.273) the odds of repayment for foreign workers.

• Those who own effects (effects = 1) or who own a house (house = 1) are less reliable than those who do not. This could be because house owners have already taken out credit in the form of a mortgage; having to cope with a mortgage could make someone a non-reliable client.

11.4 Model building

Having performed a univariate exploratory analysis, we move on to a multivariate analysis, by specifying a statistical model. We are trying to combine all the signals from the different explanatory variables to obtain an overall signal that indicates the reliability of each applicant. In order to choose a model, we have to clarify the nature of the problem. It is clear that we have a predictive classification problem, as the response variable is binary and our aim is to predict whether a credit applicant will be reliable or non-reliable. We will concentrate on logistic regression, classification trees and multilayer perceptrons, the methods most often used for predictive classification in general and credit scoring in particular.
We also consider an approach based on bagging, which combines the results from different models. Other methods can also be adopted, notably nearest-neighbour models and probabilistic expert systems, but we will not consider them here.

11.4.1 Logistic regression models

We choose a logistic regression model using a forward selection procedure with a significance level of 0.05. To check the model, we try a stepwise procedure and a backward procedure, then verify that all three models are similar. Table 11.7 describes the forward selection procedure.

Table 11.7  Results of the forward selection procedure.

Step  Effect entered      Effect removed  df  Number in  Score chi-square  Wald chi-square  P > chi-square
1     good-account        –               1   1          103.9648          –                <0.0001
2     previous rep        –               1   2          24.4942           –                <0.0001
3     bank book           –               1   3          17.3725           –                <0.0001
4     deadline            –               1   4          18.8629           –                <0.0001
5     house               –               1   5          8.3749            –                0.0038
6     age                 –               1   6          7.0758            –                0.0078
7     purpose             –               1   7          8.4775            –                0.0036
8     foreign             –               1   8          7.9316            –                0.0049
9     monthly interests   –               1   9          6.9678            –                0.0083
10    marital status      –               1   10         5.7610            –                0.0164

The starting point is the simplest model, containing only the intercept. Then, at every step, we compare the deviances to decide whether or not to add an explanatory variable. SAS Enterprise Miner uses the score chi-squared statistic in the forward procedure and the Wald chi-squared statistic in the backward procedure. According to Table 11.7, the final model is obtained in step 10; besides the intercept, it includes the following explanatory variables:

X1 = deadline            X6 = age
X2 = previous rep        X7 = house
X3 = purpose             X8 = foreign
X4 = bank book           X9 = good-account
X5 = monthly interests   X10 = marital status

To check the overall quality of the final model, we calculate the likelihood ratio test G2 for the final model (H1) against the null model (H0). It turns out that G2 = 219.89 with 10 degrees of freedom.
As the corresponding p-value of the test is lower than 0.0001, the null hypothesis is rejected, implying that at least one of the model's coefficients in Table 11.7 is significant. The model has an AIC score of 1023.828 and a BIC score of 1077.814. The total misclassification rate is 0.244. The misclassification rate of a model with all variables present (i.e. without any stepwise model selection) is 0.252, slightly higher than 0.244. Table 11.8 shows the maximum likelihood estimates corresponding to the final model and the statistical significance of the parameters. For all the explanatory variables we obtain a p-value lower than 0.05, therefore the null hypothesis is always rejected. This means that all 10 explanatory variables selected using the stepwise procedure are significantly associated with the response variable and are useful in explaining whether an applicant is reliable (Y = 0) or not (Y = 1).

Table 11.8  Maximum likelihood estimates of the parameters.

Parameter          df  Estimate   Standard error  Wald chi-square  P > chi-square
intercept          1   0.5030     0.6479          0.6029           0.4375
deadline           1   −0.6027    0.1567          14.7914          0.0001
previous rep       1   −1.0479    0.2573          16.5875          <0.0001
purpose            1   −0.5598    0.1632          11.7703          0.0006
bank book          1   −0.7870    0.1937          16.5063          <0.0001
monthly interests  1   −0.4754    0.1660          8.2009           0.0042
age                1   −0.4203    0.1603          6.8701           0.0088
house              1   0.4934     0.1683          8.5914           0.0034
foreign            1   1.3932     0.5794          5.7825           0.0162
good-account       1   −1.4690    0.1863          62.1582          <0.0001
marital status     1   −0.3910    0.1633          5.7325           0.0167

Now that we have a model, we need to interpret it. A stepwise procedure may produce unstable estimates, which are conditional on the selected model. A model-averaging approach, such as a full Bayesian approach, may solve this problem, but it will make the model more complicated (Giudici, 2001a).
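Two of the reported figures can be reproduced with a few lines of Python. This is a sketch under stated assumptions: n = 1000 observations, k = 11 estimated parameters (10 slopes plus the intercept), and the usual definitions AIC = −2 log L + 2k and BIC = −2 log L + k log n. The p-value of a Wald chi-square statistic with 1 degree of freedom follows from the complementary error function, since the square of a standard normal is chi-square with 1 df:

```python
import math

# P(chi2 > x) for a chi-square with 1 degree of freedom:
# if Z ~ N(0,1) then Z^2 is chi-square(1), so the survival
# function is erfc(sqrt(x/2)).
def chi2_sf_1df(x):
    return math.erfc(math.sqrt(x / 2.0))

# Wald statistic for marital status in Table 11.8
p_marital = chi2_sf_1df(5.7325)
print(round(p_marital, 4))  # close to the tabulated 0.0167

# Consistency check of the reported AIC and BIC: with n = 1000 and
# k = 11 parameters, BIC - AIC should equal k * (ln n - 2).
aic, bic, n, k = 1023.828, 1077.814, 1000, 11
print(abs((bic - aic) - k * (math.log(n) - 2)) < 0.01)  # True
```

The second check confirms that the two reported criteria are mutually consistent under these assumptions.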
The logistic regression model obtained can be described by the following formula:

log [P(Y = 1) / P(Y = 0)] = β0 + β1X1 + β2X2 + ··· + β10X10

in which the response variable is credit reliability (Y = 0 if yes, Y = 1 if no) and the explanatory variables are as described in Section 11.2. Table 11.9 shows the parameter estimates and the estimated odds ratios for each variable. We can interpret Table 11.9 using the model formula. This formula is constructed by setting Y = 1 when the debtor is non-reliable, so we can say that a parameter with a positive sign indicates that the corresponding variable reduces the debtor's reliability. Conversely, a parameter with a negative sign indicates that the corresponding variable increases the debtor's reliability.

Table 11.9  Interpretation of the estimated model.

Variable           β̂          e^(−β̂)
intercept          0.5030     0.605
deadline           −0.6027    1.827
previous rep       −1.0479    2.852
purpose            −0.5598    1.750
bank book          −0.7870    2.197
monthly interests  −0.4754    1.609
age                −0.4203    1.522
house              0.4934     0.611
foreign            1.3932     0.248
good-account       −1.4690    4.345
marital status     −0.3910    1.479

The variable good-account has a parameter with a negative sign (β̂ = −1.4690); this means that clients who have a good current account, above DM 200, present a probability of repayment greater than clients who have a medium account or a negative balance. Analogous arguments are valid for deadline, previous rep, purpose, bank book, monthly interests, age and marital status. We can therefore list eight variables that reduce the risk of non-repayment, or increase the probability of repayment:

• A good current account
• Previous impeccable repayments
• The possession of a bank book
• A loan with a short-term deadline
• A business purpose for the loan
• The presence of high rates of interest
• Not being single
• Age above 33 years

Foreign workers who ask for a loan (foreign = 1) are less reliable than German workers.
This is indicated by the positive sign of the coefficient, β̂ = 1.3932. Consequently, there is a direct relationship between being a foreign worker and being a non-reliable applicant. As we saw during the exploratory phase, clients who own a house and perhaps have a mortgage (house = 1) are less reliable than clients who do not own their own house. This is indicated by the coefficient β̂ = 0.4934, which has a positive sign.
The odds ratio measures the strength of association between each explanatory variable and the response variable. Table 11.10 compares the estimated odds ratios with the values from the exploratory analysis. When a client possesses a good current account (good-account = 1), their probability of repayment is 4.345 times that of a client who does not. Analogous arguments are valid for previous rep, bank book, deadline, purpose, age, monthly interests and marital status. The variables house and foreign are positively associated with the response variable. The probability of repayment for foreign workers (foreign = 1) is 0.248 times that for German workers. In other words, the probability of repayment for German workers is around 4 times the value for foreign workers.

Table 11.10  Comparison between univariate and multivariate odds ratios.

                   Odds ratios
Variable           Multivariate  Univariate
deadline           1.827         1.842
previous rep       2.852         3.958
purpose            1.750         1.679
bank book          2.197         2.758
monthly interests  1.609         1.342
age                1.522         1.676
house              0.611         0.531
foreign            0.248         0.273
good-account       4.345         5.459
marital status     1.479         1.532

These multivariate odds ratios are more reliable than the univariate odds ratios. They give a better description of the interrelationships between the variables, as each individual association is corrected by taking into account the indirect effects on the response variable that occur through the remaining explanatory variables.
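The multivariate odds ratios in Tables 11.9 and 11.10 follow directly from the coefficients as e^(−β̂); the sign is flipped here because the model codes Y = 1 for non-reliability while the odds ratios are stated in terms of repayment. A sketch reproducing the column:

```python
import math

# Maximum likelihood estimates from Table 11.8
beta_hat = {
    "deadline": -0.6027,
    "previous rep": -1.0479,
    "purpose": -0.5598,
    "bank book": -0.7870,
    "monthly interests": -0.4754,
    "age": -0.4203,
    "house": 0.4934,
    "foreign": 1.3932,
    "good-account": -1.4690,
    "marital status": -0.3910,
}

# Multivariate odds ratios, e^(-beta), as in Table 11.9
odds_ratios = {v: round(math.exp(-b), 3) for v, b in beta_hat.items()}
print(odds_ratios["good-account"])  # 4.345
print(odds_ratios["foreign"])       # 0.248
```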
11.4.2 Classification tree models

SAS Enterprise Miner allows us to fit three types of tree model. We begin with one based on the CHAID algorithm and the chi-squared impurity measure. To obtain a parsimonious tree, we use a significance level of 0.05 in the stopping rule. Figure 11.1 and Table 11.11 present the results from the CHAID classification tree analysis. Figure 11.1 is self-explanatory; the total number of terminal nodes is 6, each obtained through successive splits on the chosen binary variables. At each split, the only choice is to decide which variable to use for the split. The total number of splitting variables in the final tree is 4: good-account, bank book, previous rep and deadline. These variables are the first four obtained by the forward selection procedure for logistic regression (Table 11.7). From the classification tree we can see that good-account acts on its own, but the other variables interact with each other. This reveals a possible lack of fit when using a logistic regression model that considers only the separate effects of each explanatory variable and no interaction effects. Interactions can obviously be introduced, but this considerably increases the calculations and makes the model harder to interpret.

Table 11.11  Results for the CHAID classification tree.

IF GOOD-ACCOUNT = 1
    N: 394, 1: 11.7%, 0: 88.3%
IF BANK-BOOK = 0 AND PREVIOUS-REP = 0 AND GOOD-ACCOUNT = 0
    N: 59, 1: 76.3%, 0: 23.7%
IF BANK-BOOK = 1 AND PREVIOUS-REP = 0 AND GOOD-ACCOUNT = 0
    N: 14, 1: 28.6%, 0: 71.4%
IF DEADLINE = 1 AND PREVIOUS-REP = 1 AND GOOD-ACCOUNT = 0
    N: 295, 1: 29.5%, 0: 70.5%
IF BANK-BOOK = 1 AND DEADLINE = 0 AND PREVIOUS-REP = 1 AND GOOD-ACCOUNT = 0
    N: 52, 1: 28.8%, 0: 71.2%
IF BANK-BOOK = 0 AND DEADLINE = 0 AND PREVIOUS-REP = 1 AND GOOD-ACCOUNT = 0
    N: 186, 1: 55.4%, 0: 44.6%

Figure 11.1  Results for the CHAID classification tree.
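As an illustration of the chi-squared criterion underlying CHAID, the statistic for a candidate binary split can be computed from the 2×2 table of split side versus class. This sketch uses the root split on good-account, with approximate counts reconstructed from Table 11.11 (394 applicants with good-account = 1, of which 11.7% ≈ 46 are bad; the remaining 606 then contain about 254 bad applicants, since the data set is 30% bad overall):

```python
# Pearson chi-squared statistic for a 2x2 split table (a CHAID-style
# splitting criterion). Rows are the two sides of the split, columns
# the two classes; cells are a,b (first row) and c,d (second row).
def chi2_2x2(a, b, c, d):
    n = a + b + c + d
    observed_expected = [
        (a, (a + b) * (a + c) / n),
        (b, (a + b) * (b + d) / n),
        (c, (c + d) * (a + c) / n),
        (d, (c + d) * (b + d) / n),
    ]
    return sum((o - e) ** 2 / e for o, e in observed_expected)

# bad/good counts on each side of the good-account split
# (approximate counts reconstructed from Table 11.11)
x2 = chi2_2x2(46, 348, 254, 352)
print(round(x2, 1))  # close to the score chi-square for good-account
```

The result is close to the score chi-square reported for good-account in Table 11.7 (103.9648), which is reassuring, although the counts used here are reconstructed rather than taken from the raw data.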
Table 11.11 shows the chosen tree in the form of if-then rules, where the if condition corresponds to a tree path that leads to the then result of a terminal node, characterised by the indicated absolute frequencies (N), percentage of bad applicants (1) and percentage of good applicants (0). The six tree rules can be interpreted as association rules (Section 4.8), all having as their head either Y = 0 or Y = 1. To do this, we need to consider as primitive items not only the level 1 of each variable, but also the complements, for a total of 44 items. Then we obtain results like this:

• GOOD-ACCOUNT → NOT RELIABLE has a support of 39.4% and a confidence of 11.7%.
• BANK BOOK = 1 AND NO PREVIOUS REP AND NO GOOD ACCOUNT → NOT RELIABLE has a support of 1.4% (14/1000) and a confidence of 28.6%.

We can calculate the misclassification rate as an overall performance measure. In each leaf we classify all the observations according to the majority vote, that is, the class with the highest fitted probability. This corresponds to a cut-off threshold of 0.5. The misclassification rate is 0.249, slightly higher than we obtained with the logistic regression model. We now look at a tree model using the CART algorithm and the Gini impurity. For pruning, we calculate the misclassification rate on the whole data set using the penalty parameter α = 1. This can be considered as the default choice, in the absence of other considerations. Table 11.12 shows the chosen tree in the form of if-then rules. A graphical representation can easily be constructed from Table 11.12. Compared with the CHAID tree, this one is rather complex and has 33 terminal nodes. The 33 paths in the model can be interpreted as association rules. The extra complexity has lowered the misclassification rate to 0.212, obtained on the training data set. But this improvement may not justify the increased complexity.
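The support and confidence figures quoted for the tree rules follow directly from the leaf counts, as in this sketch (total sample size 1000, leaf figures taken from Table 11.11):

```python
N_TOTAL = 1000

# rule body -> (cases reaching the leaf, fraction classified 1 = not reliable)
leaves = {
    "GOOD-ACCOUNT = 1": (394, 0.117),
    "BANK-BOOK = 1 AND PREVIOUS-REP = 0 AND GOOD-ACCOUNT = 0": (14, 0.286),
}

for body, (n_leaf, frac_bad) in leaves.items():
    support = n_leaf / N_TOTAL   # how often the rule body occurs
    confidence = frac_bad        # P(not reliable | body)
    print(f"{body}: support {support:.1%}, confidence {confidence:.1%}")
```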
Almost all the explanatory variables are represented in the tree model, except for sex and marital status. This is a remarkable result: there is no difference in reliability by sex or by marital status. It is also interesting to note that all the paths are rather long, with lengths between 4 and 6. We could reduce the complexity of the model by increasing α, but we will leave it at α = 1 so we can compare it with the CHAID tree. Table 11.13 shows a CART model using the entropy impurity and keeping α = 1. This model is also rather complex; it has 34 terminal nodes, one more than the Gini model. The results are also rather similar, but they are not exactly the same. The misclassification rate of the entropy model is 0.211 on the training data set, compared with 0.212 for the Gini model. In Section 4.5 we considered the same tree as in Table 11.13, but we stopped at 4 leaves. On the basis of misclassification rates it seems that the CART models are better than the CHAID model, and the entropy impurity is slightly better than the Gini impurity. But so far we have only compared goodness of fit, not predictive ability.

Table 11.12  Results for the CART classification tree with Gini impurity.
IF FAMILY = 0 AND BANK-BOOK = 1 AND PREVIOUS-REP = 0 AND GOOD-ACCOUNT = 0
    N: 10, 1: 10.0%, 0: 90.0%
IF FAMILY = 1 AND BANK-BOOK = 1 AND PREVIOUS-REP = 0 AND GOOD-ACCOUNT = 0
    N: 4, 1: 75.0%, 0: 25.0%
IF EFFECTS = 0 AND DEADLINE = 1 AND PREVIOUS-REP = 1 AND GOOD-ACCOUNT = 0
    N: 194, 1: 24.7%, 0: 75.3%
IF FAMILY = 0 AND AGE = 1 AND CONCURRENT = 0 AND GOOD-ACCOUNT = 1
    N: 144, 1: 2.8%, 0: 97.2%
IF DEBTS = 0 AND PURPOSE = 0 AND CONCURRENT = 1 AND GOOD-ACCOUNT = 1
    N: 9, 1: 22.2%, 0: 77.8%
IF AGE = 0 AND PURPOSE = 1 AND CONCURRENT = 1 AND GOOD-ACCOUNT = 1
    N: 19, 1: 0.0%, 0: 100.0%
IF AGE = 1 AND HOUSE = 0 AND BANK-BOOK = 0 AND PREVIOUS-REP = 0 AND GOOD-ACCOUNT = 0
    N: 10, 1: 90.0%, 0: 10.0%
IF FOREIGN = 0 AND HOUSE = 1 AND BANK-BOOK = 0 AND PREVIOUS-REP = 0 AND GOOD-ACCOUNT = 0
    N: 1, 1: 0.0%, 0: 100.0%
IF FOREIGN = 1 AND HOUSE = 1 AND BANK-BOOK = 0 AND PREVIOUS-REP = 0 AND GOOD-ACCOUNT = 0
    N: 28, 1: 92.9%, 0: 7.1%
IF MONTHLY-INTERESTS = 0 AND BANK-BOOK = 0 AND DEADLINE = 0 AND PREVIOUS-REP = 1 AND GOOD-ACCOUNT = 0
    N: 119, 1: 61.3%, 0: 38.7%
IF DEBTS = 1 AND EFFECTS = 1 AND DEADLINE = 1 AND PREVIOUS-REP = 1 AND GOOD-ACCOUNT = 0
    N: 30, 1: 26.7%, 0: 73.3%
IF OTHERS = 0 AND WORKING-YEARS = 0 AND AGE = 0 AND CONCURRENT = 0 AND GOOD-ACCOUNT = 1
    N: 31, 1: 22.6%, 0: 77.4%
IF OTHERS = 1 AND WORKING-YEARS = 0 AND AGE = 0 AND CONCURRENT = 0 AND GOOD-ACCOUNT = 1
    N: 2, 1: 100.0%, 0: 0.0%
IF EMPLOYMENT = 1 AND WORKING-YEARS = 1 AND AGE = 0 AND CONCURRENT = 0 AND GOOD-ACCOUNT = 1
    N: 107, 1: 9.3%, 0: 90.7%
IF OTHERS = 0 AND FAMILY = 1 AND AGE = 1 AND CONCURRENT = 0 AND GOOD-ACCOUNT = 1
    N: 34, 1: 5.9%, 0: 94.1%
IF OTHERS = 1 AND FAMILY = 1 AND AGE = 1 AND CONCURRENT = 0 AND GOOD-ACCOUNT = 1
    N: 1, 1: 100.0%, 0: 0.0%
IF OTHERS = 1 AND DEBTS = 1 AND PURPOSE = 0 AND CONCURRENT = 1 AND GOOD-ACCOUNT = 1
    N: 1, 1: 0.0%, 0: 100.0%
IF RESIDENCE = 0 AND AGE = 1 AND PURPOSE = 1 AND CONCURRENT = 1 AND GOOD-ACCOUNT = 1
    N: 3, 1: 66.7%, 0: 33.3%
IF RESIDENCE = 1 AND AGE = 1 AND PURPOSE = 1 AND CONCURRENT = 1 AND GOOD-ACCOUNT = 1
    N: 16, 1: 12.5%, 0: 87.5%
IF LOAN = 0 AND AGE = 0 AND HOUSE = 0 AND BANK-BOOK = 0 AND PREVIOUS-REP = 0 AND GOOD-ACCOUNT = 0
    N: 12, 1: 33.3%, 0: 66.7%
IF LOAN = 1 AND AGE = 0 AND HOUSE = 0 AND BANK-BOOK = 0 AND PREVIOUS-REP = 0 AND GOOD-ACCOUNT = 0
    N: 8, 1: 75.0%, 0: 25.0%
IF TELEPHONE = 0 AND BAD-ACCOUNT = 0 AND BANK-BOOK = 1 AND DEADLINE = 0 AND PREVIOUS-REP = 1 AND GOOD-ACCOUNT = 0
    N: 15, 1: 60.0%, 0: 40.0%
IF TELEPHONE = 1 AND BAD-ACCOUNT = 0 AND BANK-BOOK = 1 AND DEADLINE = 0 AND PREVIOUS-REP = 1 AND GOOD-ACCOUNT = 0
    N: 10, 1: 20.0%, 0: 80.0%
IF OTHERS = 0 AND BAD-ACCOUNT = 1 AND BANK-BOOK = 1 AND DEADLINE = 0 AND PREVIOUS-REP = 1 AND GOOD-ACCOUNT = 0
    N: 26, 1: 11.5%, 0: 88.5%
IF OTHERS = 1 AND BAD-ACCOUNT = 1 AND BANK-BOOK = 1 AND DEADLINE = 0 AND PREVIOUS-REP = 1 AND GOOD-ACCOUNT = 0
    N: 1, 1: 100.0%, 0: 0.0%
IF LOAN = 0 AND MONTHLY-INTERESTS = 1 AND BANK-BOOK = 0 AND DEADLINE = 0 AND PREVIOUS-REP = 1 AND GOOD-ACCOUNT = 0
    N: 61, 1: 41.0%, 0: 59.0%
IF LOAN = 1 AND MONTHLY-INTERESTS = 1 AND BANK-BOOK = 0 AND DEADLINE = 0 AND PREVIOUS-REP = 1 AND GOOD-ACCOUNT = 0
    N: 6, 1: 83.3%, 0: 16.7%
IF LOAN = 1 AND DEBTS = 0 AND EFFECTS = 1 AND DEADLINE = 1 AND PREVIOUS-REP = 1 AND GOOD-ACCOUNT = 0
    N: 42, 1: 52.4%, 0: 47.6%
IF LOAN = 0 AND DEBTS = 0 AND EFFECTS = 1 AND DEADLINE = 1 AND PREVIOUS-REP = 1 AND GOOD-ACCOUNT = 0
    N: 29, 1: 31.0%, 0: 69.0%
IF BANK-BOOK = 0 AND EMPLOYMENT = 0 AND WORKING-YEARS = 1 AND AGE = 0 AND CONCURRENT = 0 AND GOOD-ACCOUNT = 1
    N: 11, 1: 18.2%, 0: 81.8%
IF BANK-BOOK = 1 AND EMPLOYMENT = 0 AND WORKING-YEARS = 1 AND AGE = 0 AND CONCURRENT = 0 AND GOOD-ACCOUNT = 1
    N: 4, 1: 75.0%, 0: 25.0%
IF EMPLOYMENT = 1 AND OTHERS = 0 AND DEBTS = 1 AND PURPOSE = 0 AND CONCURRENT = 1 AND GOOD-ACCOUNT = 1
    N: 11, 1: 81.8%, 0: 18.2%
IF EMPLOYMENT = 0 AND OTHERS = 0 AND DEBTS = 1 AND PURPOSE = 0 AND CONCURRENT = 1 AND GOOD-ACCOUNT = 1
    N: 1, 1: 0.0%, 0: 100.0%

Table 11.13  Results for the CART classification tree with entropy impurity.
IF FAMILY = 0 AND BANK-BOOK = 1 AND PREVIOUS-REP = 0 AND GOOD-ACCOUNT = 0
    N: 10, 1: 10.0%, 0: 90.0%
IF FAMILY = 1 AND BANK-BOOK = 1 AND PREVIOUS-REP = 0 AND GOOD-ACCOUNT = 0
    N: 4, 1: 75.0%, 0: 25.0%
IF EFFECTS = 0 AND DEADLINE = 1 AND PREVIOUS-REP = 1 AND GOOD-ACCOUNT = 0
    N: 194, 1: 24.7%, 0: 75.3%
IF FAMILY = 0 AND AGE = 1 AND CONCURRENT = 0 AND GOOD-ACCOUNT = 1
    N: 144, 1: 2.8%, 0: 97.2%
IF DEBTS = 0 AND PURPOSE = 0 AND CONCURRENT = 1 AND GOOD-ACCOUNT = 1
    N: 9, 1: 22.2%, 0: 77.8%
IF AGE = 0 AND PURPOSE = 1 AND CONCURRENT = 1 AND GOOD-ACCOUNT = 1
    N: 19, 1: 0.0%, 0: 100.0%
IF AGE = 1 AND HOUSE = 0 AND BANK-BOOK = 0 AND PREVIOUS-REP = 0 AND GOOD-ACCOUNT = 0
    N: 10, 1: 90.0%, 0: 10.0%
IF CONCURRENT = 0 AND HOUSE = 1 AND BANK-BOOK = 0 AND PREVIOUS-REP = 0 AND GOOD-ACCOUNT = 0
    N: 18, 1: 100.0%, 0: 0.0%
IF LOAN = 1 AND BANK-BOOK = 0 AND DEADLINE = 0 AND PREVIOUS-REP = 1 AND GOOD-ACCOUNT = 0
    N: 43, 1: 69.8%, 0: 30.2%
IF DEBTS = 1 AND EFFECTS = 1 AND DEADLINE = 1 AND PREVIOUS-REP = 1 AND GOOD-ACCOUNT = 0
    N: 30, 1: 26.7%, 0: 73.3%
IF OTHERS = 0 AND WORKING-YEARS = 0 AND AGE = 0 AND CONCURRENT = 0 AND GOOD-ACCOUNT = 1
    N: 31, 1: 22.6%, 0: 77.4%
IF OTHERS = 1 AND WORKING-YEARS = 0 AND AGE = 0 AND CONCURRENT = 0 AND GOOD-ACCOUNT = 1
    N: 2, 1: 100.0%, 0: 0.0%
IF EMPLOYMENT = 1 AND WORKING-YEARS = 1 AND AGE = 0 AND CONCURRENT = 0 AND GOOD-ACCOUNT = 1
    N: 107, 1: 9.3%, 0: 90.7%
IF OTHERS = 0 AND FAMILY = 1 AND AGE = 1 AND CONCURRENT = 0 AND GOOD-ACCOUNT = 1
    N: 34, 1: 5.9%, 0: 94.1%
IF OTHERS = 1 AND FAMILY = 1 AND AGE = 1 AND CONCURRENT = 0 AND GOOD-ACCOUNT = 1
    N: 1, 1: 100.0%, 0: 0.0%
IF RESIDENCE = 1 AND DEBTS = 1 AND PURPOSE = 0 AND CONCURRENT = 1 AND GOOD-ACCOUNT = 1
    N: 3, 1: 100.0%, 0: 0.0%
IF RESIDENCE = 0 AND AGE = 1 AND PURPOSE = 1 AND CONCURRENT = 1 AND GOOD-ACCOUNT = 1
    N: 3, 1: 66.7%, 0: 33.3%
IF RESIDENCE = 1 AND AGE = 1 AND PURPOSE = 1 AND CONCURRENT = 1 AND GOOD-ACCOUNT = 1
    N: 16, 1: 12.5%, 0: 87.5%
IF LOAN = 0 AND AGE = 0 AND HOUSE = 0 AND BANK-BOOK = 0 AND PREVIOUS-REP = 0 AND GOOD-ACCOUNT = 0
    N: 12, 1: 33.3%, 0: 66.7%
IF LOAN = 1 AND AGE = 0 AND HOUSE = 0 AND BANK-BOOK = 0 AND PREVIOUS-REP = 0 AND GOOD-ACCOUNT = 0
    N: 8, 1: 75.0%, 0: 25.0%
IF WORKING-YEARS = 0 AND CONCURRENT = 1 AND HOUSE = 1 AND BANK-BOOK = 0 AND PREVIOUS-REP = 0 AND GOOD-ACCOUNT = 0
    N: 2, 1: 0.0%, 0: 100.0%
IF WORKING-YEARS = 1 AND CONCURRENT = 1 AND HOUSE = 1 AND BANK-BOOK = 0 AND PREVIOUS-REP = 0 AND GOOD-ACCOUNT = 0
    N: 9, 1: 88.9%, 0: 11.1%
IF TELEPHONE = 0 AND BAD-ACCOUNT = 0 AND BANK-BOOK = 1 AND DEADLINE = 0 AND PREVIOUS-REP = 1 AND GOOD-ACCOUNT = 0
    N: 15, 1: 60.0%, 0: 40.0%
IF TELEPHONE = 1 AND BAD-ACCOUNT = 0 AND BANK-BOOK = 1 AND DEADLINE = 0 AND PREVIOUS-REP = 1 AND GOOD-ACCOUNT = 0
    N: 10, 1: 20.0%, 0: 80.0%
IF OTHERS = 0 AND BAD-ACCOUNT = 1 AND BANK-BOOK = 1 AND DEADLINE = 0 AND PREVIOUS-REP = 1 AND GOOD-ACCOUNT = 0
    N: 26, 1: 11.5%, 0: 88.5%
IF OTHERS = 1 AND BAD-ACCOUNT = 1 AND BANK-BOOK = 1 AND DEADLINE = 0 AND PREVIOUS-REP = 1 AND GOOD-ACCOUNT = 0
    N: 1, 1: 100.0%, 0: 0.0%
IF MONTHLY-INTERESTS = 0 AND LOAN = 0 AND BANK-BOOK = 0 AND DEADLINE = 0 AND PREVIOUS-REP = 1 AND GOOD-ACCOUNT = 0
    N: 82, 1: 58.5%, 0: 41.5%
IF MONTHLY-INTERESTS = 1 AND LOAN = 0 AND BANK-BOOK = 0 AND DEADLINE = 0 AND PREVIOUS-REP = 1 AND GOOD-ACCOUNT = 0
    N: 61, 1: 41.0%, 0: 59.0%
IF LOAN = 1 AND DEBTS = 0 AND EFFECTS = 1 AND DEADLINE = 1 AND PREVIOUS-REP = 1 AND GOOD-ACCOUNT = 0
    N: 42, 1: 52.4%, 0: 47.6%
IF LOAN = 0 AND DEBTS = 0 AND EFFECTS = 1 AND DEADLINE = 1 AND PREVIOUS-REP = 1 AND GOOD-ACCOUNT = 0
    N: 29, 1: 31.0%, 0: 69.0%
IF BANK-BOOK = 0 AND EMPLOYMENT = 0 AND WORKING-YEARS = 1 AND AGE = 0 AND CONCURRENT = 0 AND GOOD-ACCOUNT = 1
    N: 11, 1: 18.2%, 0: 81.8%
IF BANK-BOOK = 1 AND EMPLOYMENT = 0 AND WORKING-YEARS = 1 AND AGE = 0 AND CONCURRENT = 0 AND GOOD-ACCOUNT = 1
    N: 4, 1: 75.0%, 0: 25.0%
IF BANK-BOOK = 0 AND RESIDENCE = 0 AND DEBTS = 1 AND PURPOSE = 0 AND CONCURRENT = 1 AND GOOD-ACCOUNT = 1
    N: 8, 1: 75.0%, 0: 25.0%
IF BANK-BOOK = 1 AND RESIDENCE = 0 AND DEBTS = 1 AND PURPOSE = 0 AND CONCURRENT = 1 AND GOOD-ACCOUNT = 1
    N: 2, 1: 0.0%, 0: 100.0%

11.4.3 Multilayer perceptron models

To specify a multilayer perceptron, we need to decide on its architecture. Given the nature of this problem, we choose a single layer of hidden nodes and we make both activation functions logistic, from the input to the hidden nodes and from the hidden nodes to the output. The output nodes are combined through a softmax function. Following the SAS Enterprise Miner implementation of the multilayer perceptron, we choose a back-propagation estimation algorithm for the weights, with a momentum parameter of 0.1. The error function is binomial, as in Section 4.6. To choose the optimal number of nodes in the hidden layer, we begin with a single node and add nodes one at a time until the misclassification rate stops decreasing. With 3 nodes it is 0.182, with 4 it is 0.141 and with 5 it is 0.148.
This suggests a multilayer perceptron with 4 nodes. Therefore the architecture of our final network contains 22 input nodes, 4 hidden nodes and 1 output node. The corresponding number of weight parameters is 97. Unlike logistic regression and tree models, neural networks are black boxes. There are no interesting structures to see, besides the fitted 0–1 values for each observation, obtained according to the 0.5 threshold rule, from which we derive the misclassification rate. Unlike tree models, the multilayer perceptron can be embedded in a parametric (binomial) framework. This allows us to obtain the model scores, which can then be compared with logistic regression scores. For our final neural network model, we have AIC = 1634.30 and BIC = 2110.35. Both are considerably higher than for the final logistic regression model, indicating that there is room for improvement.

11.5 Model comparison

To help us choose a final model, we extend our performance analysis to include criteria based on loss functions. For all our models we begin by splitting the available data into a training data set, containing 75% of the observations, and a validation data set, containing 25% of the observations. We do this in a stratified way to maintain the proportions 70% reliable and 30% non-reliable in the new data sets. After fitting each model on the training data set, we use it to classify the observations in the validation data set. This classification is reached by producing a score and then using a threshold cut-off to classify those above the threshold as Y = 1 and those below the threshold as Y = 0. Finally, each model is evaluated by assessing the misclassification rate. We begin with the logistic regression model and classification errors for a cut-off threshold of 50% (corresponding to the discriminant rule).
According to this threshold, all the applicants whose estimated probability of non-reliability (Y = 1) is greater than 50% are predicted as non-reliable clients; otherwise they are classified as reliable clients. This model correctly predicts 90.29% of the reliable clients (Y = 0). The probability of committing a type II error is therefore 9.71%; a type II error means taking a reliable client and predicting them as non-reliable. The model is less effective at predicting non-reliable clients; in fact, it predicts only 39.56% of them correctly. The probability of committing a type I error is 60.44%; a type I error means taking a non-reliable client and predicting them as reliable. It seems that the model has greater difficulty in predicting non-reliable clients than reliable ones. This is quite common in credit-scoring problems: the main difficulty of scorecard models is in predicting the bad risks. But we need models that can predict bad risk effectively, because type I errors are usually more costly than type II errors. The previous error rates are obtained for a 50% cut-off, but a lower cut-off might allow us to catch a greater number of bad repayers. A 30% cut-off reduces the type I error to 24.44%, but the type II error rises from 9.71% to 22.80%. The cut-off threshold should be chosen to suit the costs of the type I and type II errors. If the costs are similar, a 50% cut-off will be fine; otherwise, a different threshold may be better. In credit-scoring problems, where the type I error is usually more costly, a cut-off lower than 50% is probably advisable. How much below depends on the cost function. The ROC curve, which shows how the errors change when the threshold varies, can be used for this purpose. Before looking at the ROC curve, we compare the predictive misclassification rates, at 50% cut-off, for the logistic regression model, the classification tree and the neural network.
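The trade-off between the two error types as the cut-off moves can be sketched as follows; the scores and labels here are a small made-up illustration, not the case-study data:

```python
def error_rates(scores, labels, cutoff):
    """Type I error: bad client (label 1) predicted reliable.
       Type II error: good client (label 0) predicted non-reliable."""
    bad  = [s >= cutoff for s, y in zip(scores, labels) if y == 1]
    good = [s >= cutoff for s, y in zip(scores, labels) if y == 0]
    type_one = sum(1 for flagged in bad if not flagged) / len(bad)
    type_two = sum(1 for flagged in good if flagged) / len(good)
    return type_one, type_two

# Hypothetical scores: estimated probability of non-reliability (Y = 1)
scores = [0.9, 0.6, 0.4, 0.35, 0.3, 0.2, 0.1, 0.05]
labels = [1,   1,   1,   0,    1,   0,   0,   0]

for cutoff in (0.5, 0.3):
    t1, t2 = error_rates(scores, labels, cutoff)
    print(f"cutoff {cutoff}: type I {t1:.0%}, type II {t2:.0%}")
```

Even on this toy data, lowering the cut-off from 0.5 to 0.3 reduces the type I error at the price of a higher type II error, mirroring the pattern described in the text.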
It turns out that the tree model has the best performance, with a misclassification rate of 0.244, followed by the multilayer perceptron at 0.248 and the logistic regression model at 0.280. Concerning type I errors, the logistic regression model shows a 60.44% probability against 54.67% for the tree model and 64.79% for the neural network. We now compare the three models in terms of their ROC curves and the Gini index of performance. Moving up along the curve corresponds to lowering the cut-off threshold above which applicants are estimated to be non-reliable. Figure 11.2 shows the ROC curves for our three final models; all are calculated using the same random partitioning of the data. It shows the point for 50% cut-off using the decision tree, which is the best model when using 50% cut-off. The predictive behaviour of the three models is rather similar. The logistic regression model appears slightly inferior to the other two, but not as bad as it appeared on the misclassification rates alone. For a clearer comparison, we calculate the Gini index of performance; the classification tree has the highest value (0.6260), followed by the logistic regression model (0.5798) and then the neural network (0.5738).

Figure 11.2  ROC curves for the final models.

Figure 11.3 is a lift chart. A lift chart gives, for each decile, the percentage of predicted events (here non-reliable applicants). If the model were perfect, this percentage would be 100% for the first three deciles (as this is the proportion of true events) and equal to zero for the other seven deciles. From Figure 11.3 it appears that the models are rather similar for the last seven deciles (with the neural network a bit worse, probably due to overfitting); in the first three deciles, the most critical region for credit scoring, the tree outperforms the logistic regression model, and, although they are very different in nature, the tree and the neural network have a similar performance.
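The Gini index of performance quoted above is commonly related to the area under the ROC curve by Gini = 2·AUC − 1. A small self-contained sketch (toy scores and labels, not the case-study data):

```python
def auc(scores, labels):
    """AUC as the probability that a randomly chosen positive
    (label 1) scores higher than a randomly chosen negative,
    counting ties as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy scores: estimated probability of non-reliability
scores = [0.9, 0.8, 0.7, 0.4, 0.3]
labels = [1,   1,   0,   1,   0]

a = auc(scores, labels)
gini = 2 * a - 1
print(a, gini)
```

On this toy data the AUC is 5/6, giving a Gini index of 1/3 of the way between a random and a perfect classifier on the [−1, 1] scale, i.e. 2/3.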
To summarise, the tree seems to be the best-performing model, but the differences are rather slight.

Figure 11.3  Lift chart for the final models.

We now consider whether a combined model leads to a better classification performance. Given the potential instability of the tree model, we try to improve it using the bagging algorithms in SAS Enterprise Miner. We take 10 random samples for both algorithms. Each sample is split randomly and in a stratified way into a training data set and a validation data set, and the observations in the validation data set are classified according to the majority rule over the 10 classifications from the CART model using entropy impurity. As a result, we obtain a total misclassification rate of 0.224, with a type I error probability of about 48%. This shows a notable improvement over the single-tree model (which had a total misclassification rate of 0.244 and a type I error probability of about 54%). Figure 11.4 shows the ROC curves plus the 50% cut-off point using the combined tree model. The combined model is rather similar to the single-tree model. Indeed the Gini index of performance for the combined tree model is 0.5724, slightly worse than for the single tree. Therefore, if the cut-off threshold is to be chosen, not fixed at 50%, then perhaps it would be better to keep a single tree. But if the cut-off is fixed at 50%, then the combined tree is the better performer.

Figure 11.4  ROC curves for the bagged tree model and the single-tree model.

Table 11.14  Comparison of the bagged model with the three individual models.

Now we use unweighted majority voting to combine the results from the regression model, the tree model and the neural network. Table 11.14 shows the results. Although the combined model is the best one on the training data set, in terms of predictive classification it is outperformed by the tree model, which proves to be the best one.
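The majority-vote combination used in both the bagged tree and the three-model ensemble can be sketched as follows (the votes per observation are illustrative, not the actual bootstrap results):

```python
from collections import Counter

def majority_vote(votes):
    """Combine the class labels predicted by several models
    (e.g. trees fitted on different bootstrap samples)."""
    return Counter(votes).most_common(1)[0][0]

# Hypothetical predictions of 10 bagged trees for 3 applicants
predictions = [
    [1, 1, 0, 1, 1, 0, 1, 1, 1, 0],  # mostly flagged non-reliable
    [0, 0, 0, 1, 0, 0, 0, 0, 1, 0],  # mostly flagged reliable
    [1, 0, 1, 1, 0, 1, 0, 1, 1, 1],
]
bagged = [majority_vote(v) for v in predictions]
print(bagged)  # [1, 0, 1]
```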
However, notice that the difference in performance is very small, no more than 0.04. The type I error probability of the combined model is 56%, worse than the single tree. Figure 11.5 shows the ROC curves for this comparison. Notice that the combined model does better than the tree for low cut-off values, but then the type I error is too high. The Gini index of performance for the combined model is 0.5699, lower than before. Therefore the tree model, which does better at high cut-off values, is to be preferred.

Figure 11.5  ROC curves for the bagged tree model and its component models.

To conclude, the best model for classifying the data set is the single-tree model, or if computational resources allow, the bagged tree model. However, all the final models have a rather similar performance, so it may make sense to choose the most transparent model, namely logistic regression.

11.6 Summary report

• Context: this case study concerns credit scoring. It may also be applied to any situation where the objective is to score the past behaviour of an individual or company in order to plan a future action on the same individual or company in a CRM framework. The score can then be used to evaluate credit reliability, customer loyalty or customer churnability. Furthermore, it can be used to select clients in order to maximise the return on an investment (e.g. clients to receive a promotional campaign, clients to involve in a one-to-one banking relationship, clients to target with a personalised gift).
• Objectives: the aim of the analysis is to build a scoring rule that attaches a numerical value to each client.
• Organisation of the data: the data is all the information available in a bank on each applicant for consumer credit, including individual data and banking behaviour data.
There are 21 categorical variables, one of which is the observed credit reliability, used as a supervised target variable to build a credit-scoring rule able to discriminate reliable debtors from non-reliable debtors. A credit-scoring rule should be able to tell which are the discriminant variables and give their weight in the final score.
• Exploratory data analysis: this phase was conducted using odds ratio analysis, as the available variables were all discrete (actually binarised). The odds ratios suggest which explanatory variables may be discriminant. Two of the original variables were rather confusing, so they were subdivided into new binary variables, giving a total of 22 explanatory variables.
• Model specification: the analysis objective suggested a predictive model, able to find a rule that splits debtors into homogeneous categories and then attaches to each category a score expressed as a probability of reliability. We considered the three types of model that are typically used in credit-scoring problems: logistic regression, classification trees and multilayer perceptrons.
• Model comparison: the models were compared using statistical or scoring-based criteria, such as G2, AIC and BIC, as well as the misclassification rate on the whole data set. There was not enough data to rely on cross-validation alone. The goodness-of-fit comparison showed that neural networks performed best, followed by logistic regression and classification trees. We then considered a cross-validation approach, and compared classification errors on a validation data set. To convert a score into a 0–1 estimate (good or bad debtors) we assumed a threshold of 50%. Then the tree model had the best performance, followed by the multilayer perceptron and then the logistic regression model. However, in terms of type I errors, which are usually more costly in these types of problem, logistic regression outperformed neural networks.
To obtain a result independent of the chosen threshold, we compared the ROC curves and calculated the Gini index of performance. This time the classification tree came out best, confirmed by the lift chart. Given the rather limited amount of data and the potential instability of tree models, we tried to improve our model by bagging; the results for the bagged model were considerably better when a 50% threshold was chosen.
• Model interpretation: on the basis of model comparison, it seems that classification trees, or their bagged version, do the best job for this problem. But logistic regression models are not so inferior on the considered data set, especially if type I errors are emphasised. The choice should also depend on how the results will be used. If decision makers look for hierarchical 'what if' rules, which classify clients into risk class profiles, then classification trees are very good. On the other hand, if they desire analytic rules, which attach an impact weight to each explanatory variable (measured by a regression coefficient or an odds ratio), then logistic regression is better.

CHAPTER 12

Forecasting television audience

12.1 Objectives of the analysis

This case study compares a number of statistical techniques for forecasting television audiences in the Italian market. Programme planners want a forecasting method that will help them devise schedules to maximise audiences. To do this, they need to know which programmes on a channel reach the maximum audience, for a given choice of the competing channels. There are essentially two strategies: counterplanning reacts to competitors' programmes with a programme that is very different, in order to capture the remaining public; and competitive programming reacts to competitors' programmes with programmes of the same type, hoping to attract a higher audience share based on the programme's quality. Television schedules are planned in advance.
The data in this study consists of programme plans for 3–5 month periods, planned at least 4 months in advance. These strategies embody the notion of a programme's value. A programme's value is often related to the perceived impact of an advertisement broadcast in a slot during the programme. The higher a programme's audience, the higher the publicity cost to insert an advertisement in the programme. In this case study we consider audience data from the Italian television market. By magnitude and composition, this is one of the most complex television markets in the world and it therefore represents an important example for other markets too. In the Italian television market there are six leading channels, and several smaller channels; most of these smaller channels are local but some are national. All six leading channels are national, as they reach almost 100% of the country. Three are owned by the state: rai1, rai2, rai3. The other three are owned by a private company called Mediaset: rete4, canale5 (can5), italia1 (ita1). Although the real competition is between Mediaset and the state, each owning three channels, the individual channels have a degree of autonomy, so it makes sense to talk about competition between them all. Data on the television audience, at regional level and national level, is provided in Italy by an institutional company called Auditel. Auditel selects a panel – a stratified sample of viewing families – to represent the different geographic, demographic and sociocultural characteristics of the national population. The selected individuals remain on the panel for a period of about five years; each year about 20% of them are replaced. Each television in the house of a panel family is fitted with an electronic meter.

Applied Data Mining. Paolo Giudici, 2003, John Wiley & Sons, Ltd. ISBNs: 0-470-84679-8 (Paper); 0-470-84678-X (Cloth)
The meter continually records viewing data: whether the television is on and to which channel it is tuned. The panel consists of 5000 families, containing a total of 15 000 individuals, and data is collected from 8000 television meters. It is one of the largest television audience panels in the world. The panel audience for a programme is extrapolated to the whole population using weights that reflect the representation of each population stratum in the panel. The extrapolated data is the database for our case study; we will not examine the extrapolation mechanism. Here are some indicators for measuring the audience of a channel; this case study uses the indicator called share.

• Reach indicates the total number of distinct (unique) viewers that, in a given time interval, have seen the channel for at least one minute.
• Mean audience is the sum of the number of viewers per minute divided by the number of minutes in the considered time interval.
• Total audience is the sum of the mean audience for all the considered channels, in the considered time interval.
• Share is a percentage for each channel given by the ratio of the channel's mean audience to the total audience in the given time interval.

The share of a channel is defined for a given time interval; it is the percentage of people that are watching that channel, averaged over the time interval. Here we will consider the six leading channels and put the rest into a 'residual' channel. This will give us seven channels and seven shares, which sum to 100%. Our objective is to predict the shares of the six leading channels, for a given menu of broadcast programmes. It is difficult to predict a channel's share, because it depends on factors that can be hard to quantify, such as the preferences and habits of the viewers, and the quality of the broadcast programme. Programme quality depends on many factors besides content; two of them are programme history and coverage by other media.
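The mean audience and share indicators defined above reduce to simple arithmetic; a minimal sketch, in which the per-minute viewer counts and channel names are invented for illustration:

```python
def mean_audience(viewers_per_minute):
    """Mean audience: viewers summed over minutes, divided by the number of minutes."""
    return sum(viewers_per_minute) / len(viewers_per_minute)

def shares(channel_means):
    """Share: each channel's mean audience as a percentage of the total audience."""
    total = sum(channel_means)                # total audience in the interval
    return [100.0 * m / total for m in channel_means]

# Hypothetical per-minute viewers (thousands) for 3 channels over 4 minutes
per_minute = {
    "ch1": [10, 12, 11, 9],
    "ch2": [5, 5, 6, 4],
    "ch3": [5, 3, 3, 7],
}
means = {c: mean_audience(v) for c, v in per_minute.items()}
share = shares(list(means.values()))          # sums to 100%
```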
Many viewers enjoy a varied diet of television and show greater loyalty to a channel than to a programme or type of programme. Although this behaviour is gradually declining, it can have a considerable effect on audience share predictions and it does still exist, especially among older viewers. To summarise, our problem is to predict the shares of the six leading channels. We therefore have a predictive data mining problem aimed at predicting a multiple target, consisting of six variables, something we have not covered up to now. This data set is analysed from a Bayesian perspective in Giudici (1998).

12.2 Description of the data

We will use one year of observations from 29 November 1995 to 28 November 1996 for the hours 20:30 to 22:30, known as prime time. For every minute of these hours, the meter records the channel to which the television is tuned. The channel can be one of the six leading channels or channel 7, which contains all the rest. The data matrix consists of 366 multivariate observations; for each prime time during the considered year (a row in the matrix) it shows general and channel-specific information. The general information is the date, the day of the week (Sunday is 1 and Saturday is 7) and the total number of television viewers (in thousands). The channel-specific information is the share of each channel, the title of the broadcast programme and the type of the programme. Table 12.1 shows an extract from the original data matrix; it contains the general information and information specific to channels rai1 and rai2. On Thursday 29 November 1995 there were, on average, 28 916 thousand people watching television (about half the country's population) during prime time: 21.7% were watching rai1, broadcasting the telefilm (TF) Solo per il tuo bene (Only for your love); 14.2% were watching rai2, broadcasting the film (F) Royce. The data matrix contains similar information for the other four channels.
Now we need some more background information. The shares during prime time exhibit high variability. This is because they measure how a large total audience is distributed among the channels, which usually broadcast their better programmes in this period and generally behave according to a competitive programming strategy, instead of the counterplanning strategy they follow during the rest of the day. The high variability means that prime time shares are usually rather difficult to predict with reasonable accuracy. The six leading channels take about 92% of the television audience, so they can be treated as highly representative of the public's choices. The share of the other networks is treated as residual, and this means it is obtained by subtraction:

share_7t = 100 − Σ_{i=1}^{6} share_it

where the index i = 1, ..., 6 refers to the six main channels and the index t = 1, ..., 366 refers to the considered day. The programme titles cannot really be used to obtain explanatory variables of the shares, as in most cases they are unique, so there is no variability on which to build a statistical analysis.

Table 12.1 Extract from the original data matrix.

Table 12.2 Classification of programmes into 14 groups.

F   = Film                         SC  = Sport (football, official league)
TF  = Telefilm                     SCA = Sport (football, friendly)
PV  = Variety programme            SA  = Sport (different from football)
PVS = Special variety programme    TSI = Italian fiction, in episodes
P   = Special programme            TS  = Non-Italian fiction, in episodes
PI  = Information programme        TMI = Italian TV movie
D   = Theatre                      TM  = Non-Italian TV movie

Table 12.3 Classification of the 14 groups into 5 categories.

Film:        F
Tv-Movie:    TF, TS, TSI, TM, TMI
Shows:       PV, PVS, P
Information: PI, D
Sport:       SC, SCA, SA

Furthermore, even when a programme is repeated (e.g. a show), it will never be the same programme, as it will probably have new content. We therefore need to classify programmes into homogeneous categories by programme type.
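Both the residual-share subtraction above and the 14-to-5 grouping of Table 12.3 are simple lookups; a minimal sketch, in which the evening's six shares are invented for illustration:

```python
def residual_share(main_shares):
    """Share of the residual channel 7, obtained by subtraction from 100%."""
    assert len(main_shares) == 6
    return 100.0 - sum(main_shares)

# Table 12.3: the 14 programme groups mapped onto 5 categories
GROUPS = {
    "F": "Film",
    "TF": "Tv-Movie", "TS": "Tv-Movie", "TSI": "Tv-Movie",
    "TM": "Tv-Movie", "TMI": "Tv-Movie",
    "PV": "Shows", "PVS": "Shows", "P": "Shows",
    "PI": "Information", "D": "Information",
    "SC": "Sport", "SCA": "Sport", "SA": "Sport",
}

# Hypothetical shares of the six leading channels on one evening
evening = [21.7, 14.2, 11.0, 8.4, 22.1, 11.7]
share7 = residual_share(evening)   # the seven shares then sum to 100%
```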
This classification is a critical issue, as it affects the quality of the final predictions. Here we have the classification provided by the marketing experts of the Italian private network, who have kindly supplied the data. For each combination of day and channel, the data matrix shows the programme type, as can be seen in Table 12.1. The programme classification puts each programme into one of the 14 groups listed in Table 12.2. Given the somewhat limited amount of available data, having 14 groups will mean there are too many to make accurate predictions. We therefore aggregate them into 5 groups as logically as possible, based on discussions with marketing experts. Table 12.3 shows this new classification and how it relates to the old one. There is a price to pay for this greater simplicity; it may lead to less accurate predictions because the new classes may be more heterogeneous. But this danger is not entirely removed by considering finer groupings, like the original one. We now have 6 new programme type variables, one for each channel. Each of them is a qualitative variable with 5 possible levels. To be parsimonious, we transform each of them into as many binary variables as their levels. This gives a total of 6 × 5 = 30 binary variables, one for each programme type and channel combination. We apply a similar binarisation to the days of the week, obtaining a total of 7 binary variables. We have therefore taken the original 6 + 1 qualitative variables in the data matrix and replaced them with 37 binary variables. Table 12.4 shows their values for the same evenings as in Table 12.1. The share of rai2 is missing for one evening. We delete this item from the matrix as there is no obvious way to estimate it. The total number of observations is now 365.

Table 12.4 Binarisation of the qualitative explanatory variables.

12.3 Exploratory data analysis

Exploratory analysis is essential when tackling a difficult problem like this one.
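The binarisation just described creates one 0/1 variable per level of each qualitative variable. A minimal sketch, with hypothetical programme types for one channel over three evenings:

```python
def binarise(values, levels):
    """One binary (0/1) variable per level of a qualitative variable."""
    return {lvl: [1 if v == lvl else 0 for v in values] for lvl in levels}

levels = ["Film", "Tv-Movie", "Shows", "Information", "Sport"]

# Hypothetical programme type broadcast by one channel on three evenings
prog = ["Film", "Shows", "Film"]
dummies = binarise(prog, levels)   # e.g. dummies["Film"] == [1, 0, 1]
```

Applied to the 6 programme type variables (5 levels each) and the day of the week (7 levels), this yields the 6 × 5 + 7 = 37 binary variables mentioned above.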
We want to predict the shares of the six leading channels, so we examine the distribution of these shares over the year. Table 12.5 shows some simple summary statistics for each channel share.

Table 12.5 Summary statistics on the channel shares.

Variable   N    Mean        Std Dev    Minimum  Maximum
shrrai1    365  23.7915068  7.6157088  9.2      72.5
shrrai2    365  14.7591781  4.7044466  3.0      44.7
shrrai3    365  11.0575342  3.7992668  2.6      40.5
shrrete4   365   8.3783562  2.2547867  2.7      17.5
shrcan5    365  22.1252055  5.9504290  5.3      60.3
shrita1    365  11.7336986  2.9919193  3.1      23.4

Notice that there appear to be two leading channels: rai1 and can5. These can be considered as the leading channels of each of the two networks (public and private); they typically compete for the leadership, and usually they can be considered as 'generalist', aiming at the general public. In the considered year, Table 12.5 indicates that rai1 had a higher overall mean share of 23.79, against 22.13 for can5. The other channels follow, with rai2 (14.76) coming before ita1 (11.73), rai3 (11.06) and rete4 (8.38). Generally, rai2 and rete4 are both targeted at a mature public, whereas rai3 and ita1 are targeted at a younger public; rai3 is more cultural than ita1. The variability of the shares, expressed by the standard deviation, varies considerably, with leading channels having a higher variability. And the wide range of the observations, expressed by the minimum and maximum shares, makes predictions rather difficult. To understand the influence of programme type, we obtain the boxplot of each share, conditional on the programme types (Figure 12.1).

Figure 12.1 Boxplots for the six channels using the five programme types.

There appear to be outlying observations (evenings). These manifest themselves for different channels and different programme types. The shares of the six leading channels add up to a fairly stable figure of about 92%. So when there is an outlier higher than the median, other channels may have an outlier lower than the median. In any case, considering the very erratic nature of television shares, we do not remove any outliers, although removing them would improve the accuracy of the predictions. We are trying to build a model that can predict shares, whatever the menu of available programmes. Figure 12.1 suggests the shares depend on the programme types in different ways. For instance, on rai1 Show programmes and Sport programmes seem to increase shares, with other types near the overall median. For can5 there is a similar behaviour. The other channels are more thematic; they show different effects. Generally, Sport programmes increase the shares; other programmes differ according to where they are broadcast:

• rai2 does better with Sport programmes and Tv-Movie programmes, worse with Information programmes.
• rete4, which is also rather dependent on programme types, does better with Sport programmes and Show programmes, worse with Information programmes.
• rai3, which never broadcasts Tv-Movie programmes in the considered period, does better with Sport programmes and Show programmes, worse with Film programmes.
• ita1, which has a very limited amount of Information programmes, does better with Sport programmes and Tv-Movie programmes, worse with Film programmes and Show programmes.

It is very important to understand the nature of a channel – the programmes which characterise it and create its image. Table 12.6 shows the frequency distribution of the programme type variables, using our five-level classification. The channel images now begin to emerge.
For instance, the distributions of rai1 and can5 are rather similar: rai1 has percentages higher than other channels on the most popular programme types, Show programmes and Sport programmes; can5 does this on Show programmes but is beaten by rai2 on Sport programmes. In the considered period, most football matches were broadcast on the public channels, and this explains the difference between rai1 and can5. The other channels are more specific, as they all differ notably. Differences between the channels can be better understood by comparing them within the two networks, public and private. This appears to show that rai1 and can5 behave as competitive channels, with the others behaving more like counterplanners. For instance, in the public network, rai2 broadcasts a high percentage of Film programmes and Tv-Movie programmes, whereas rai3 broadcasts many Information programmes and Film programmes. In the private network, rete4 broadcasts many Film programmes, and ita1 broadcasts many Film programmes and Tv-Movie programmes. This leads to highly symmetrical planning behaviour between the two networks. Another important explanatory variable is the total audience. Variation in the total audience is usually related to specific groups of people (i.e. those who tend to be at home less often), hence it will be reflected in changes of share.

Table 12.6 Distribution of the programme types by channel.

rai1         Frequency  Percent  Cum. Frequency  Cum. Percent
Film         103        28.22    103              28.22
Information   33         9.04    136              37.26
Show         127        34.79    263              72.05
Sport         33         9.04    296              81.10
Tv-Movie      69        18.90    365             100.00

rai2         Frequency  Percent  Cum. Frequency  Cum. Percent
Film         170        46.58    170              46.58
Information    9         2.47    179              49.04
Show          45        12.33    224              61.37
Sport         23         6.30    247              67.67
Tv-Movie     118        32.33    365             100.00

rai3         Frequency  Percent  Cum. Frequency  Cum. Percent
Film         152        41.64    152              41.64
Information  172        47.12    324              88.77
Show          17         4.66    341              93.42
Sport         24         6.58    365             100.00

rete4        Frequency  Percent  Cum. Frequency  Cum. Percent
Film         251        68.77    251              68.77
Information   29         7.95    280              76.71
Show          23         6.30    303              83.01
Sport          5         1.37    308              84.38
Tv-Movie      57        15.62    365             100.00

can5         Frequency  Percent  Cum. Frequency  Cum. Percent
Film         117        32.05    117              32.05
Information   17         4.66    134              36.71
Show         144        39.45    278              76.16
Sport         19         5.21    297              81.37
Tv-Movie      68        18.63    365             100.00

ita1         Frequency  Percent  Cum. Frequency  Cum. Percent
Film         177        48.49    177              48.49
Information    1         0.27    178              48.77
Show          27         7.40    205              56.16
Sport          9         2.47    214              58.63
Tv-Movie     151        41.37    365             100.00

Figure 12.2 Distribution of the total audience.

Figure 12.2 shows that the distribution of the total audience is rather asymmetric to the right; it is also extremely variable. Therefore it makes sense to include the total audience as an explanatory variable, and we standardise it to help with interpretation. Other variables can be derived from the data matrix. First of all, each channel share depends not only on its own programme, but also on the programmes broadcast by the other channels. This effect is generally less marked, but we consider it in Section 12.4. Another relevant explanatory variable is the day of the week.
Its informative content may be included in the programme types (which usually entail a specific day) and the total audience, but there may also be other effects due to the day of the week. Other temporal variables may also be included, such as the month and the season. After discussion with marketing experts and some practical experimentation, we decide not to consider them. To build a statistical model, it is often important to understand which probability model can be used to describe the distribution of the response variable. Here the response variable is a 6-variate vector of shares. The shares are continuous quantitative variables but they have a limited range. The Gaussian distribution is typically used to model multivariate continuous variables, yet it may not be entirely appropriate here. This is confirmed by a closer examination of the distribution of the shares. Figure 12.3 shows the distribution of the shares of can5, along with the Gaussian approximation and the kernel density estimate, based on a Gaussian kernel (Section 5.2).

Figure 12.3 Histogram, Gaussian and kernel density estimated curves for shrcan5.

The Gaussian curve is distant from the kernel density estimator. But given the complexity of the analysis, it may be prudent to use a linear model, as linear models often produce explicit results that are easy to interpret. Since a linear model is based on the Gaussian assumption, we can try to meet this assumption by transforming the response vector. When we include channel 7, others, the shares sum to 100%. They are proportions, and proportions can be transformed to a multivariate Gaussian distribution by using the following logistic transformation. Assume there exist some explanatory variables that allow us to determine the structure of the parameter π_it, which expresses the aggregated probability that each subject chooses the ith channel on the tth evening.
This can be seen as a model for individual preferences. The distribution of a binary variable is defined by only two probabilities, π and 1 − π. To describe a logistic transformation for a binary variable, we define a logit

θ = log(π / (1 − π))

Suppose the variables are n-ary rather than binary, as in this case, where for each evening t there are seven probabilities of channel choice, which sum to 1. Then a possible logistic transformation is

θ_i = log(π_i / π_7)   for i = 1, ..., 6

One of the seven probabilities (here π_7, corresponding to others) is chosen as a comparison term and is therefore omitted from the analysis. We now use this transformation on the observed shares. The response variable is defined by the following transformation:

y_it = log(share_it / share_7t)   for i = 1, ..., 6 and t = 1, ..., 365

where y_it represents the logit of share_it (the share relative to the ith channel and to the tth evening). It can be shown (e.g. Johnson and Wichern, 1982) that the previous transformation has an approximately multivariate Gaussian distribution for the 6-dimensional vector of the logit shares so defined. Table 12.7 shows the values of these logit shares for the six evenings in Table 12.1. It contains a logit share equal to 0. This corresponds to a share equal to the channel 7 share for that evening (up to the second decimal place).

Table 12.7 Logit shares for the evenings in Table 12.1.

Figure 12.4 Histogram, parametric (red) and kernel density (purple) estimated curves for logitshrcan5.

Figure 12.4 shows the distribution of the logit shares for can5 obtained using this transformation. It exhibits a rather good approximation of the Gaussian curve, which is now very close to the kernel density estimate. The goodness of the Gaussian approximation for the logit shares can be deduced in other ways; for instance, the qq-plot after the transformation gives a remarkable goodness of fit.
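The logit-share transformation, and the inverse that maps logits back into shares, can be sketched as follows; the evening's six shares are invented for illustration:

```python
import math

def logit_shares(shares):
    """y_i = log(share_i / share_7), with share_7 obtained by subtraction."""
    share7 = 100.0 - sum(shares)
    return [math.log(s / share7) for s in shares]

def inverse_logit(y):
    """Back-transform logits into shares (softmax-like, channel 7 as baseline)."""
    denom = 1.0 + sum(math.exp(v) for v in y)
    return [100.0 * math.exp(v) / denom for v in y]

# Hypothetical shares of the six leading channels on one evening
shares = [21.7, 14.2, 11.0, 8.4, 22.1, 11.7]
y = logit_shares(shares)
back = inverse_logit(y)     # recovers the original shares up to rounding
```

The round trip is exact up to floating point, since both transformations use the same baseline share_7 = 100 − Σ share_i.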
Figure 12.5 compares the qq-plot for the share of can5 with the qq-plot for the logit share of can5. Logit shares help us with the statistical analysis but then they can be converted back into shares by using the following inverse transformation:

share_it = 100 exp(y_it) / (1 + Σ_{j=1}^{6} exp(y_jt))   for i = 1, ..., 6

which resembles the softmax activation function used in neural network modelling (Section 4.6).

Figure 12.5 The qq-plots for (a) shrcan5 and (b) logitshrcan5.

Figure 12.6 Scatterplot of the logit shares.

Table 12.8 Correlation matrix of the logit shares.

Table 12.9 Partial correlation matrix of the logit shares.

        logit1  logit2  logit3  logit4  logit5  logit6
logit1  1       0.025   0.126   0.096   0.177   0.109
logit2          1       0.257   0.146   0.172   0.177
logit3                  1       0.181   0.052   0.145
logit4                          1       0.058   0.208
logit5                                  1       0.255
logit6                                          1

On the basis of the available logit shares, Section 12.4 considers a number of predictive models. Often, for brevity, a logit share will be indicated with l_i, with i = 1, ..., 6. First we consider the correlation structure among the logit shares, which helps us to understand the structure of the Italian television market. Figure 12.6 shows the scatterplot of the observed logit shares, and for each logit share variable it also shows the minimum and maximum observations. As expected, there is a degree of correlation between the logit shares. To reach a firmer conclusion, Table 12.8 shows the correlation coefficients between logit shares. There are relevant correlations between logit shares, not only for the minor channels (l2, l3, l4, l6), as can be expected, but also for the can5 logit share (l5). Table 12.9 shows the partial correlation matrix, which leads us to more accurate results on the interdependency structure between the logit shares (and in turn between the shares, as the channel 7 share is quite stable).
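Partial correlations like those in Table 12.9 can be obtained from the inverse of the correlation matrix: if P = R⁻¹, then partial_ij = −P_ij / √(P_ii · P_jj). A sketch using NumPy on a small invented matrix (not the book's Table 12.8):

```python
import numpy as np

def partial_correlations(R):
    """Partial correlation matrix from a correlation matrix R.

    partial_ij = -P_ij / sqrt(P_ii * P_jj), where P = inverse of R.
    """
    P = np.linalg.inv(np.asarray(R, dtype=float))
    d = np.sqrt(np.diag(P))
    part = -P / np.outer(d, d)
    np.fill_diagonal(part, 1.0)
    return part

# Small illustrative correlation matrix for three variables
R = [[1.0, 0.5, 0.3],
     [0.5, 1.0, 0.4],
     [0.3, 0.4, 1.0]]
PC = partial_correlations(R)
```

In the two-variable case the partial correlation reduces to the ordinary correlation, a convenient sanity check for the formula.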
The strongest partial correlations occur within each of the two networks: between can5 and ita1 (0.255), between rai2 and rai3 (0.257) and between rete4 and ita1 (0.208). Between networks, the strongest competition is between the two leading channels, rai1 and can5 (0.177), and also between rai2 and rete4 (0.146), rai2 and can5 (0.172), rai2 and ita1 (0.177), rai3 and rete4 (0.181), and rai3 and ita1 (0.145). These results confirm that the two leading networks compete with each other, with can5 also competing with others, in particular with rai2 and ita1. The other channels are quite interdependent and cannot simply be paired across networks; in other words, they are more exposed to the competition.

12.4 Model building

We consider two types of models that differ in their aims. First we consider a single-target model, in which the logit share (hence the share) of one channel is to be predicted, on the basis of all programme types, the day of the week and the total audience. We choose to predict can5 as it is more subject to competition. We then build a multiple-target predictive model, in which all logit shares are predicted using all the explanatory variables. So we can compare models of different kinds, we use cross-validation from the outset. We randomly partition the data set into a training data set (75% of the data) and a validation data set (25% of the observations). In this section we show, for each class of models, how to obtain a good representative model that minimises the validation error, then we compare the representative models in Section 12.5. Our initial aim is to predict the logit share of can5, on the basis of the 30 binary explanatory variables that describe programme types, the 7 binary variables that describe the day of the week, and the continuous variable that describes the audience. We begin with a linear model.
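The 75/25 random partition just described can be sketched as follows; the seed and function name are illustrative, not part of the case study. Note that 75% of the 365 evenings gives 274 training observations.

```python
import random

def train_validation_split(rows, train_frac=0.75, seed=42):
    """Randomly partition observations into training and validation sets."""
    rng = random.Random(seed)             # fixed seed for reproducibility
    idx = list(range(len(rows)))
    rng.shuffle(idx)
    cut = int(round(train_frac * len(rows)))
    train = [rows[i] for i in idx[:cut]]
    valid = [rows[i] for i in idx[cut:]]
    return train, valid

rows = list(range(365))                   # one row per evening
train, valid = train_validation_split(rows)
```

(A stratified split, as used in the credit-scoring case study, would shuffle within strata instead of over the whole index.)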
In linear models, to avoid linear dependency problems, we omit one binary variable per channel and one for the days of the week. To select a parsimonious linear model, we follow a stepwise model selection procedure, based on pairwise model comparison F tests (Section 5.3). We choose a significance level of 0.05. Table 12.10 summarises the stepwise procedure. The first variable that enters the model is the standardised audience. The correlation between the two is 0.3179, rather high for our data. The other significant variables inserted in the final model include whether Show programmes and Sport programmes are on can5, and variables related to programme types on other channels, in particular, whether ita1 is broadcasting a Film programme or a Tv-Movie programme. It also matters whether or not rete4 broadcasts a Film programme or an Information programme, whether or not rai1 is broadcasting an Information programme, and whether or not it is a Thursday.

Table 12.10 Results from the stepwise linear model selection procedure.

Step  Effect entered  DF  Number in  F        Prob > F
1     std(audience)   1   1          31.1124  <.0001
2     show5           1   2          29.2278  <.0001
3     film6           1   3          20.4956  <.0001
4     sport5          1   4          17.8170  <.0001
5     film4           1   5           7.0417  0.0084
6     information1    1   6           8.8123  0.0033
7     information4    1   7           4.4485  0.0359
8     thursday        1   8           5.2413  0.0228
9     tv-movie6       1   9           4.1858  0.0418
10    show3           1   10          5.4349  0.0205

Table 12.11 Parameter estimates with the linear model.

Analysis of Parameter Estimates
Parameter     DF  Estimate  Standard Error  t Value  Pr > |t|
Intercept     1    0.5582   0.0696           8.03    <.0001
AUDI-BV1      1    0.1481   0.0186           7.97    <.0001
film4         1    0.1621   0.0426           3.81    0.0002
film6         1    0.3047   0.0629           4.85    <.0001
information1  1    0.2017   0.0616           3.27    0.0012
information4  1    0.2032   0.0753           2.70    0.0074
show3         1   -0.1972   0.0846          -2.33    0.0205
show5         1    0.2249   0.0387           5.81    <.0001
sport5        1    0.3562   0.0780           4.57    <.0001
thursday      1   -0.1367   0.0543          -2.52    0.0125
tv-movie6     1    0.1498   0.0638           2.35    0.0196

To interpret these results, we look at the estimated linear coefficients in Table 12.11. It turns out that the logit share of can5 depends positively (in order of magnitude) on sport5, film6, show5, information4, information1, film4, tv-movie6 and audience. It depends negatively on show3 and thursday. In other words, the estimated share of can5 is expected to increase when can5 broadcasts Sport programmes or Show programmes; when ita1 broadcasts Film programmes or Tv-Movie programmes; when rai1 broadcasts Information programmes; when rete4 broadcasts Information programmes or Film programmes. Furthermore, the logit shares increase with the audience, by about 1.15 share points. Conversely, the estimated shares are expected to decrease when it is Thursday, or when rai3 broadcasts a Show programme. Overall this coincides with Figure 12.1: can5 takes most of its audience from Sport programmes and Show programmes, and increases it when others broadcast Film programmes or Information programmes, especially the main competitors, rai1 and ita1. The estimated intercept is quite high, and this corresponds to a strong effect of loyalty to the channel. Next we fit a regression tree to the training data. We choose a CART tree using the variance as a measure of impurity and the mean square error in the validation set as the pruning criterion.
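CART's variance impurity chooses, at each node, the split that minimises the weighted variance of the child nodes. A single-split sketch on invented (audience, logit share) data; the full recursive tree growth and the validation-MSE pruning are omitted:

```python
def variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def best_split(x, y):
    """Best threshold on one predictor by variance (impurity) reduction."""
    best = None
    for t in sorted(set(x))[1:]:              # candidate thresholds
        left = [yi for xi, yi in zip(x, y) if xi < t]
        right = [yi for xi, yi in zip(x, y) if xi >= t]
        score = (len(left) * variance(left) + len(right) * variance(right)) / len(y)
        if best is None or score < best[1]:
            best = (t, score)
    return best                               # (threshold, weighted child variance)

# Hypothetical training data: standardised audience vs logit share of can5
x_train = [-1.2, -0.8, -0.3, 0.1, 0.6, 1.1]
y_train = [0.70, 0.72, 0.80, 1.05, 1.10, 1.15]
t, _ = best_split(x_train, y_train)
left = [yi for xi, yi in zip(x_train, y_train) if xi < t]
right = [yi for xi, yi in zip(x_train, y_train) if xi >= t]
left_mean = sum(left) / len(left)             # node prediction, left child
right_mean = sum(right) / len(right)          # node prediction, right child
```

Each leaf predicts the mean logit share of its observations, which is exactly what the node statistics (AVE) in Table 12.12 report.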
A more parsimonious model is a regression tree similar to the CHAID tree but using the F test for model splitting, instead of the chi-squared test. But the results were poorer in terms of mean square error on the validation data set and they are not reported here. Figure 12.7 shows the mean square error of the CART regression tree as the number of leaves increases. The optimal number of leaves is reached when the mean square error is a minimum on the validation data set; it occurs at 22.

Figure 12.7 Behaviour of MSE and choice of tree configuration.

The top node of the tree specifies that the mean logit share of the training data (274 observations) is equal to about 0.99. This corresponds to a mean share of 8e^0.99 ≈ 21.52, according to the definition of the logit share, and assuming a mean share for the residual channels equal to 8%. Similarly for the validation data set. The tree is first split according to whether the standardised audience is essentially positive (right node) or negative (left node). In the positive case, the resulting node has a higher mean share, about 8e^1.08 ≈ 23.55; in the negative case it has a lower mean share, about 8e^0.82 ≈ 18.16. A second split is done, for both nodes, according to whether or not can5 is broadcasting a Show programme. The worst-case scenario is described by the leftmost node (no show and low audience), in which the expected share is about 8e^0.69 ≈ 15.95. So far the tree results essentially agree with those from the linear model. The whole tree, which completely describes the 22 rules leading to the terminal nodes, is given in Table 12.12. For each rule, it shows the number of observations (total) classified to be there, as well as the mean and standard deviation of the logit share for the observations in the node. It is difficult to compare the signs of the effects with the signs in the linear model, as the two models are rather different in structure.
We can draw conclusions by comparing the discriminant variables. Most of the variables detected with the linear model also appear in the tree model, yet in a hierarchical form. This is the case for audience, show5, sport5, information1, film6 and tv movie6. Other variables, such as film4, information4, show3 and thursday, do not appear here. Other effects related to the same competing channels do appear: sport3, information3, film3, show4 and sport4, as well as film2 and film1. Finally, friday, saturday and sunday appear. The overall structure of the tree seems more compatible with the image of can5 and with those of its correlated competitors (according to the correlation structure in Section 12.3).

Table 12.12 Classification rules deduced from the CART regression tree.

IF SHOW5 EQUALS 0 AND standardize(AUDIENCE) < -0.059023893
THEN N: 44, AVE: 0.69386, SD: 0.25358

IF TV-MOVIE6 EQUALS 0 AND SHOW5 EQUALS 1 AND standardize(AUDIENCE) < -0.059023893
THEN N: 25, AVE: 1.02851, SD: 0.29171

IF SATURDAY EQUALS 0 AND SHOW5 EQUALS 1 AND -0.059023893 <= standardize(AUDIENCE)
THEN N: 43, AVE: 1.14185, SD: 0.29216

IF INFORMATION1 EQUALS 1 AND -0.059023893 <= standardize(AUDIENCE) < 0.6803605603 AND SHOW5 EQUALS 0
THEN NODE: 25, AVE: 1.43764, SD: 0.23178

IF SPORT3 EQUALS 1 AND SATURDAY EQUALS 1 AND SHOW5 EQUALS 1 AND -0.059023893 <= standardize(AUDIENCE)
THEN N: 1, AVE: 0.82832, SD: 0

IF standardize(AUDIENCE) < -0.07021096 AND INFORMATION3 EQUALS 1 AND TV-MOVIE6 EQUALS 1 AND SHOW5 EQUALS 1
THEN N: 7, AVE: 0.655, SD: 0.10062

IF -0.07021096 <= standardize(AUDIENCE) < -0.059023893 AND INFORMATION3 EQUALS 1 AND TV-MOVIE6 EQUALS 1 AND SHOW5 EQUALS 1
THEN N: 1, AVE: 0.82146, SD: 0

IF SHOW4 EQUALS 1 AND INFORMATION1 EQUALS 0 AND -0.059023893 <= standardize(AUDIENCE) < 0.6803605603 AND SHOW5 EQUALS 0
THEN N: 8, AVE: 0.55035, SD: 0.26191

IF FRIDAY EQUALS 0 AND SPORT5 EQUALS 0 AND 0.6803605603 <= standardize(AUDIENCE) AND SHOW5 EQUALS 0
THEN N: 61, AVE: 1.09677, SD: 0.23145

IF FRIDAY EQUALS 1 AND SPORT5 EQUALS 0 AND 0.6803605603 <= standardize(AUDIENCE) AND SHOW5 EQUALS 0
THEN N: 3, AVE: 1.38847, SD: 0.15933

IF 0.6803605603 <= standardize(AUDIENCE) < 0.9634971743 AND SPORT5 EQUALS 1 AND SHOW5 EQUALS 0
THEN N: 5, AVE: 1.09189, SD: 0.28865

IF FILM3 EQUALS 1 AND SPORT3 EQUALS 0 AND SATURDAY EQUALS 1 AND SHOW5 EQUALS 1 AND -0.059023893 <= standardize(AUDIENCE)
THEN N: 4, AVE: 1.41927, SD: 0.16031

IF SPORT4 EQUALS 0 AND FILM2 EQUALS 0 AND INFORMATION3 EQUALS 0 AND TV-MOVIE6 EQUALS 1 AND SHOW5 EQUALS 1 AND standardize(AUDIENCE) < -0.059023893
THEN N: 9, AVE: 0.80977, SD: 0.06078

IF SPORT4 EQUALS 1 AND FILM2 EQUALS 0 AND INFORMATION3 EQUALS 0 AND TV-MOVIE6 EQUALS 1 AND SHOW5 EQUALS 1 AND standardize(AUDIENCE) < -0.059023893
THEN N: 1, AVE: 1.17222, SD: 0

IF SUNDAY EQUALS 0 AND FILM2 EQUALS 1 AND INFORMATION3 EQUALS 0 AND TV-MOVIE6 EQUALS 1 AND SHOW5 EQUALS 1 AND standardize(AUDIENCE) < -0.059023893
THEN N: 5, AVE: 0.96288, SD: 0.1526

IF SUNDAY EQUALS 1 AND FILM2 EQUALS 1 AND INFORMATION3 EQUALS 0 AND TV-MOVIE6 EQUALS 1 AND SHOW5 EQUALS 1 AND standardize(AUDIENCE) < -0.059023893
THEN N: 2, AVE: 1.24541, SD: 0.21291

IF FILM6 EQUALS 0 AND SHOW4 EQUALS 0 AND INFORMATION1 EQUALS 0 AND -0.059023893 <= standardize(AUDIENCE) < 0.6803605603 AND SHOW5 EQUALS 0
THEN N: 23, AVE: 0.72516, SD: 0.27712

IF FILM6 EQUALS 1 AND SHOW4 EQUALS 0 AND INFORMATION1 EQUALS 0 AND -0.059023893 <= standardize(AUDIENCE) < 0.6803605603 AND SHOW5 EQUALS 0
THEN N: 17, AVE: 0.95977, SD: 0.23193

IF 0.9634971743 <= standardize(AUDIENCE) < 1.1544846418 AND SPORT5 EQUALS 1 AND SHOW5 EQUALS 0
THEN N: 2, AVE: 1.71799, SD: 0.13672

IF 1.1544846418 <= standardize(AUDIENCE) AND SPORT5 EQUALS 1 AND SHOW5 EQUALS 0
THEN N: 2, AVE: 2.05309, SD: 0.30578

IF FILM1 EQUALS 0 AND FILM3 EQUALS 0 AND SPORT3 EQUALS 0 AND SATURDAY EQUALS 1 AND SHOW5 EQUALS 1 AND -0.059023893 <= standardize(AUDIENCE)
THEN N: 5, AVE: 1.6396, SD: 0.0795

IF FILM1 EQUALS 1 AND FILM3 EQUALS 0 AND SPORT3 EQUALS 0 AND SATURDAY EQUALS 1 AND SHOW5 EQUALS 1 AND -0.059023893 <= standardize(AUDIENCE)
THEN N: 2, AVE: 1.39445, SD: 0.00816

Finally, we come to feedforward neural networks. We design a multilayer perceptron with one hidden layer. After some experimentation we choose two hidden nodes, which can be deemed to represent the public and the private television networks. We choose linear combination functions and linear activation functions for the two hidden nodes, and train the network using mean square error on the validation data set. We then try an RBF network, to capture the neighbourhood topology among the evenings. We choose one hidden node and make the combination function from the input nodes to the hidden node a (Gaussian) radial basis function, with equal widths and equal heights; the activation functions are the identity function. The two neural network models perform rather similarly; the multilayer perceptron is slightly better, but the RBF network is more parsimonious.

Table 12.13 shows the weights calculated by the multilayer perceptron. There are 72 input-to-hidden weights, even though there are 38 explanatory variables (the audience plus 37 binary variables). This occurs because rai3 broadcasts no Tv-Movie programmes, so the corresponding binary variable is dropped; similarly, ita1 broadcasts Information only once, so its variable is also dropped. There are also two intercept parameters (biases), one for each hidden node, as well as one for the output node, plus the two weights from the hidden nodes to the output node. Unlike with linear models, there is no need to drop further binary variables.
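A network of this shape can be sketched as follows. This is an illustrative reconstruction with scikit-learn on synthetic data, not the book's SAS Enterprise Miner fit: one hidden layer, two hidden nodes, identity (linear) activations, and 36 inputs after the two dropped dummies, giving the 77 weights and biases of Table 12.13.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
# 36 inputs: the audience plus 35 retained programme dummies (two dropped)
X = np.column_stack([rng.normal(size=274), rng.integers(0, 2, size=(274, 35))])
y = 0.99 + 0.15 * X[:, 0] + rng.normal(scale=0.2, size=274)  # synthetic logit shares

mlp = MLPRegressor(hidden_layer_sizes=(2,), activation="identity",
                   solver="lbfgs", max_iter=5000, random_state=0).fit(X, y)

# 36*2 input-to-hidden weights + 2 hidden-to-output weights
# + 2 hidden biases + 1 output bias = 77 parameters, as in Table 12.13
n_params = sum(w.size for w in mlp.coefs_) + sum(b.size for b in mlp.intercepts_)
print(n_params)  # → 77
```

With identity activations the network is an overparameterised linear map, which is consistent with the qualitative (sign-based) reading of the weights given below.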
On the other hand, there is no simple formal model selection procedure, nor a hypothesis testing procedure on the weights, so we can only make qualitative comments about the signs and magnitudes of the weights. Notice that the estimated weights from the hidden nodes H11 and H12 to the output node are small but positive. It therefore makes sense to interpret the effect of the explanatory variables by looking at the signs of the coefficients from the input nodes to the hidden nodes. Also notice that the bias terms have rather high values, and this corresponds to the channel loyalty effect seen for the linear model. Most of the signs in the linear model match the signs in this model.

Table 12.13 Weights of the multilayer perceptron network.

     From            To       Weight
 1   AUDI-BV1        H11       1.1740347772
 2   AUDI-BV1        H12       0.123134213
 3   FILM11          H11       0.2168710734
 4   FILM21          H11      -0.003614915
 5   FILM31          H11       0.1618147995
 6   FILM41          H11       0.3251317677
 7   FILM51          H11      -0.900685697
 8   FILM61          H11       0.9334017087
 9   FRIDAY1         H11      -0.091669461
10   INFORMATION11   H11       0.4018886565
11   INFORMATION21   H11       0.101447949
12   INFORMATION31   H11       0.2979551735
13   INFORMATION41   H11       0.4086658636
14   INFORMATION51   H11       0.0348085575
15   MONDAY1         H11       0.2211611959
16   SATURDAY1       H11       0.4637795148
17   SHOW11          H11      -0.178213926
18   SHOW21          H11       0.1589792454
19   SHOW31          H11      -0.420967523
20   SHOW41          H11      -0.292000262
21   SHOW51          H11       0.1495358451
22   SHOW61          H11      -0.302074523
23   SPORT11         H11      -0.367628308
24   SPORT21         H11      -0.088506097
25   SPORT31         H11       0.468554604
26   SPORT41         H11       0.3149658113
27   SPORT51         H11       1.0365237425
28   SPORT61         H11      -0.861727181
29   SUNDAY1         H11       0.0581028079
30   THURSDAY1       H11      -0.425886049
31   TUESDAY1        H11       0.2672841044
32   TV-MOVIE11      H11      -0.290958077
33   TV-MOVIE21      H11      -0.180243694
34   TV-MOVIE41      H11      -0.308903067
35   TV-MOVIE51      H11      -0.474061343
36   TV-MOVIE61      H11       0.1132854614
37   WEDNESDAY1      H11      -0.632832395
38   FILM11          H12      -0.055731218
39   FILM21          H12      -0.083121848
40   FILM31          H12      -0.073921865
41   FILM41          H12      -0.144347594
42   FILM51          H12       0.0200435139
43   FILM61          H12      -0.0349771
44   FRIDAY1         H12       0.2470090922
45   INFORMATION11   H12       0.1106999195
46   INFORMATION21   H12      -0.198466293
47   INFORMATION31   H12      -0.03354677
48   INFORMATION41   H12       0.4014440409
49   INFORMATION51   H12       0.0197488302
50   MONDAY1         H12      -0.226134652
51   SATURDAY1       H12      -0.246060203
52   SHOW11          H12       0.0688079194
53   SHOW21          H12      -0.067446329
54   SHOW31          H12      -0.472039866
55   SHOW41          H12       0.0018527931
56   SHOW51          H12       0.0462647232
57   SHOW61          H12      -0.220152292
58   SPORT11         H12      -0.220443076
59   SPORT21         H12      -0.051749183
60   SPORT31         H12       0.2625010151
61   SPORT41         H12      -0.035962957
62   SPORT51         H12       0.2103932874
63   SPORT61         H12       0.1457104861
64   SUNDAY1         H12       0.1676716269
65   THURSDAY1       H12      -0.169547437
66   TUESDAY1        H12       0.064402756
67   TV-MOVIE11      H12       0.1610823216
68   TV-MOVIE21      H12      -0.164204013
69   TV-MOVIE41      H12       0.220171918
70   TV-MOVIE51      H12      -0.138515633
71   TV-MOVIE61      H12       0.0168145323
72   WEDNESDAY1      H12      -0.137997477
73   BIAS            H11      -0.099578697
74   BIAS            H12       1.0532654311
75   H11             LOGIT5    0.0661926904
76   H12             LOGIT5    0.0489219573
77   BIAS            LOGIT5    0.864583892

We finally fit a nearest-neighbour model. For K, the number of neighbours to consider in the predictions, we take different multiples of 7, a week in our data set; this is because programmes repeat in weekly schedules. The best configuration, in terms of mean square error, occurs for K = 28, almost a month.

In principle, all the previous models can be extended to the multiple-target case, but most general-purpose data mining packages, such as SAS Enterprise Miner, do not support tree models or nearest-neighbour models with multiple targets. The next section compares the following three models for predicting the six response variables:

• A linear model with 37 + 1 explanatory variables (the binary programme variables plus the audience), selected through stepwise model selection with a significance level of 0.05.
• A multilayer perceptron network with 37 + 1 input variables, one hidden layer with two hidden nodes, linear combination functions and linear activation functions.
• An RBF network with 37 + 1 input variables, one hidden node, and a radial basis combination function from the input nodes to the hidden node, with equal widths and heights.

12.5 Model comparison

We begin with the single-target predictive problem. The models in the previous section are compared using two main measures of performance. The first is the mean square error (MSE) of the predictions, namely the mean of the squared differences between the observed and the predicted logit shares. We compute it on both the training data set and the validation data set. The validation data set is obviously the more important, but the training data set gives an assessment of the model's goodness of fit. The second is the correlation coefficient between the observed and predicted quantities, which measures errors in a slightly different way.

Table 12.14 compares the models of the last section when can5 is the response target variable. In terms of MSE over the training data set, the CART regression tree comes first, followed by linear regression and the two network models. The nearest-neighbour model is clearly worse than the others on goodness of fit. In terms of MSE for predictions, the CART tree is again the best model, followed by the two network models and the linear regression model; the nearest-neighbour model is again the worst. The differences between the models are indeed slight, apart from the nearest-neighbour model. For this data set, the neighbouring structure over the input variables of the different evenings is not of great help. This slightly affects the RBF network too, which is slightly inferior to the multilayer perceptron. The networks do better than the linear regression model, probably because they have been trained accurately, by minimising MSE over the validation data set; it would be sensible to see how they perform comparatively on a test data set, if sufficient data is available. Comparison of the correlation coefficients over the validation data set reflects the previous ranking of the models quite well. Figure 12.8 is a scatterplot comparing the observed and predicted logit shares for the best model, namely the CART tree.

Table 12.14 Summary of the univariate predictive models.

Model               MSE training data set   MSE validation data set   Correlation observed/predicted
Linear regression          0.078                    0.116                      0.4525
CART tree                  0.060                    0.104                      0.5066
RBF network                0.083                    0.108                      0.4889
MLP network                0.083                    0.105                      0.4933
Nearest neighbour          0.105                    0.123                      0.3260

Figure 12.8 Observed versus predicted scatterplot for the CART tree (correlation = 0.5066).

We now compare models for multivariate target prediction. We compare MSEs in terms of shares, not logit shares. The two MSEs are strictly related; however, there is variability in the relationship due to the volatility of the residual shares, and this effect is stronger in the multivariate case. That is why we refer directly to shares, which we obtain by applying the inverse logistic transformation to the predicted logit shares. Table 12.15 is a summary comparison of the MSEs of the predicted shares, on the validation data set only, for each of the six channels, as well as the overall MSE obtained by averaging all squared errors. At the overall level, the linear model performs best, followed by the MLP network and finally the RBF network. Differences appear more marked on a share scale than on a logit share scale.
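The two comparison measures, and the logit-to-share conversion behind Table 12.15, can be sketched as follows (illustrative values, not the book's data; the 8% residual-channel share is the same assumption as before):

```python
import numpy as np

def to_share(logit_share, residual=8.0):
    # inverse logistic (logit-share) transformation: share = residual * e^logit
    return residual * np.exp(logit_share)

observed = np.array([0.99, 1.08, 0.82, 0.69, 1.10])   # observed logit shares
predicted = np.array([1.01, 1.00, 0.90, 0.75, 0.98])  # predicted logit shares

mse_logit = np.mean((observed - predicted) ** 2)
mse_share = np.mean((to_share(observed) - to_share(predicted)) ** 2)
corr = np.corrcoef(observed, predicted)[0, 1]

# differences are more marked on the share scale: shares of roughly 20
# points amplify small logit discrepancies
print(mse_share > mse_logit)  # → True
```

This illustrates why the ranking of models can look more spread out on the share scale than on the logit scale, as noted above.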
In particular, the linear model does considerably better for the public television channels, probably because they have a more loyal public, easier to predict in a linear way. A mean error of around 5% is considered rather good by television market experts.

Table 12.15 Summary of the multivariate predictive models, expressed as MSE of the shares on the validation data set.

Model          rai1   rai2   rai3   rete4   can5   ita1   overall
Linear model   7.95   4.43   4.74   1.90    4.94   2.81    4.84
MLP network    9.36   5.28   5.20   1.91    5.07   2.88    5.48
RBF network    9.72   5.27   5.15   1.97    5.73   2.93    5.68

We then calculate the mean of all (absolute) correlations between observed and predicted shares, over the considered channels. It is 0.4739 for the linear model, 0.226 for the MLP network and 0.186 for the RBF network. This confirms the previous findings, especially the clear superiority of the linear model over the neural network models; the neural networks would require more data to train such highly complex models well.

The data set contains some anomalous observations, particularly for the public channels. The presence of outliers can have two effects: when they are allocated to the training data set, they bias the model fit; when they are in the validation data set, they inflate the MSE. Both occur in this case study, especially for rai1. To give a clearer picture of these effects, Table 12.16 compares the quantiles of the observed and predicted distributions of the can5 shares, under the best model, namely the linear model. Notice how the model compresses the distribution, considerably diminishing the range.

Table 12.16 Comparison between observed and predicted quantiles.

Quantile        Observed   Predicted
100% (Max)        41.7      30.9409
99%               41.7      29.5403
95%               31.9      28.4538
90%               29.6      27.4046
75% (Q3)          25.8      25.4045
50% (Median)      21.2      23.0174
25% (Q1)          17.8      20.8245
10%               14.8      18.1678
5%                12.4      17.0909
1%                 5.3      14.2822
0% (Min)           5.3      14.2822
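The compression visible in Table 12.16 — predictions spanning a narrower range than the observations — is typical of least-squares fits, whose fitted values cannot have greater variance than the response. A small illustrative check with synthetic numbers (not the case-study shares):

```python
import numpy as np

rng = np.random.default_rng(0)
observed = rng.normal(21.2, 6.0, size=300)  # share-like observed values
# a shrunken prediction imitating Table 12.16: same centre, reduced spread
predicted = 21.2 + 0.4 * (observed - 21.2) + rng.normal(0.0, 1.0, size=300)

range_obs = np.quantile(observed, 0.99) - np.quantile(observed, 0.01)
range_pred = np.quantile(predicted, 0.99) - np.quantile(predicted, 0.01)
print(range_pred < range_obs)  # → True
```

Comparing quantile ranges in this way gives a quick numerical summary of the compression effect.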
Outliers cannot be eliminated from this data set, especially because they may contain precious information on share patterns. For completeness, however, we repeated the analysis omitting the outliers and obtained about a 1% improvement in MSE over the shares.

To summarise, we have been able to build a valid predictive model for the data; the model uses information on the types of programmes broadcast by all channels, the total audience and, to a lesser extent, the date of broadcast (day of the week). The Italian television market appears to exhibit a strong channel loyalty, reflected in the high intercepts (and biases) estimated by the models. While for single-response problems a regression tree seems to be the best model, followed by a neural network model, for the multiple-response case a simpler linear model may be considered a very good choice.

12.6 Summary report

• Context: this case study concerns forecasting television shares. It may also be applied to any situation where the objective is to predict aggregate individual preferences. Here preferences were measured through the chosen television channel; more generally, this type of setting applies to any context where the data reflect consumer choices among a set of alternatives, observed repeatedly in time. Examples are choices between internet portals, videotape or DVD rentals in a given period, brand choices in subsequent visits to a specialised shop, or choice of restaurant in a given city area in a given year.
• Objectives: the aim of the analysis is to build a predictive rule which allows a television network to broadcast programmes that maximise audience share.
• Organisation of the data: the data is one year of television shares for the six leading Italian channels during prime time. Besides shares, there is information on the programme broadcast and its type, as well as on the broadcasting channel and the day of transmission.
The type of a programme depends on how programmes are classified into categories; this is a fairly critical issue.
• Exploratory data analysis: this suggested that television shares are affected mainly by three sources of variation: the broadcasting channel, which expresses loyalty to the channel; the type of programme, which seems to be the driving force of individual preferences; and the day of the week, which determines what else is available to the viewers besides watching television. This also explains why it is important to include the total audience in the analysis. The exploratory analysis also suggested transforming the shares into logit shares, to achieve approximate normality and an easier analysis.
• Model specification: the objective of the analysis suggests a predictive model, and the available (transformed) data specify six potential response variables (the logit shares) and a number of explanatory variables, some of which are channel specific, such as the type of programme, and some not, such as the day of the week and the total audience. We considered predicting a single channel share and all six shares simultaneously. For the univariate problem, we considered a linear regression model, a regression tree, a multilayer perceptron, an RBF network and a nearest-neighbour model. For the multivariate problem, we considered a linear regression model, a multilayer perceptron and an RBF network; multi-response regression trees and nearest-neighbour models were not available.
• Model comparison: the models were compared using cross-validation, in terms of mean square error (MSE) of the predictions, on the training data set and the validation data set. We also considered the correlation coefficient between the observed share and the predicted share. In the univariate case, the regression tree performs best, followed by the linear model, the neural networks and the nearest-neighbour model.
In the multivariate case, the linear model seems to outperform the neural network models, probably because the neural networks require more data.
• Model interpretation: on the basis of the model comparison, it seems that simpler models, such as linear models and regression trees, do the best job for this problem. This is generally true when the available data is not sufficient to obtain correct estimates for the very large number of parameters contained in a more complex model. An overparameterised model, such as a neural network, may adapt well to the data, but its estimates may be based on very few data points, giving rather poor predictive behaviour. This problem is further emphasised when outliers are present in the data; in this setting they cannot be removed, as they may be very important for model building. In terms of business interpretability, the linear model and the regression tree (for the univariate response case) give an understandable decision rule, analytic in the case of linear models and logically deductive in the case of trees. In this type of problem it is very important to incorporate expert judgement, such as an expert-driven classification of programme types.

Bibliography

Agrawal, R., Mannila, H., Srikant, R., Toivonen, H. and Verkamo, A. I. (1995) Fast discovery of association rules. In Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, Cambridge MA.
Agresti, A. (1990) Categorical Data Analysis. John Wiley & Sons, Inc., New York.
Akaike, H. (1974) A new look at statistical model identification. IEEE Transactions on Automatic Control 19, 716–723.
Azzalini, A. (1992) Statistical Inference: An Introduction Based on the Likelihood Principle. Springer-Verlag, Berlin.
Barnett, V. (1975) Elements of Sampling Theory. Arnold, London.
Benzecri, J. (1973) L'analyse des données. Dunod, Paris.
Bernardo, J. M. and Smith, A. F. M. (1994) Bayesian Theory. John Wiley & Sons, Inc., New York.
Berry, M. and Linoff, G.
(1997) Data Mining Techniques for Marketing, Sales, and Customer Support. John Wiley & Sons, Inc., New York.
Berry, M. and Linoff, G. (2000) Mastering Data Mining. John Wiley & Sons, Inc., New York.
Berry, M. A. and Linoff, G. (2002) Mining the Web: Transforming Customer Data. John Wiley & Sons, Inc., New York.
Berson, A. and Smith, S. J. (1997) Data Warehousing, Data Mining and OLAP. McGraw-Hill, New York.
Bickel, P. J. and Doksum, K. A. (1977) Mathematical Statistics. Prentice Hall, Englewood Cliffs NJ.
Bishop, C. (1995) Neural Networks for Pattern Recognition. Clarendon Press, Oxford.
Blanc, E. and Giudici, P. (2002) Statistical methods for web clickstream analysis. Statistica Applicata, Italian Journal of Applied Statistics 14(2).
Blanc, E. and Tarantola, C. (2002) Dependency networks for web clickstream analysis. In Data Mining III, Zanasi, A., Trebbia, C. A., Ebecken, N. N. F. and Melli, P. (eds). WIT Press, Southampton.
Bollen, K. A. (1989) Structural Equations with Latent Variables. John Wiley & Sons, Inc., New York.
Breiman, L., Friedman, J. H., Olshen, R. and Stone, C. J. (1984) Classification and Regression Trees. Wadsworth, Belmont CA.
Brooks, S. P., Giudici, P. and Roberts, G. O. (2003) Efficient construction of reversible jump MCMC proposal distributions. Journal of the Royal Statistical Society, Series B 1, 1–37, with discussion.

Applied Data Mining. Paolo Giudici 2003 John Wiley & Sons, Ltd ISBNs: 0-470-84679-8 (Paper); 0-470-84678-X (Cloth)

Burnham, K. P. and Anderson, A. R. (1998) Model Selection and Inference: A Practical Information-Theoretic Approach. Springer-Verlag, New York.
Cabena, P., Hadjinian, P., Stadler, R., Verhees, J. and Zanasi, A. (1997) Discovering Data Mining: From Concept to Implementation. Prentice Hall, Englewood Cliffs NJ.
Cadez, I., Heckerman, D., Meek, C., Smyth, P. and White, S. (2000) Visualization of navigation patterns on a web site using model based clustering.
In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Boston MA.
Castelo, R. and Giudici, P. (2003) Improving Markov chain model search for data mining. Machine Learning 50, 127–158.
Chatfield, C. (1996) The Analysis of Time Series: An Introduction. Chapman and Hall, London.
Cheng, S. and Titterington, M. (1994) Neural networks: a review from a statistical perspective. Statistical Science 9, 3–54.
Christensen, R. (1997) Log-Linear Models and Logistic Regression. Springer-Verlag, Berlin.
Cifarelli, D. M. and Muliere, P. (1989) Statistica Bayesiana. Iuculano editore, Pavia.
Coppi, R. (2002) A theoretical framework for data mining: the "information paradigm". Computational Statistics and Data Analysis 38, 501–515.
Cortes, C. and Pregibon, D. (2001) Signature-based methods for data streams. Journal of Knowledge Discovery and Data Mining 5, 167–182.
Cowell, R. G., Dawid, A. P., Lauritzen, S. L. and Spiegelhalter, D. J. (1999) Probabilistic Networks and Expert Systems. Springer-Verlag, New York.
Cox, D. R. and Wermuth, N. (1996) Multivariate Dependencies: Models, Analysis and Interpretation. Chapman and Hall, London.
Cressie, N. (1991) Statistics for Spatial Data. John Wiley & Sons, Inc., New York.
Darroch, J. N., Lauritzen, S. L. and Speed, T. P. (1980) Markov fields and log-linear models for contingency tables. Annals of Statistics 8, 522–539.
Dempster, A. (1972) Covariance selection. Biometrics 28, 157–175.
De Ville, B. (2001) Microsoft Data Mining: Integrated Business Intelligence for e-Commerce and Knowledge Management. Digital Press, New York.
Diggle, P. J., Liang, K. and Zeger, S. L. (1994) Analysis of Longitudinal Data. Clarendon Press, Oxford.
Di Scala, L. and La Rocca, L. (2002) Probabilistic modelling for clickstream analysis. In Data Mining III, Zanasi, A., Trebbia, C. A., Ebecken, N. N. F. and Melli, P. (eds). WIT Press, Southampton.
Dobson, A. J. (1990) An Introduction to Generalized Linear Models.
Chapman and Hall, London.
Edwards, D. (1995) Introduction to Graphical Modelling. Springer-Verlag, New York.
Efron, B. (1979) Bootstrap methods: another look at the jackknife. Annals of Statistics 7, 1–26.
Fahrmeir, L. and Hamerle, A. (1994) Multivariate Statistical Modelling Based on Generalised Linear Models. Springer-Verlag, Berlin.
Fayyad, U. M., Piatetsky-Shapiro, G., Smyth, P. and Uthurusamy, R. (eds) (1996) Advances in Knowledge Discovery and Data Mining. AAAI Press, New York.
Frydenberg, M. and Lauritzen, S. L. (1989) Decomposition of maximum likelihood in mixed interaction models. Biometrika 76, 539–555.
Gibbons, D. and Chakraborti, S. (1992) Nonparametric Statistical Inference. Marcel Dekker, New York.
Gilks, W. R., Richardson, S. and Spiegelhalter, D. J. (eds) (1996) Markov Chain Monte Carlo in Practice. Chapman and Hall, London.
Giudici, P. (1998) MCMC methods to determine the optimal complexity of a probabilistic network. Journal of the Italian Statistical Society 7, 171–183.
Giudici, P. (2001a) Bayesian data mining, with application to credit scoring and benchmarking. Applied Stochastic Models in Business and Industry 17, 69–81.
Giudici, P. (2001b) Data mining: metodi statistici per le applicazioni aziendali. McGraw-Hill, Milan.
Giudici, P. and Carota, C. (1992) Symmetric interaction models to study innovation processes in the European software industry. In Advances in GLIM and Statistical Modelling, Fahrmeir, L., Francis, B., Gilchrist, R. and Tutz, G. (eds). Springer-Verlag, Berlin.
Giudici, P. and Castelo, R. (2001) Association models for web mining. Journal of Knowledge Discovery and Data Mining 5, 183–196.
Giudici, P. and Green, P. J. (1999) Decomposable graphical Gaussian model determination. Biometrika 86, 785–801.
Giudici, P. and Passerone, G. (2002) Data mining of association structures to model consumer behaviour. Computational Statistics and Data Analysis 38, 533–541.
Giudici, P., Heckerman, D. and Whittaker, J.
(2001) Statistical models for data mining. Journal of Knowledge Discovery and Data Mining 5, 163–165.
Goodman, L. A. and Kruskal, W. H. (1979) Measures of Association for Cross Classification. Springer-Verlag, New York.
Gower, J. C. and Hand, D. J. (1996) Biplots. Chapman and Hall, London.
Green, P. J., Hjort, N. and Richardson, S. (eds) (2003) Highly Structured Stochastic Systems. Oxford University Press, Oxford.
Greenacre, M. (1983) Theory and Applications of Correspondence Analysis. Academic Press, New York.
Greene, W. H. (1999) Econometric Analysis. Prentice Hall, New York.
Han, J. and Kamber, M. (2001) Data Mining: Concepts and Techniques. Morgan Kaufmann, New York.
Hand, D. (1997) Construction and Assessment of Classification Rules. John Wiley & Sons, Ltd, Chichester.
Hand, D. J. and Henley, W. E. (1997) Statistical classification methods in consumer credit scoring: a review. Journal of the Royal Statistical Society, Series A 160, 523–541.
Hand, D. J., Mannila, H. and Smyth, P. (2001) Principles of Data Mining. MIT Press, Cambridge MA.
Hand, D. J., Blunt, G., Kelly, M. G. and Adams, M. N. (2001) Data mining for fun and profit. Statistical Science 15, 111–131.
Hastie, T., Tibshirani, R. and Friedman, J. (2001) The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer-Verlag, New York.
Heckerman, D. (1997) Bayesian networks for data mining. Journal of Data Mining and Knowledge Discovery 1, 79–119.
Heckerman, D., Chickering, D., Meek, C., Rounthwaite, R. and Kadie, C. (2000) Dependency networks for inference, collaborative filtering and data visualisation. Journal of Machine Learning Research 1, 49–75.
Hoel, P. G., Port, S. C. and Stone, C. J. (1972) Introduction to Stochastic Processes. Waveland Press, Prospect Heights IL.
Inmon, W. H. (1996) Building the Data Warehouse. John Wiley & Sons, Inc., New York.
Jensen, F. (1996) An Introduction to Bayesian Networks. Springer-Verlag, New York.
Johnson, R. A. and Wichern, D. W.
(1982) Applied Multivariate Statistical Analysis. Prentice Hall, Englewood Cliffs NJ.
Johnston, J. and Di Nardo, J. (1997) Econometric Methods. McGraw-Hill, New York.
Kass, G. V. (1980) An exploratory technique for investigating large quantities of categorical data. Applied Statistics 29, 119–127.
Kloesgen, W. and Zytkow, J. (eds) (2002) Handbook of Data Mining and Knowledge Discovery. Oxford University Press, Oxford.
Kolmogorov, A. N. (1933) Sulla determinazione empirica di una legge di distribuzione. Giornale dell'Istituto Italiano degli Attuari 4, 83–91.
Lauritzen, S. L. (1996) Graphical Models. Oxford University Press, Oxford.
Mardia, K. V., Kent, J. T. and Bibby, J. M. (1979) Multivariate Analysis. Academic Press, London.
McCullagh, P. and Nelder, J. A. (1989) Generalized Linear Models. Chapman and Hall, New York.
Mood, A. M., Graybill, F. A. and Boes, D. C. (1991) Introduction to the Theory of Statistics. McGraw-Hill, Tokyo.
Neal, R. (1996) Bayesian Learning for Neural Networks. Springer-Verlag, New York.
Nelder, J. A. and Wedderburn, R. W. M. (1972) Generalized linear models. Journal of the Royal Statistical Society, Series A 135, 370–384.
Parr Rudd, O. (2000) Data Mining Cookbook: Modeling Data for Marketing, Risk, and Customer Relationship Management. John Wiley & Sons, Ltd, Chichester.
Quinlan, R. (1993) C4.5: Programs for Machine Learning. Morgan Kaufmann, New York.
Ripley, B. D. (1996) Pattern Recognition and Neural Networks. Cambridge University Press, Cambridge.
Rosenblatt, F. (1962) Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Spartan, Washington DC.
SAS Institute (2001) SAS Enterprise Miner Reference Manual. SAS Institute Inc., Cary NC.
Schwarz, G. (1978) Estimating the dimension of a model. Annals of Statistics 6, 461–464.
Searle, S. R. (1982) Matrix Algebra Useful for Statistics. John Wiley & Sons, Inc., New York.
Thuraisingham, B. (1999) Data Mining: Technologies, Techniques and Trends. CRC Press, Boca Raton FL.
Tukey, J. W. (1977) Exploratory Data Analysis. Addison-Wesley, Reading MA.
Vapnik, V. (1995) The Nature of Statistical Learning Theory. Springer-Verlag, New York.
Vapnik, V. (1998) Statistical Learning Theory. John Wiley & Sons, Inc., New York.
Weisberg, S. (1985) Applied Linear Regression. John Wiley & Sons, Inc., New York.
Weiss, S. W. and Indurkhya, N. (1997) Predictive Data Mining: A Practical Guide. Morgan Kaufmann, New York.
Westphal, C. and Blaxton, T. (1997) Data Mining Solutions. John Wiley & Sons, Inc., New York.
Whittaker, J. (1990) Graphical Models in Applied Multivariate Statistics. John Wiley & Sons, Ltd, Chichester.
Witten, I. and Frank, E. (1999) Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, New York.
Zadeh, L. A. (1977) Fuzzy sets and their application to pattern classification and clustering. In Classification and Clustering, Van Ryzin, J. (ed.). Academic Press, New York.
Zanasi, A. (ed.) (2003) Text Mining and Its Applications. WIT Press, Southampton.
Zucchini, W. (2000) An introduction to model selection. Journal of Mathematical Psychology 44, 41–61.
INDEX

0–1 distance 189–90
Absolute frequency 26, 257f, 258
Activation functions 108
Agglomerative clustering 77–8, 260, 261t
Akaike information criterion (AIC) 194–5
Applied data mining 33
Apriori algorithm 126
Arbitrage pricing theory (APT) model 93
Arithmetic mean 36
Association, independence and 53–4
Association indexes 52, 57
Association rules 121–6, 209, 218–24, 305
Association (summary) measures 54
Asymmetric distribution 42f, 43
Asymmetric generalised linear models 167
Asymmetry, measures of 41–3
Backward elimination procedure 150, 152, 153
Bagged-tree models 319, 320
Bagging and boosting method 198–9, 205
Bayes’ rule (Inversion) 132, 141–2, 195
Bayesian analysis methods 141–2
Bayesian criteria 195–6, 205
Bayesian networks 183, 184
Best-fit models (saturated models) 160, 176
Bias 107, 194
Bias–variance trade-off 194
Binary response variables 291
Binomial distribution 156
Binomial link function 159
Biplots 127
Bivariate exploratory analysis 33, 45–9, 67
Bivariate linear regression 85–8
Bivariate normal distribution 49
Bootstrap criterion 198
Bottom-up analysis 6
Boxplots 41–2
Capital asset pricing model (CAPM) 88
Cardinality k of neighbourhood 120
CART algorithm 338
  classification rules 340–4t
  confusion matrix 286–7
  division criteria 103–5
  entropy impurity results 310–14t
  Gini impurity results 306–9t
  mean square error 338–9
  pruning 105–7
  tree assessments 104–5
Case studies
  credit scoring 293–321
  customer relationship management 273–91
  market consumer behaviour 209–27
  TV share prediction 323–51
  website visitor behaviour prediction 229–53
  website visitor profiles 255–72
CHAID algorithm 103, 107, 303, 305
  impurity measure 105
  results 303t, 304f
Chi-squared distance 189

Applied Data Mining. Paolo Giudici © 2003 John Wiley & Sons, Ltd. ISBNs: 0-470-84679-8 (Paper); 0-470-84678-X (Cloth)

Chi-squared distribution 136
Chi-squared statistic 300
Churn analysis 273, 290
Classification tree models, credit scoring 303–14
Classification trees 100
  vs.
hierarchical cluster analysis 102
Cluster analysis 13, 70, 75–85
  average linkage 79
  choice of evaluation criteria 77
  choice of variables 76
  complete linkage 79
  group formation method 76
  method of the centroid 79
  output example 83
  proximity index type 76–7
  single linkage 79
  Ward’s method 80
  website visitor profiles 258–62
Coefficient of regression 95
Coefficient of variation (CV) 37
Combination functions 108
Complement rule, probability 131
Computational data mining 69–128
Computing, data mining and 3–5
Concentration, measures of 39–41
Concentration coefficient 57
Concentration curves 40
Concordance 47
Conditional independence 171, 186
Conditional probability (confidence) 252
Confidence intervals 125, 138
Confusion matrix 200, 202
  CART algorithm 286–7
  logistic regression model 286
  nearest-neighbour model 287, 288t
  RBF network 287
Contingency tables 25, 28t, 29t, 52t, 53
Continuous quantitative variables 23
Correlation measures 54
Correspondence analysis 68
Covariance 47–8
Covariance selection models 179
Cramer index 55
Credit scoring 14, 293–321
  analysis objectives 293–4
  classification tree models 303–14
  data description 294–6
  data organisation 320
  exploratory data analysis 296–9, 320
  loan specific variables 294–5
  logistic regression models 299–303
  model building 299–314
  model comparison 314–19, 320
  model interpretation 320–1
  model specification 320
  multilayer perceptron models 314
  objectives 320
  odds ratio 295
  personal and financial variables 294
  rules 320
  socio-demographic variables 294
  summary report 319–21
  wealth indicators 295
Critical values 143
Cross-validation criterion 106, 197–8
Cubic clustering criterion (CCC) 259
Cumulative distribution functions 132
Customer characteristic variables 274
Customer loyalty 290
Customer relationship management 14, 19, 273–91
  classification tree models 281–5
  data description 273–4
  data organisation 291
  exploratory data analysis 275–8, 291
  logistic regression models 278–80
  model building 278–86
  model comparison 286–90
  model interpretation
291
  model specification 291
  nearest-neighbour models 285–6
  objectives 273, 290–1
  radial basis function networks 280–1
  summary report 290–1
Data
  aggregated (divided) 30
  classification 22–3
  transformation 29–30
  types 13
Data analysis 9
Data integration 20
Data marts 7, 21, 22
Data matrices 7, 13, 19, 21, 23–5, 31
  binarisation 25
  data cleansing 7
Data mining
  aims 2, 7
  applications 13–14
  creation phase 10
  definition 1–6
  implementation of methods 9–10
  main methods 13
  migration phase 10
  models 13
  necessary experts 10
  process 6–10
  software 11–12
  and statistics 2, 5–6
  strategic phase 10
  training phase 10
Data mining methods evaluation 187–205
  computational criteria 197–200, 205
  loss functions criteria 200–5
  scoring functions criteria 193–5, 205
  statistical test criteria 188–93
Data organisation 7, 19–31, 226
Data reduction 68
Data retrieval 3
Data sources, identification 7
Data structures 30–1
Data warehouses 7, 19, 20–1
Data webhouse 19, 21–2
Databases
  creation of ready-to-analyse 13
  marketing applications 2
Datamart (marketing database) 274, 291
Decision trees 2, 13
Dendrograms 77–8, 260
Density functions 34
Dependency measures 56–8, 67
Dependent variables (response) 85
Descriptive statistical methods 8, 33
Deviance 160–1, 163
Deviance difference criterion G2 193
Dimensionality, reduction 61–6, 68
Direct sequences 126, 250
Directed acyclic graphs 183
Directed graphical models 244, 245f
Discordance 47
Discrete graphical models 179
Discrete probability functions 133
Discrete quantitative variables 23
Discriminant analysis 98–9
Distance, measures 54–6, 67, 70–5
Distance functions 120, 188–90
Distribution functions 132
Divisive clustering algorithms 77–8, 81
Elementary events 130
Enterprise Miner, SAS data mining software 11, 259, 300
Entropic distance 189
Entropy index 38, 57, 125
Error functions 113–14
  predictive classification 114
  predictive regression 113–14
Error proportional reduction index (EPR) 57
Estimation methods, statistical inference 138
Euclidean distance
71–2, 74, 190
Events 130
Explanatory variables 275, 291
Exploratory data analysis 8, 13, 33–68
  credit scoring 296–9, 320
  customer relationship management 275–8, 291
  forecasting television audience 327–36, 350
  market basket analysis 212–15, 226
  vs. data mining 33
  web clickstream analysis 232–8
F distribution 136–7
F test 150
Factor analysis 67
False negatives, ROC curve 203
False positives, ROC curve 203
Feedback networks 111
Feedforward networks 111, 344
Fisher’s rule 99
Forward selection procedure 150, 299
Frequency diagrams 34
Frequency distributions 25–9
  continuous quantitative variables 34
Fuzzy classification 127
G2 statistic 189
Gaussian distribution 134–5
Generalised linear models 154–67
  application 164–7
  definition 157–63
  exponential distribution 155–7
  inferential results 159–60
  link function 158–9
  model comparison 160–3
  random component 157
  systematic component 157–8
Genetic algorithms 199–200
Gini concentration coefficient 41
Gini concentration index 41
Gini impurity results 306–9t
Gini index 57, 125, 189, 291
Gini index of heterogeneity 38
Gini index of performance 203, 204, 290, 315, 317, 318
Global Markov property 178
Goodness of fit 90–1, 160, 320
Graphical Gaussian models 179
Graphical log-linear models 171–4
Graphical models 177–85
  chain graphs 178
  directed graphs 177, 178
  symmetric 178–82
  undirected graphs 177, 178
  vs. neural networks 184–5
Heterogeneity index 37–9, 57
Heterogeneity reduction 189
Hierarchical cluster analysis
  vs.
classification trees 102
Hierarchical methods 77–81
  evaluation 81–3
Histograms 34, 43, 145
Hypernormal distribution 44
Hyponormal distribution 44
Hypothesis testing 142–3
Identity functions 109
Impurity functions 103–4
Independence, and association 53–4
Independence of events 132
Independent variables 85
Index of dissimilarity 72
Index of similarity 72
Indirect sequences 126, 250
Indirect statistical methods 8
Inference, uncertainty measures and 129–43
Inferential statistical analysis 33, 43
Interquartile range (IQR) 37
Intersection rule 131
Inversion (Bayes’ rule) 132, 141–2, 195
k-fold cross-validation 198
K-means cluster means 263t
k-means method 84, 85
k-nearest-neighbour model 120
Kernel density functions 127
Kernel estimators 145
Knowledge Discovery and Data Mining 15
Knowledge discovery in databases (KDD) 2
Kohonen maps 262–4, 265t
Kohonen networks (self-organising maps) 70, 117–19
Kolmogorov–Smirnov statistics 144, 190
Kullback–Leibler discrepancy 192–3
Kurtosis, measures of 43–5
Learning algorithm, interrupting 115
Least squares principle 90
Least squares scaling 75
Leaves, tree terminal nodes 101
Leaving-one-out cross-validation method 198
Left asymmetric distribution 43
Lift charts 201, 202f, 316, 317f
Likelihood ratio 161
Linear correlation coefficient 48, 95
Linear discriminant analysis 99
Linear model, parameter estimates 338t
Linear regression 70, 85–95
  bivariate 85–8
  goodness of fit 90–1
  multiple 91–5
  multivariate 127
  properties of the residuals 88–90
Link analysis 242–3
Local probabilistic models 121–7, 128
Local statistical methods 8–9
Location, measures of 35–7
Log-linear models 159, 167–77, 186, 215–18
  advantage 227
  application 175–7
  comparison 174–5
  construction 167–9
  interpretation 169–71
  properties 168–9
Logistic curves 97
Logistic discriminant rule 99
Logistic regression models 70, 96–9, 127, 159, 163–4, 169
  confusion matrix 286
  credit scoring 299–303
  customer relationship management 278–80
Logit shares 333, 334t
  correlation matrix
336t
  partial correlation matrix 336t
Longitudinal data 30
Loss functions 314
Loyalty analysis 273
Machine learning 2
Market basket analysis 13, 121, 122, 209–27
  association rules 218–24
  data description 210–12
  exploratory data analysis 226
  model building 215–24
  model comparison 224–6
  model interpretation 227
  model specification 226
  objectives 209, 226
  summary report 226–7
Marketing databases (Datamart) 22, 274, 291
Markov chain model 251
Markov chain Monte Carlo (MCMC) techniques 196
Markov chains 245–50
Markov properties 178, 179f
Maximum heterogeneity 38
Maximum likelihood methods 140–1
Maximum likelihood ratio test statistic 161
Mean 35, 138, 139
Mean contingency 55
Mean square error 347
Median 36
Memory-based reasoning models 70, 120, 127
Metadata 21
Method of the centroid, average linkage method and 79
Metric multidimensional scaling methods 75
Misclassification errors (rate) 200, 288, 305
Mixed graphical models 179
Mixture models 146
Mode (modal value) 36
Model-based indexes 67
Model-based measures 58–61
Monothematic behaviours 269
MSCI WORLD 88
Multidimensional reporting tools 4
Multidimensional scaling methods 74–5, 127
Multilayer perceptrons 70, 128, 344
  application 116–17
  architecture choice 112–13
  coding of the variables 112
  credit scoring models 314
  input variables dimensionality reduction 112
  learning the weights 113
  optimality properties 116
  preliminary analysis 112
  supervised network 111–17
  transformation of the variables 112
  weights calculated 344, 345–6t
Multimedia data 30
Multiple linear regression 91–5
Multivariate exploratory analysis 33, 61
  qualitative data 51–61
  quantitative data 49–51
Multivariate frequency distributions 27–9
Multivariate linear regression 127
Multivariate normal distributions 99
Multivariate target prediction, model comparison 348
Naive Bayes model 184
Nearest-neighbour models 119–21, 128, 346
  confusion matrix 287, 288t
  customer relationship management 285–6
Negative asymmetry 41
Neural networks 13, 70, 107–19, 314
  architecture 109–11
  literature 127
Nominal measurements 23
Non-hierarchical methods 83–5
Non-parametric models 102, 134, 143–6
Normal distribution 156
Normal linear models 146–54
  application 150–4
  main inferential results 147–50
Null heterogeneity 38
Null models (worst-fit models) 160
Occurrence probability (support) 252
Odds ratio 59–61, 209, 296
  analysis 320
  interpretation 298t
  univariate and multivariate comparison 302t
Online analytical processing (OLAP) 4
Optimisation algorithm 113
  choice of 114–15
Ordinal measurements 23
Ordinal variables, ranking 52t
Outliers 29, 76
Overfitting problem 115
Pairwise Markov property 178
Parametric models 134, 143
Partial correlation coefficient 51, 95
Pearson statistic 160, 189
Percentiles 37
Perceptron model 2
Phi coefficient 55
Point estimate methods 138
Poisson distribution 155–6, 167
Polythematic behaviours 269
Positive asymmetry 41
Posterior distribution 141
Prediction, generalisation and 115–16
Predictive misclassification rates 315
Predictive statistical methods 8
Principal component analysis (PCA) 34, 52, 67–8, 90
Principal component transformation 61
Principal components
  application 65–6
  interpretation 63–5
Principle of universal approximation 116
Probabilistic expert systems 126, 182–4, 244–5, 250
Probability 130–2
  statistical models 132–7
Probit models 127
Projection pursuit methods 68
Pseudo-F statistic 149
Qualitative data, multivariate exploratory analysis 51–61
Quantile-quantile plots (qq-plots) 44, 151, 335f
Quantiles 37
Quantitative data, multivariate exploratory analysis 49–51
Quartiles 37
Query tools 3
Radial basis function (RBF) networks 127
  confusion matrix 287
  customer relationship management 280–1
  forecasting television audience 347
Random variables, continuous 133
Range 37
Rao’s score statistic 160
Recursive graphical models 182–4
Regression functions 86
Regression trees 100, 338
  entropy impurity 104
  Gini impurity 104
  misclassification impurity 104
Reporting tools 3, 4
Residuals, deviance 163
Response variable
  qq-plot 151
  univariate indexes 151
Retrieval-by-content models 126–7
Return on equity (ROE) 45, 46f
Return on investment (ROI) 45, 46f
Right asymmetric distribution 43
ROC (receiver operating characteristic) curves 202–4, 289, 291, 315, 320
Root mean square standard deviation (RMSSTD) 82
Sammon mapping 75
Sample mean 138, 139
Sample variance 138, 139
Sampling error 194
Saturated models (best-fit models) 160, 176
Scatterplots 45
  logit shares 335f
Score functions 194
Scorecard models 293–4
Scree plots 65
Self-organising maps (Kohonen networks) 70, 117–19
Semiparametric models 134, 143
Semipartial R2 (SPRSQ) 82
SEMMA method 11–12
Sensitivity, ROC curve 203
Sequence rules 126
Similarity index of Jaccard 74
Similarity index of Russel and Rao 73
Similarity index of Sokal and Michener 74
Similarity measures 72–4
Simple matching coefficient 74
Simpson’s paradox 67
Single-tree models 319
Socio-demographic variables, response variable conditional distribution 275, 276t
Softmax functions 109
Spearman correlation coefficient 51
Specificity, ROC curve 203
Standard deviation 37
Statistical data mining 129–86
Statistical independence 53, 54
Statistical inference 137–43
Statistical methods
  evaluation 9
  specification 8–9
Statistical models
  discrepancy 190–2
  distance between 188–90
  probability 132–7
Statistical variables 23
Statistics
  data mining and 5–6
  models 2
Stepwise procedure, linear model 150, 337
Stress functions 75
Student’s t distribution 136
Summary indexes, series 34
Summary measures 54
Supply chain management 19
Support vector machines 128
Symmetric distribution 42f, 43
Symmetric generalised linear models 167
Symmetric graphical models 178–82
Symmetric statistical methods 8
Systematic error (bias) 194
Television audience
  distribution of total 332
  measuring indicators 324
  variation in total 330
Television audience forecasting 323–51
  analysis objectives 323–4
  data description 324–7
  data organisation 350
  exploratory data analysis 327–36, 350
  linear
model 347
  model building 337–47
  model comparison 347–50, 350–1
  model interpretation 351
  model specification 350
  multilayer perceptron network 347
  objectives 350
  RBF network 347
  summary report 350–1
Television channels
  boxplots 328–9f
  channel shares 14, 324, 327
  mean audience 324
  programme type distribution 331t
  Reach (visitor numbers) 324
  total audience 324
Television programmes
  classification into groups 326
  planning schedule 323
Test data set, customer relationship management 288
Text databases 30
Three-way matrices 30
Top-down analysis 6
Total probability rules 132
Trace (matrix overall variability) 50
Transition probabilities 248
Tree of hierarchical clustering (dendrogram) 77–8, 260
Tree models 70, 100–7, 127, 303–14, 320
  association rules generation 124
  customer relationship management 281–5
Two-way contingency tables 28t, 29t, 52t, 53
Type I errors 315, 318
Type II errors 315
Uncertainty coefficient 57, 189
Uncertainty measures, and inference 129–43
Undirected graphical models 126
Uniform distance, distribution function 190
Union rule, probability 131
Univariate distributions 26–7
Univariate exploratory analysis 33, 34–45, 67
Univariate graphical displays 34
Unsupervised statistical methods 8
Value at risk (VaR), statistical index 134, 135
Variability, measures of 37
Variance 37, 50, 138, 139
Variance–covariance matrix 47, 50, 61
Wald chi-squared statistic 300
Wald’s test 159
Ward’s method 80
Wealth indicators 295
Web clickstream analysis 14, 121, 122, 229–53
  context 252
  data description 229–32
  data organisation 252
  exploratory data analysis 232–8, 252
  link analysis 242–3
  model building 238–50
  model comparison 250–2
  model interpretation 253
  model specification 252
  navigation patterns 238
  objectives 229, 252
  sequence rules 238–42
  summary report 252–3
Web data 30
Webhouse 19, 21–2
Website visitor profiles 14, 255–72
  cluster analysis 258–62
  context 271
  data description 255–7
  data organisation 271
  discrete variables 256
  exploratory data analysis 258, 271
  model building 258–64
  model comparison 264–71
  model interpretation 272
  model specification 271
  objectives 255, 271
  summary report 271–2
  visitor data matrix 257
Websites, data mining information 15
Weighted arithmetic mean 36
Wilks generalised variance 50
Worst-fit models (null models) 160

Index compiled by Geraldine Begley