, where loc is the cell spatial position inside its parent cell, n is the number of points in the cell, P[] is an array of half-space counts, usedCell is a boolean flag and ptr is a pointer to the next tree level. For Halite, this cell structure was slightly modified: the pointer ptr does not exist, and loc holds the absolute position of the cell. In each key/value pair, loc is the key, and the other attributes form the value. Figure 4.5 exemplifies the data storage for Halite. The tables shown are a different way of storing the Counting-tree of Fig. 4.3c. Both approaches represent the same data, the 2-dimensional dataset from Fig. 4.3b, the one used in the examples of Sect. 4.3. To reduce clutter in the figure, the usedCell flags are not shown. Notice, for example, that cell A3 from Fig. 4.3b has a single point. This information is stored in both versions of the tree as A3.n = 1. The spatial position of A3 is given by A3.loc = [11 ↓ 00 ↓ 01] in Fig. 4.5; the same information is found in Fig. 4.3c as [A1.loc ↓ A2.loc ↓ A3.loc] = [11 ↓ 00 ↓ 01]. Finally, the half-space counts are represented by A3.P[1] = 1 and A3.P[2] = 0 in both data structures.

Given the algorithm Halite0 and the fact that the tree can be stored in tables with key/value entries, the implementation of Halite is done as follows: use a traditional approach to store tables with key/value entries in main memory and/or on disk for the tree storage, and apply to Halite the same strategies used by Halite0, described in Algorithms 1, 2 and 3. Notice that, among the data structures used, the Counting-tree is the only one that may have large changes in size with regard to the input dataset. Thus, considering this table-based implementation, it is possible to affirm that Halite never uses disk cache, regardless of the input dataset.

Fig. 4.5 The Counting-tree stored as tables of key/value pairs in main memory and/or on disk. It allows Halite to efficiently cluster large datasets, even when the tree does not fit in main memory. Notice that both this tree and the one in Fig. 4.3c represent the example dataset from Fig. 4.3b:

  level 1:   loc    P[1]  P[2]  n
             [01]     1     4   6
             [11]     1     3   3

  level 2:   loc       P[1]  P[2]  n
             [01 00]     0     0   1
             [01 10]     1     1   3
             [01 11]     1     1   2
             [11 00]     1     0   1
             [11 10]     2     2   2

  level 3:   loc          P[1]  P[2]  n
             ...
             [11 00 01]     1     0   1
             ...

The current implementation of Halite stores the Counting-tree using the Oracle Berkeley DB 11g² configured for simple data storage to avoid data locks. The cache size is set according to the available main memory. It currently uses hash tables to store the key/value pairs, but other structures, such as B-trees, could also be used.

4.5 Presented Method: Soft Clustering

The algorithm Halite is a hard clustering method, i.e., it defines a dataset partition by ensuring that each point belongs to at most one cluster. Hard clustering methods lead to high-quality results for most datasets. Also, several applications require the definition of a dataset partition. However, hard clustering is not the best solution for some specific cases, in which the clusters have a high probability of overlapping. Consider the 2-dimensional dataset in Fig. 4.6a. The data contain a pair of clusters that overlap, making any dataset partition a poor choice, since the points in light-gray should belong to both clusters. In cases like this, the so-called soft clustering methods are more appropriate, as they allow points in the overlapping spaces to belong to more than one cluster. For that reason, this section describes Halites, a soft correlation clustering approach based on Halite.

As a real example, let us consider the clustering analysis of satellite images. In this scenario, a topographer wants to analyze terrains in a set of images, usually assuming that each image is split into tiles (say, 32×32 pixels), from which features are extracted.
² www.oracle.com/technology/products/berkeley-db/

The topographer expects to find clusters of 'water' tiles, 'concrete' tiles, 'forest' tiles, etc., but the procedure used tends to create many hybrid tiles, such as a bridge (both 'water' and 'concrete') or a park ('water', 'forest' and 'concrete'), which should belong to more than one cluster. In other words, there is a high probability that the clusters overlap in the data space. Therefore, a hard clustering method is semantically inappropriate for this case.

Fig. 4.6 Illustration of the soft clustering method Halites: β-clusters (dotted rectangles) may stay apart (top) if they are incompatible, resulting in soft clustering (light-gray circles in (a)); or be merged together (bottom). The compression-based formulas of Halites automatically make the right choice. In all cases, a 'star' indicates the center of the respective clusters

Halites is a fast and scalable algorithm carefully designed to spot points that should belong to two or more clusters, being recommended for scenarios where the probability of cluster overlap is high, as in our previous example with satellite imagery. The algorithm has three phases. The first two are the same ones used for hard clustering: Algorithms 1 and 2, including the improvements presented in Sect. 4.4. The third phase is new. It takes β-clusters as input and uses a compression-based analysis to combine some of the overlapping ones into a soft clustering result. Figure 4.6 illustrates the problem. Two distinct 2-dimensional datasets are shown in Fig. 4.6a and d. Each dataset contains a pair of overlapping β-clusters, depicted by dotted rectangles. The β-clusters in Fig. 4.6a clearly form two clusters and, thus, they should remain apart. However, the ones in Fig. 4.6d should be combined into a single cluster. Halites automatically makes the right choice in both cases.
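This compression-based choice can be sketched in a few lines. The sketch below is only an illustration, not the book's implementation: compress_size_bits() stands in for the bit counting of Algorithm 5 (the 32-bit description headers and the 16-bit quantization of the differences are arbitrary assumptions), and should_merge() applies the MDL test of Algorithm 4 to one pair of overlapping β-clusters, given as plain point arrays.

```python
import numpy as np

def compress_size_bits(points):
    """Toy version of Algorithm 5: bits needed to store `points` after
    projecting them onto their own principal components."""
    pts = np.asarray(points, dtype=float)
    centered = pts - pts.mean(axis=0)
    # principal components via SVD of the centered point set
    _, _, components = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ components.T
    lo, hi = projected.min(axis=0), projected.max(axis=0)
    # differences to the center of each component's [min, max] range
    diffs = projected - (((hi - lo) / 2.0) + lo)
    # description cost: one header (say, 32 bits each) for the component,
    # its max and its min -- an arbitrary assumption for this sketch
    size = 3 * 32 * components.shape[0]
    # small differences cost few bits, large ones cost many (toy 16-bit scale)
    size += np.sum(np.ceil(np.log2(np.abs(diffs) * 2**16 + 2)))
    return size

def should_merge(cluster_a, cluster_b):
    """MDL test of Algorithm 4 for one pair of overlapping beta-clusters:
    merge them iff the union compresses better than the two taken apart."""
    apart = compress_size_bits(cluster_a) + compress_size_bits(cluster_b)
    combined = compress_size_bits(np.vstack([cluster_a, cluster_b]))
    return combined < apart
```

For instance, two thin collinear segments are merged (one set of component descriptions is saved while the stored differences barely grow), whereas two crossing, offset segments stay apart (the union inflates the stored differences).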
The full pseudo-code for the new phase three is shown in Algorithm 4. The idea is to use the Minimum Description Length (MDL) principle and to analyze the compression ratio of pairs of overlapping β-clusters, where the clusters in each pair can be compressed apart or combined. Halites picks the option that compresses best. If the combined cluster allows better compression than the two β-clusters do separately, the β-clusters are merged; otherwise, they remain apart, allowing points in the overlapping spaces to belong to both clusters.

Algorithm 4: Building soft clusters.
Input: matrices of β-clusters L, U and V; number of β-clusters βk
Output: set of correlation clusters C; number of correlation clusters γk
 1: for k = 1, 2, ..., βk do
 2:   for k' = k + 1, k + 2, ..., βk do
 3:     if β-clusters δβCk and δβCk' overlap then
 4:       size_k = compress(δβSk), if not done yet;
 5:       size_k' = compress(δβSk'), if not done yet;
 6:       size_k∪k' = compress(δβSk ∪ δβSk');
 7:       if (size_k + size_k') > size_k∪k' then
 8:         // the combined cluster compresses best
 9:         merge β-clusters δβCk and δβCk' into a single cluster, including merged clusters in further mergers, as is done in Algorithm 3;
10:       end if
11:     end if
12:   end for
13: end for

The strategy used for compression is described in Algorithm 5, which details the function compress() used in Algorithm 4. The function receives a set of points δγSk, related to a possible cluster δγCk, and returns the size of the input data in bits, after compression. Notice: in order to avoid additional disk accesses, δγSk is approximated by the tree level where the respective β-clusters were found, assuming points in the center of the cluster's cells. The compression is performed as follows. The principal components of δγSk are computed, and the points of δγSk are projected onto all d principal components.
Let the projected points be in a set δγPk and the d principal components computed be in a set γAk. Notice that this step states that δγPk and γAk define the possible cluster δγCk as δγCk = ⟨γAk, δγPk⟩ after its projection onto the principal components of δγSk. Then, the maximum maxj and minimum minj values of δγPk in each principal component aj ∈ γAk are found. After that, one can represent the input data δγSk in a compressed way by storing: the descriptions of each principal component aj, besides their related values for maxj and minj; and the differences to the center, pij − (((maxj − minj)/2) + minj), for each projected point pi ∈ δγPk, with regard to each principal component aj ∈ γAk. The size in bits needed to store these items is the output. Large numbers need more bits to be stored than small numbers do. Thus, a set of points in which the stored differences to the center are small tends to lead to a good compression.

Algorithm 5: Function compress().
Input: set of points δγSk for a possible cluster δγCk
Output: compressed size for δγSk
 1: compute the principal components of δγSk;
 2: δγPk = δγSk, projected onto all d principal components computed;
 3: γAk = the set of d principal components computed;
    // Notice that Lines 2 and 3 state that δγPk and γAk define the possible cluster δγCk as δγCk = ⟨γAk, δγPk⟩ after its projection onto the principal components of δγSk.
 4: maxj, minj = the maximum and the minimum values of δγPk, in each principal component aj ∈ γAk;
 5: size = the number of bits needed to store the descriptions of each principal component aj ∈ γAk, besides its related values for maxj and minj;
 6: for every projected point pi ∈ δγPk do
 7:   for j = 1, 2, ..., d do
 8:     // difference to the center wrt principal component aj ∈ γAk
 9:     size = size + number of bits needed to store (pij − (((maxj − minj)/2) + minj));
10:   end for
11: end for
12: return size;

Let us illustrate how the compression idea works by using the example datasets in Fig. 4.6. Dotted rectangles, gray arrows and stars refer to β-clusters, principal components and cluster centers, respectively. First, consider the data in Fig. 4.6a. For the β-clusters apart, Fig. 4.6b shows that we have 'medium compression', since the differences to the centers are small in one principal component of each cluster and large in the other. However, Fig. 4.6c shows that the compression is worse for the combined cluster, since the differences to the center are large in both principal components. Thus, Halites keeps the β-clusters apart. Notice that this step defines that the points in gray will belong to both clusters in the final result. For the data in Fig. 4.6d, it is respectively shown in Fig. 4.6e and f that the differences to the centers are small in one principal component and large in the other, both with the β-clusters kept separate and combined. However, in the first case, Halites stores the descriptions of two sets of principal components and the related values of maxj and minj for each component. A single set is stored for the combined cluster, which leads to better compression. Therefore, Halites decides to combine the β-clusters into a single correlation cluster.

4.6 Implementation Discussion

A possible bottleneck in the algorithm described is related to computing the critical value θαj for the statistical test, in line 14 of Algorithm 2. As shown in Sect.
4.3, the new algorithm carefully identifies data space regions that refer to bumps in the point density and verifies whether these regions stand out in the data in a statistical sense, thus spotting clusters. The Binomial distribution Binomial(n, p) is the basis for the statistical test. However, computing the critical value with the exact Binomial distribution may become a bottleneck for large values of n. Fortunately, an efficient approximation to the Binomial(n, p) is given by the Normal distribution Normal(np, np(1 − p)). Also, it is common sense in statistics that the approximation quality is excellent when np > 5 ∧ n(1 − p) > 5. Thus, both Halite and Halites compute θαj using the normal approximation to the Binomial distribution whenever this rule applies. The exact computation is very efficient in all other cases.

4.7 Experimental Results

This section presents the experiments performed to test the algorithms described in the chapter. The experiments intend to answer the following questions:

1. Compared with seven of the recent and related works, how good is the clustering method Halite?
2. How do the new techniques scale up?
3. How sensitive to the input parameters are the new techniques?
4. What are the effects of soft clustering on data with high probability of cluster overlap and, compared with a well-known soft clustering algorithm, how good is Halites?

All experiments were run on a machine with 8.00 GB of RAM and a 2.33 GHz core. The new methods were tuned with a fixed configuration in all experiments, i.e., the default values α = 1.0E−10 and H = 4. The justification for this choice is in the upcoming Sect. 4.7.4. Finally, notice that results on memory usage are reported neither for Halite nor for Halites, since both allow the efficient use of data partially stored on disk, as described in Sect. 4.4. Note, however, that the experiments do compare the memory needs of the related works with those of Halite0.
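Returning to the critical value θαj of Sect. 4.6 for a moment, the switch between the exact Binomial tail and its normal approximation can be sketched as below. This is only an illustration: the function name and the tail convention P(X ≥ θ) ≤ α are assumptions of this sketch, not the book's code.

```python
import math
from statistics import NormalDist

def critical_value(n, p, alpha):
    """Smallest count theta with upper-tail probability P(X >= theta) <= alpha
    for X ~ Binomial(n, p), switching to Normal(np, np(1 - p)) when the
    rule of thumb np > 5 and n(1 - p) > 5 holds."""
    if n * p > 5 and n * (1 - p) > 5:
        mu, sigma = n * p, math.sqrt(n * p * (1 - p))
        # inv_cdf is the normal quantile function; returns a float threshold
        return NormalDist(mu, sigma).inv_cdf(1.0 - alpha)
    # exact Binomial tail: n is small here, so the loop stays cheap
    cdf = 0.0
    for x in range(n + 1):
        cdf += math.comb(n, x) * p**x * (1 - p)**(n - x)
        if cdf >= 1.0 - alpha:
            return x + 1   # P(X >= x + 1) = 1 - cdf(x) <= alpha
    return n + 1
```

For n = 100, p = 0.5 the rule applies and the normal quantile is used; for n = 4 or very skewed p the exact tail is summed instead.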
Moreover, remember that Halite0, Halite and Halites have similar space complexity, the difference being that Halite and Halites do manage the memory on disk if the Counting-tree does not fit in main memory.

4.7.1 Comparing Hard Clustering Approaches

This section compares Halite with seven of the top related works over synthetic and real data. The techniques are: ORCLUS [2, 3], COPAC [1], CFPC [19], HARP [18], LAC [6], EPCH [13] and P3C [11, 12]. All methods were tuned to find disjoint clusters. The code of ORCLUS was kindly provided by Kevin Y. Yip and the project Biosphere. The source codes for all other methods were kindly provided by their original authors (i.e., Arthur Zimek and Elke Achtert provided COPAC; Man Lung Yiu and Nikos Mamoulis provided CFPC; Kevin Y. Yip provided HARP; Carlotta Domeniconi provided LAC; Raymond Chi-Wing Wong provided EPCH; and Gabriela Moise provided P3C). Results for Halite0 are also reported, which allows one to evaluate the improvements made on the basic algorithm.

4.7.1.1 Evaluating the Results

The quality of each result given by each technique is measured based on the well-known precision and recall measurements. The evaluation distinguishes between the clusters known to exist in a dataset dS, which are named real clusters, and those that a technique finds, which are named found clusters. A real cluster δrCk = ⟨rAk, δrSk⟩ is defined as a set rAk of δ axes, aligned or not to the original axes, together with a set δrSk of points densely clustered when projected onto the subspace formed by rAk. Notice that the symbol A is used here in place of the symbol E to represent a set of axes, since the axes in rAk can be original axes, but they can also be linear combinations of the original axes, i.e., rAk is not necessarily a subset of E. A found cluster δfCk = ⟨fAk, δfSk⟩ follows the same structure of a real cluster, using the symbol f instead of r.
Finally, fk and rk respectively refer to the numbers of found and real clusters existing in dataset dS. For each found cluster δfCk, its most dominant real cluster δrCk' is identified by the following equation:

  δrCk' | |δfSk ∩ δrSk'| = max(|δfSk ∩ δrSk''|), 1 ≤ k'' ≤ rk    (4.2)

Similarly, for each real cluster δrCk, its most dominant found cluster δfCk' is identified by the equation as follows:

  δfCk' | |δrSk ∩ δfSk'| = max(|δrSk ∩ δfSk''|), 1 ≤ k'' ≤ fk    (4.3)

The precision and the recall between a found cluster δfCk and a real cluster δrCk' are computed as follows:

  precision = |δfSk ∩ δrSk'| / |δfSk|    (4.4)

  recall = |δfSk ∩ δrSk'| / |δrSk'|    (4.5)

To evaluate the quality of a clustering result, one averages the precision (Eq. 4.4) over all found clusters and their respective most dominant real clusters. Also, one averages the recall (Eq. 4.5) over all real clusters and their respective most dominant found clusters. These two averaged values are closely related to well-known measurements: the first one is directly proportional to the dominant ratio [2, 13], while the second one is directly proportional to the coverage ratio [13]. The harmonic mean of these averaged values is named Quality. The evaluation of a clustering result with regard to the quality of the subspaces uncovered is similar. One also computes the harmonic mean of the averaged precision for all found clusters and the averaged recall for all real clusters, but exchanging the sets of points (S sets) in the last two equations, Eqs. 4.4 and 4.5, by sets of axes (A sets). This harmonic mean is named Subspaces Quality. Finally, in the cases where a technique does not find clusters in a dataset, the value zero is assumed for both qualities.

4.7.1.2 System Configuration

Halite uses fixed input parameter values, as defined in the upcoming Sect. 4.7.4. Halite0 was tuned in the same way. The other algorithms were tuned as follows.
ORCLUS, LAC, EPCH, CFPC and HARP received as input the number of clusters present in each dataset. Also, the known percentage of noise for each dataset was informed to HARP. The extra parameters of the previous works were tuned according to their original authors' instructions. LAC was tested with integer values from 1 to 11 for the parameter 1/h. However, its run time differed considerably for distinct values of 1/h; thus, a time-out of three hours was specified for LAC executions, and all configurations that exceeded this limit were interrupted. EPCH was tuned with integer values from 1 to 5 for the dimensionalities of its histograms, and several real values varying from 0 to 1 were tried for the outliers threshold. For the tests with P3C, the values 1.0E−1, 1.0E−2, 1.0E−3, 1.0E−4, 1.0E−5, 1.0E−7, 1.0E−10 and 1.0E−15 were tried for the Poisson threshold. HARP was tested with the Conga line data structure. CFPC was tuned with the values 5, 10, 15, 20, 25, 30 and 35 for w; 0.05, 0.10, 0.15, 0.20 and 0.25 for α; 0.15, 0.20, 0.25, 0.30 and 0.35 for β; and 50 for maxout. ORCLUS was tested with its default values α = 0.5 and k0 = 15k, where k is the known number of clusters present in each dataset. It also received as input the known average cluster dimensionality of each synthetic dataset, and all possible dimensionalities were tested for the real data. COPAC was tuned with its default values α = 0.85 and k = 3d, and its default distance function was used. Its parameter ε was defined as suggested in COPAC's original publication, and μ received the smallest value between k and the known size of the smallest cluster present in each dataset, since COPAC demands μ ≤ k. Notice two remarks: (a) each non-deterministic previous work was run 5 times in each possible configuration and the results were averaged.
The averaged values were taken as the final result for each configuration; (b) all results reported for the previous methods refer to the configurations that led to the best Quality value over all possible parameter tunings.

4.7.1.3 Synthetic Data Generation

Synthetic data were created following standard procedures used by most methods described in Chap. 3, including the tested methods. The details are in Algorithm 6. In a nutshell: (1) the procedure initially creates axes-aligned, elliptical clusters of random sizes that follow normal distributions with random means and random variances in at least 50% of the axes (the relevant axes), spreading through at most 15% of these axes' domains; in the other axes, the irrelevant ones, all clusters follow the uniform distribution, spreading through the whole axes' domains; and (2) an optional data rotation allows creating clusters not aligned to the original axes. In this step, each dataset is rotated four times in random planes and by random degrees.

Algorithm 6 was used to create synthetic datasets organized in several groups. A first group of datasets was created to analyze each tested method with regard to increasing numbers of points, axes and clusters. It contains 7 non-rotated datasets with d, η and γk growing together from 6 to 18, 12 to 120k and 2 to 17, respectively. The noise percentage was fixed at 15%. For identification purposes, the datasets are named

Algorithm 6: Function generate_one_dataset().
Input: dimensionality d, cardinality η, number of clusters γk, percentage of noise pN, choice between axes-aligned and arbitrarily oriented clusters rotate
Output: one synthetic dataset dS
 1: // generate the noise points
    dS = η · (pN/100) random points in [0, 1)^d;
 2: // define the sizes of the clusters
    c[] = array of γk random integers, such that Σk c[k] = η · ((100 − pN)/100);
 3: for k = 1, 2, ..., γk do
 4:   define cluster[c[k]][d]; // empty matrix (new cluster)
 5:   δ = random integer in [d/2, d]; // δ ≥ 50% of d
 6:   randomly pick δ axes; // relevant axes, at least 50%
 7:   for j = 1, 2, ..., d do
 8:     if axis ej was picked then
 9:       // the cluster will spread through at most 15% of the domain of ej
          choose random mean and variance, such that the Normal(mean, variance) distribution generates values in [0, 1) that differ at most 0.15 from each other;
10:       cluster[:][j] = c[k] real numbers following Normal(mean, variance);
11:     else cluster[:][j] = c[k] random values in [0, 1);
12:     end if
13:   end for
14:   insert each row of cluster[][] as a point in dS;
15: end for
16: if rotate then
17:   // optional step of rotation
      rotate dS four times in random planes and by random degrees, and then normalize dS to [0, 1)^d;
18: end if
19: return dS;

6, 8, 10, 12, 14, 16 and 18d, according to their dimensionalities. Rotated versions of these datasets were also created to analyze each method over clusters not aligned to the original axes; these datasets are named 6, 8, 10, 12, 14, 16 and 18d_r.

The strategy employed to create the 14d dataset (14 axes, 90k points, 17 clusters, 15% noise and non-rotated) was the basis for the creation of the other groups of datasets. Based on it, 3 groups of datasets were created, each varying one of the following characteristics: number of points, axes or clusters. Each group has datasets, created by Algorithm 6, in which a single characteristic changes, while all others remain the same as the ones in the 14d dataset.
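Algorithm 6 can be sketched in Python as below (axes-aligned case only; the optional rotation of line 17 is omitted). The deterministic cluster-size split, the σ = 0.15/6 choice for the "at most 15% of the domain" constraint, and the clamping to [0, 1) are simplifying assumptions of this sketch, not the book's exact procedure.

```python
import random

def generate_one_dataset(d, eta, gamma_k, p_noise, seed=0):
    """Sketch of Algorithm 6: eta points in [0, 1)^d with gamma_k
    axes-aligned clusters plus p_noise percent of uniform noise."""
    rng = random.Random(seed)
    n_noise = int(eta * p_noise / 100)
    data = [[rng.random() for _ in range(d)] for _ in range(n_noise)]
    # split the remaining points among the clusters (the book draws random sizes)
    remaining = eta - n_noise
    sizes = [remaining // gamma_k] * gamma_k
    sizes[0] += remaining - sum(sizes)
    for size in sizes:
        delta = rng.randint(d // 2, d)              # at least 50% relevant axes
        relevant = set(rng.sample(range(d), delta))
        # one Normal(mean, sigma) per relevant axis; sigma = 0.15 / 6 keeps
        # ~99.7% of the draws within +-0.075 of the mean (<= 15% of the domain)
        params = {j: (rng.uniform(0.075, 0.925), 0.15 / 6) for j in relevant}
        for _ in range(size):
            point = []
            for j in range(d):
                if j in relevant:
                    mean, sigma = params[j]
                    point.append(min(max(rng.gauss(mean, sigma), 0.0), 1.0 - 1e-9))
                else:
                    point.append(rng.random())      # irrelevant axis: uniform
            data.append(point)
    return data
```

Calling generate_one_dataset(d=14, eta=90000, gamma_k=17, p_noise=15) would mimic the 14d dataset described above, up to the simplifications noted.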
The number of points grows from 50k to 250k, the dimensionality grows from 5 to 30, and the number of clusters grows from 5 to 25. The names Xk, Xc and Xd_s refer respectively to the datasets in the groups with changing numbers of points, clusters or axes. For example, dataset 30d_s differs from 14d only in that it has 30 axes instead of 14.

4.7.1.4 Results on Synthetic Data

This section compares the methods on synthetic data. Results for clustering quality, memory consumption and run time are shown. For easy reading of the graphs in this section, the values obtained for each method were linked, and all vertical axes related to run time or to memory consumption are plotted in log scale. When a method found no cluster in a dataset, despite being run several times for each possible parameter configuration, the respective values measured for run time and for memory consumption were ignored, since in most cases each run led to different measured values due to the distinct configurations used. When a method used disk cache, its run time was ignored too. In these cases, no lines link the values obtained for the respective methods in the graphs.

Figure 4.7 presents the results for run time and for clustering accuracy. Figure 4.7a shows that Halite, Halite0, EPCH, HARP and LAC presented similarly high Quality values for all the datasets in the first group. CFPC presented a clear decrease in Quality when the dimensionality was higher than 12. P3C, ORCLUS and COPAC had worse Quality results. Regarding run time, Fig. 4.7b shows that Halite was in general the fastest algorithm, losing by little to Halite0 for small data. As an example, for the biggest dataset, 18d, Halite was respectively 6, 17, 54, 145, 515, 1,440 and 10,255 times faster than Halite0, CFPC, EPCH, LAC, P3C, ORCLUS and COPAC. The results for data with individual changes in numbers of points, axes and clusters are shown in Fig. 4.7c–h.
Notice that Halite, Halite0, LAC and EPCH performed well in Quality for all cases, exchanging positions but staying in general within 10% of each other, with no method prevalent. Notice also that Halite and Halite0 always showed the same Quality. ORCLUS, COPAC, CFPC, HARP and P3C performed worse than that. Halite was again the fastest method in almost all cases, only tying with Halite0 for low-dimensional datasets. As an example, for the dataset with the highest dimensionality, 30d_s, Halite ran respectively 9, 25, 36, 419, 1,542, 3,742 and 5,891 times faster than Halite0, CFPC, LAC, P3C, ORCLUS, HARP and COPAC.

Fig. 4.7 Comparison of approaches for hard clustering; Halite is shown in black vertical crossing lines. Left column: quality; right column: wall-clock time in log scale. Halite was on average at least 12 times faster than seven top related works, always providing high-quality clusters

Fig. 4.8 Results on memory consumption for synthetic data

Another experiment refers to the rotated data. It analyzes each method's ability to find clusters in subspaces formed by linear combinations of the original axes. The results are in Fig. 4.7i and j. Halite, Halite0 and LAC were only marginally affected by rotation, varying at most 5% in their respective Quality values compared to the results for the same data without rotation. All the others had considerably decreased or increased Quality values in at least one case. Run time results were similar to those obtained for the non-rotated data.

The results on memory usage are presented in Fig. 4.8. Results for P3C, LAC, EPCH, CFPC, HARP, and also for the method Halite0, are reported.
Figure 4.8 shows that, in all cases, there was a huge discrepancy between HARP and EPCH and the other methods with regard to memory usage; note the log scale on every Y-axis. As an example, for dataset 18d, the biggest one in the first group of datasets, HARP used approximately 34.4 GB of memory and EPCH used 7.7% of this amount, while Halite0 used only 0.3% of the memory required by HARP.

The quality of the relevant axes was also evaluated. LAC, COPAC and ORCLUS were not tested here, as they do not return a set of original axes defining the axes relevant to each cluster. The results for the first group of datasets are in Fig. 4.9. The Subspaces Quality values are similar for Halite, Halite0 and EPCH; all the others had worse results. The same pattern was seen in the other datasets.

Fig. 4.9 Subspace quality for synthetic data

Concluding, P3C had the worst Quality values in most cases. HARP, CFPC, ORCLUS and COPAC provided average Quality values for some datasets, but the results were not good in several cases. Halite, Halite0, LAC and EPCH had the best results, in general tying with regard to Quality values. However, in contrast to Halite and Halite0, the methods LAC and EPCH demanded guessing the number of clusters and required distinct threshold tunings for each dataset to obtain their best reported results. Regarding memory needs, remember that Halite0, Halite and Halites have similar space complexity, the difference being that Halite and Halites do manage the memory on disk if the Counting-tree does not fit in main memory. With that in mind, the memory needs of the related works were compared with those of Halite0. This comparison shows that CFPC in general required the least amount of memory, followed by LAC, Halite0, P3C, EPCH and HARP, which respectively required 1.2, 2.8, 6.5, 112 and 600 times more memory than CFPC on average. Therefore, the memory consumption of Halite0 was similar to those of the top related works. Regarding run time, Halite was the fastest method in almost all cases, only losing by little to Halite0 on some small datasets. The improvements on the basic implementation allowed Halite to be about one order of magnitude faster than Halite0 for large data. Halite also avoids the use of disk cache. Finally, when compared to the related works, Halite was on average 12, 26, 32, 475, 756, 2,704 and 7,218 times faster than CFPC, LAC, EPCH, P3C, ORCLUS, HARP and COPAC, respectively. Notice that Halite was on average at least 12 times faster than seven recent and related works, always providing high-quality clusters.

4.7.1.5 Real Data

The real dataset used to test the methods is the training data provided for the Siemens KDD Cup 2008.³ It was created for automatic breast cancer diagnosis, consisting of 25 of the most significant features extracted automatically from 102,294 Regions of Interest (ROIs) present in X-ray breast images of 118 malignant cases and 1,594 normal cases. Ground truth is also provided. The data was partitioned into four datasets, each containing features extracted from homogeneous images, i.e., each dataset has features extracted from ∼25,000 ROIs related to images taken from one breast, left or right, in one of the possible directions, CC or MLO. The Quality results were computed based on the ground-truth class label of each ROI.

³ http://www.kddcup2008.com

Fig. 4.10 Quality versus run time in linear-log scale over 25-dimensional data for breast cancer diagnosis (KDD Cup 2008). Halite was at least 11 times faster than 5 previous works (2 others failed), increasing their accuracy by up to 35%. Similar behavior occurred in synthetic data

4.7.1.6 Results on Real Data

All methods were tested with the real data. However, LAC and P3C failed for all four datasets in all tested configurations.
LAC always grouped all points into a single cluster. P3C did not finish within a week in any case. Thus, they are not reported. The results for left breast images in one MLO view are shown in Fig. 4.10. Halite was at least 11 times faster than the previous works, increasing their accuracy in up to 35%. The other three real datasets led to similar results.

4.7.2 Comparing Soft Clustering Approaches

This section compares Halites with STATPC [10], a well-known, state-of-the-art soft clustering method. The original code for STATPC was used, which was graciously provided by Gabriela Moise. The input parameter values used for STATPC were the default values suggested by its authors, α0 = 1.0E−10 and αK = αH = 0.001. Halites uses fixed input parameter values, as defined in the upcoming Sect. 4.7.4. The methods were compared in a scenario with high probability of cluster overlap, the example scenario with satellite imagery from Sect. 4.5. 14 high quality satellite images of famous cities were analyzed, such as the city of Hong Kong in Fig. 4.11a. The images, available at "geoeye.com", have a combined size of 17 MB. Each image was divided into equal-sized rectangular tiles, from which Haar wavelet features were extracted. The process led to a 10-dimensional dataset of 14,336 points. Figure 4.11b and c respectively exemplify the results for STATPC and for Halites over this data, by coloring each tile from the example image of Hong Kong according to its cluster. As expected, some tiles belong to more than one cluster. These were

Fig. 4.11 Comparing Halite and Halites with STATPC on data with cluster overlap (panels: (a) Hong Kong; (b) STATPC: water, concrete, sand; (c) Halites: water, concrete, sand; (d) Halite: water, land).
As expected, hard clustering leads to correct, but less detailed results: roughly speaking, Halites and STATPC found clusters of 'water' tiles (cyan), 'sand' tiles (red) and 'concrete' tiles (green); Halite merged the last two clusters into a cluster of 'land' tiles (red). The results for Halites and STATPC are similar. Notice, however, that Halites found the clusters in only two seconds, while STATPC took two days to perform the same task. Therefore, the new solution can be used in real time applications. IKONOS/GeoEye-1 satellite image courtesy of GeoEye.

colored according to the first cluster assigned to them. Notice that both results are similar, with clusters that represent the main patterns apparent in the example image. However, STATPC took two days to find the clusters, while Halites performed the same task in only two seconds. Similar results were obtained for the other 13 images. Notice that these results indicate that the new method allows the development of real time applications, such as software to aid the diagnosis process on the fly in a worldwide Healthcare Information System, or a system to look for deforestation within the Amazon Rainforest in real time. Finally, results for Halite (Fig. 4.11d) are reported, which, as expected, are still correct, but provide fewer details compared to the soft clustering ones. Roughly speaking, Halites and STATPC found clusters of 'water' tiles (cyan), 'sand' tiles (red) and 'concrete' tiles (green), whereas Halite merged the last two clusters into a single cluster of 'land' tiles (red). These results corroborate the conjecture from Sect. 4.5, that soft clustering is more appropriate for this kind of data.

Fig. 4.12 Scalability of Halite and Halites on synthetic data of varying sizes and dimensionality. Plots in log-log scale. Notice: both methods scale as expected, according to the theoretical complexity analysis from Sect.
4.3. Panels: (a) scalability with regard to the data size (100 k to 1 million points; 1.0 slope); (b) scalability with regard to the dimensionality; curves for Halite and Halites

4.7.3 Scalability

This section analyzes the scalability of Halite and Halites with regard to increasing data sizes and dimensionality. Synthetic datasets were created by Algorithm 6. The data sizes and the dimensionality vary from 100 k to 1 million points and from 5 to 30 axes, respectively. The datasets have 10 axes-aligned clusters each and 15% of noise. Notice in Fig. 4.12 that the new techniques scale as expected, according to the theoretical complexity analysis presented in Sect. 4.3.

4.7.4 Sensitivity Analysis

The behavior of the new techniques varies based on two parameters: α and H. This section analyzes how they affect these methods. Both parameters were varied for Halite, Halite0 and Halites to maximize the Quality values obtained, defining the best configuration for each dataset and technique. Then, for each dataset and technique, the best configuration was modified, changing one parameter at a time, and the technique's behavior was analyzed. For example, when varying H for a dataset and technique, the value of α was fixed at the value in the respective best configuration. The tested values of α and H vary from 1.0E−3 to 1.0E−160 and from 4 to 80, respectively.

Fig. 4.13 Sensitivity analysis that defined the default configuration, α = 1E−10 and H = 4. Panels (a) and (c) report Quality and run time (seconds, log scale) versus α; panels (b) and (d) report Quality and run time versus H (4 to 80), for the datasets 6d to 18d

Figure 4.13 reports the results of Halite0. Figure 4.13a and c present the results regarding α. The values of α that led to the best Quality vary from 1.0E−5 to 1.0E−20, and the run time was barely affected by changes in α. Concerning H, Fig.
4.13b and d show that the Quality does not increase significantly for H higher than 4. However, the run time increased as expected regarding H. Thus, small values for H, such as 4, are enough for most datasets. Similar results were obtained by Halite0, Halite and Halites for all synthetic and real datasets. Therefore, the values α = 1.0E−10 and H = 4 are considered the default configuration for the new techniques. These fixed values were used in all experiments.

4.8 Discussion

This section provides a discussion on some specific characteristics of the new clustering methods described. As a first topic to discuss, notice that the new method looks for dense space regions, and thus it works for any data distribution. This fact was illustrated on rotated Gaussians, as well as on real data of unknown distribution. The quadratic behavior on the number of β-clusters found is not a crucial concern for the new techniques. Experimental evaluation showed that this number closely follows the ground truth number of clusters existing in the tested datasets. In the experiments, the biggest number of β-clusters found over all synthetic and real datasets was 33. Notice that the biggest number of clusters present in these data is 25. The results for the first group of synthetic datasets are in Fig. 4.14, which shows a plot with the Y-axis referring to each dataset tested, and horizontal bars representing the respective number of β-clusters found by Halite, besides the ground-truth number of clusters. The same pattern was seen in all other datasets analyzed. Furthermore, the analysis of data with many clusters is usually meaningless, as it is hard for the user to obtain semantic interpretations from a large number of clusters.

Fig. 4.14 Ground truth number of clusters versus the number of β-clusters found by Halite over synthetic data (first group of datasets, 6d to 18d)
Additionally, the quadratic behavior on H is not a relevant concern for Halite, since very small values for H are commonly sufficient to obtain accurate clustering results. Remember that a Counting-tree describes a d-dimensional dataset in H distinct resolutions. Each cell in the first resolution level h = 0 is divided into 2^d cells in level h + 1, each of which is divided again into 2^d cells in level h + 2, and so on. The process stops when h = H − 1. Thus, for moderate-to-high dimensionality data, the maximum count of points in a cell of tree level h converges exponentially to 1, as the value of h increases to reach H − 1. After the point of convergence, even for a skewed dataset, more levels tend to be useless, since they would not help to better describe the data. The sensitivity analysis in Sect. 4.7 corroborates this claim. Also, notice that Halite is limited with regard to the size of the clusters that it finds. The method analyzes the point distribution in specific regions of the data space, with all dimensions, using a statistical hypothesis test to identify β-clusters, which lead to the clusters. But these regions must have a minimum amount of points to reject the null hypothesis. In this way, Halite may miss clusters with a small amount of points in low-dimensional subspaces, as they tend to be extremely sparse in spaces with several dimensions. On the other hand, the clustering results tend to be better as the number of points in the clusters increases. Thus, Halite is suitable for large, multi-dimensional datasets and, as it scales linearly on the data size, it becomes better as the dataset increases. In this way, Halite tends to be able to spot accurate clusters even from datasets with more than 30 axes, when they are large enough. Notice one last remark: the traditional clustering method STING [17] is a basis for the work described in this chapter. Similarly to Halite, STING also does multi-resolution space division in a statistical approach for clustering.
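As an aside, the exponential-convergence argument above is easy to check numerically. The sketch below is illustrative only (it is not part of Halite's implementation): each level multiplies the number of cells by 2^d, so the average occupancy n / 2^(d·h) drops at or below one point per cell after very few levels for moderate-to-high d.

```python
# Illustrative only: average cell occupancy per Counting-tree level.
# Each level h splits every cell into 2^d subcells, so n points spread
# over at most 2^(d*h) cells at level h.

def avg_points_per_cell(n, d, h):
    """Average occupancy at tree level h for n points in d dimensions."""
    return n / (2 ** (d * h))

def levels_until_sparse(n, d):
    """Smallest level whose average occupancy falls to 1 point or less."""
    h = 0
    while avg_points_per_cell(n, d, h) > 1:
        h += 1
    return h

# Even for 1 million points, a 15-dimensional space needs very few levels:
print(levels_until_sparse(10**6, 15))   # -> 2
print(levels_until_sparse(10**6, 5))    # -> 4
```

This matches the sensitivity analysis above: deeper trees stop adding information once most used cells hold a single point, which is why H = 4 suffices in practice.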
However, STING is a traditional clustering method that proposes to analyze only two- or very low-dimensional data. It is not suitable for moderate-to-high dimensionality data clustering, since STING does not spot clusters that exist only in subspaces of the original data space. Also, STING uses a fixed density threshold to find clusters, whereas Halite applies a novel spike detection strategy based on convolution masks to find possible clusters, and then Halite confirms the clusters by using a statistical test to identify the spikes that are significantly denser than their neighboring space regions. Finally, as opposed to Halite, STING does not include a soft clustering approach.

4.9 Conclusions

This chapter described the new method Halite for correlation clustering. Previous methods are typically super-linear in space or execution time. The main strengths of Halite are that it is fast and scalable, while still giving highly accurate results. In detail, the main contributions of Halite are:

1. Scalability: it is linear in time and space with regard to the data size and to the dimensionality of the clusters. Halite is also linear in memory usage and quasi-linear in running time regarding the space dimensionality;
2. Usability: it is deterministic, robust to noise, does not have the number of clusters as a parameter, and finds clusters in subspaces formed by original axes or by their linear combinations, including space rotation;
3. Effectiveness: it is accurate, providing results with equal or better quality compared to top related works;
4. Generality: it includes a soft clustering approach, which allows points to be part of two or more clusters that overlap in the data space.

Specifically, the new algorithm encompasses: Halite0, a basic implementation of the clustering method; Halite, the optimized and finally recommended method for hard clustering; and Halites, the recommended implementation for soft clustering.
A theoretical study on the time and space complexity of Halite, shown in Sect. 4.3, as well as an extensive experimental evaluation performed over synthetic and real data spanning up to 1 million elements, corroborate these properties. Specifically, the experiments compared Halite with seven representative works. On synthetic data, Halite was consistently the fastest method, always presenting highly accurate results. Regarding real data, Halite analyzed 25-dimensional data for breast cancer diagnosis (KDD Cup 2008) at least 11 times faster than five previous works, increasing their accuracy in up to 35%, while the last two related works failed. Halite is the first algorithm described in this book that focuses on data mining in large sets of complex data. The next chapters describe two other algorithms aimed at tackling this difficult problem.

References

1. Achtert, E., Böhm, C., Kriegel, H.P., Kröger, P., Zimek, A.: Robust, complete, and efficient correlation clustering. In: SDM, USA (2007)
2. Aggarwal, C., Yu, P.: Redefining clustering for high-dimensional applications. IEEE TKDE 14(2), 210–225 (2002). doi:10.1109/69.991713
3. Aggarwal, C.C., Yu, P.S.: Finding generalized projected clusters in high dimensional spaces. SIGMOD Rec. 29(2), 70–81 (2000). doi:10.1145/335191.335383
4. Cordeiro, R.L.F., Traina, A.J.M., Faloutsos, C., Traina Jr., C.: Finding clusters in subspaces of very large, multi-dimensional datasets. In: F. Li, M.M. Moro, S. Ghandeharizadeh, J.R. Haritsa, G. Weikum, M.J. Carey, F. Casati, E.Y. Chang, I. Manolescu, S. Mehrotra, U. Dayal, V.J. Tsotras (eds.) ICDE, pp. 625–636. IEEE (2010)
5. Cordeiro, R.L.F., Traina, A.J.M., Faloutsos, C., Traina Jr., C.: Halite: Fast and scalable multi-resolution local-correlation clustering. IEEE Transactions on Knowledge and Data Engineering 99(PrePrints), 16 pages (2011). doi:10.1109/TKDE.2011.176
6.
Domeniconi, C., Gunopulos, D., Ma, S., Yan, B., Al-Razgan, M., Papadopoulos, D.: Locally adaptive metrics for clustering high dimensional data. Data Min. Knowl. Discov. 14(1), 63–97 (2007). doi:10.1007/s10618-006-0060-8
7. Gonzalez, R.C., Woods, R.E.: Digital Image Processing, 3rd edn. Prentice-Hall, Inc., Upper Saddle River, NJ, USA (2006)
8. Grunwald, P.D., Myung, I.J., Pitt, M.A.: Advances in Minimum Description Length: Theory and Applications (Neural Information Processing). The MIT Press (2005)
9. Jolliffe, I.T.: Principal Component Analysis, 2nd edn. Springer-Verlag, New York, USA (2002)
10. Moise, G., Sander, J.: Finding non-redundant, statistically significant regions in high dimensional data: a novel approach to projected and subspace clustering. In: KDD, pp. 533–541 (2008)
11. Moise, G., Sander, J., Ester, M.: P3C: A robust projected clustering algorithm. In: ICDM, pp. 414–425. IEEE Computer Society (2006)
12. Moise, G., Sander, J., Ester, M.: Robust projected clustering. Knowl. Inf. Syst. 14(3), 273–298 (2008). doi:10.1007/s10115-007-0090-6
13. Ng, E.K.K., Fu, A.W.C., Wong, R.C.W.: Projective clustering by histograms. TKDE 17(3), 369–383 (2005). doi:10.1109/TKDE.2005.47
14. Rissanen, J.: Stochastic Complexity in Statistical Inquiry Theory. World Scientific Publishing Co., Inc., River Edge, NJ, USA (1989)
15. Traina Jr., C., Traina, A.J.M., Faloutsos, C., Seeger, B.: Fast indexing and visualization of metric data sets using slim-trees. IEEE Trans. Knowl. Data Eng. 14(2), 244–260 (2002)
16. Traina Jr., C., Traina, A.J.M., Wu, L., Faloutsos, C.: Fast feature selection using fractal dimension. In: SBBD, pp. 158–171 (2000)
17. Wang, W., Yang, J., Muntz, R.: STING: A statistical information grid approach to spatial data mining. In: VLDB, pp. 186–195 (1997)
18. Yip, K., Cheung, D., Ng, M.: HARP: a practical projected clustering algorithm. TKDE 16(11), 1387–1397 (2004). doi:10.1109/TKDE.2004.74
19.
Yiu, M.L., Mamoulis, N.: Iterative projected clustering by subspace mining. TKDE 17(2), 176–189 (2005). doi:10.1109/TKDE.2005.29

Chapter 5
BoW

Abstract Enterprises are accumulating large amounts of data, and today it is already feasible to have Terabyte- or even Petabyte-scale datasets that must be submitted to data mining processes. However, given a Terabyte-scale dataset of moderate-to-high dimensionality, how could one cluster its points? Numerous successful, serial clustering algorithms for data in five or more dimensions exist in the literature, including the algorithm Halite that we described in the previous chapter. However, the existing algorithms are impractical for datasets spanning Terabytes and Petabytes, and examples of applications with such huge amounts of data in five or more dimensions abound (e.g., Twitter crawl: >12 TB, Yahoo! operational data: 5 Petabytes [6]). This limitation was previously summarized in Table 3.1. For datasets that do not even fit on a single disk, parallelism is a first class option, and thus we must re-think, re-design and re-implement existing serial algorithms in order to allow for parallel processing. With that in mind, this chapter presents one work that explores parallelism using MapReduce for clustering huge datasets. Specifically, we describe in detail a second algorithm, named BoW [5], that focuses on data mining in large sets of complex data.

Keywords Correlation clustering · Terabyte-scale data mining · MapReduce · Hadoop · Big data · Complex data · Large graphs · Social networks

5.1 Introduction

Given a Terabyte-scale dataset of moderate-to-high dimensional elements, how could one cluster them? Numerous successful, serial clustering algorithms for data in five or more dimensions exist in the literature, including the algorithm Halite that we described in the previous chapter. However, the existing algorithms are impractical for data spanning Terabytes and Petabytes (e.g., Twitter crawl: >12 TB, Yahoo!
operational data: 5 Petabytes [6]). In such cases, the data are already stored on multiple disks, as the largest modern disks are 1–2 TB. Just to read a single Terabyte of data (at 5 GB/min on a single modern eSATA disk) takes more than 3 hours. Thus, parallelism is not another option—it is by far the best choice. Nevertheless, good serial clustering algorithms and strategies are still extremely valuable, because we can (and should) use them as 'plug-ins' for parallel clustering. Naturally, the best algorithm is the one that (a) uses a fast, scalable serial algorithm and (b) makes it run efficiently in parallel. This is exactly what the method described here does.

[R. L. F. Cordeiro et al., Data Mining in Large Sets of Complex Data, SpringerBriefs in Computer Science, DOI: 10.1007/978-1-4471-4890-6_5, © The Author(s) 2013]

Examples of applications with Terabytes of data in five or more dimensions abound: weather monitoring systems and climate change models, where we want to record wind speed, temperature, rain, humidity, pollutants, etc.; social networks like Facebook™, with millions of nodes and several attributes per node (gender, age, number of friends, etc.); astrophysics data, such as the Sloan Digital Sky Survey, with billions of galaxies and attributes like red-shift, diameter, spectrum, etc.

This chapter focuses on the problem of finding clusters in subspaces of Terabyte-scale, moderate-to-high dimensionality datasets. The method described here uses MapReduce, and can treat as a plug-in almost any of the serial clustering methods, including the algorithm Halite described in the previous chapter. The major research challenges addressed are (a) how to minimize the I/O cost, taking into account the already existing data partition (e.g., on disks), and (b) how to minimize the network cost among processing nodes. Either of them may be the bottleneck.
Thus, we present the Best of both Worlds—BoW method [5], which automatically spots the bottleneck and chooses a good strategy. The main contributions of BoW are as follows:

1. Algorithm design and analysis: the method BoW is based on a novel, adaptive algorithm that automatically picks the best of two strategies and good parameters for it, hence its name, Best of both Worlds. One of the strategies uses a novel sampling-and-ignore idea that reduces the network traffic;
2. Effectiveness, scalability and generality: BoW can use most serial clustering methods as plug-ins, including the method Halite described before. BoW requires no user defined parameters (due to its defaults) and it maintains the serial clustering quality, with near-linear scale-up;
3. Experiments: experiments on real and synthetic data with billions of elements were performed, using hundreds of machines running in parallel.

Experiments were performed on a combination of large real and synthetic datasets, including the Yahoo! web one (provided by Yahoo! Research, www.yahoo.com). To the best of our knowledge, the Yahoo! web is the largest real dataset for which results have ever been reported in the database clustering literature for data in five or more axes. Although spanning 0.2 TB of multi-dimensional data, BoW took only 8 min to cluster it, using 128 cores. The experiments also used up to 1,024 cores, again the highest such number reported in the clustering literature for moderate-to-high dimensional data. Notice one important remark: BoW is tailored to spot clusters in subspaces of moderate-to-high dimensionality data, and it can handle most serial algorithms as plug-ins, since the only required API is that the serial algorithm should return clusters of points in hyper-rectangles, which we shall refer to as β-clusters in this book. Their definition follows the same one previously employed for the Halite algorithm, but they may also be provided by many other existing algorithms.
Overlapping β-clusters are then merged to form clusters. Indeed, the intuition is to generalize the structure of isometric crystal systems to the d-dimensional case in order to describe clusters of any shape and size that exist in subspaces only, as extensively discussed in the previous Chap. 4. Remember that the clustering methods well-suited to analyze moderate-to-high dimensionality data spot clusters that exist only in subspaces of the original, d-dimensional space (i.e., spaces formed by subsets of the original axes or linear combinations of them). Thus, the natural shape of the clusters in the original space facilitates their representation with hyper-rectangles, as the points of each cluster spread linearly through several axes (original axes or linear combinations of them) in the original space. For that reason, many of the existing serial clustering methods (e.g., CLIQUE [1, 2], FPC/CFPC [10], P3C [8, 9], STATPC [7], and Halite [3, 4]) return clusters in hyper-rectangles, and adapting others to work with BoW tends to be facilitated by the natural shape of the clusters. Nevertheless, besides focusing on spotting clusters in subspaces of moderate-to-high dimensionality data, BoW also works with traditional clustering methods and low dimensional data, if the plug-in returns clusters in hyper-rectangles.

5.2 Main Ideas of BoW: Reducing Bottlenecks

The major research problems for clustering very large datasets with MapReduce are (a) how to minimize the I/O cost, and (b) how to minimize the network cost among processing nodes. Should we split the data points at random, across machines? What should each node do, and how should we combine the results? Do we lose accuracy (if any), compared to a serial algorithm on a huge-memory machine? The method described here answers all of those questions, by careful design and by adaptively trading off disk delay and network delay.
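The merging of overlapping β-clusters mentioned above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: two d-dimensional hyper-rectangles overlap iff their intervals overlap on every axis, and overlapping β-clusters are grouped with a simple union-find.

```python
# Illustrative sketch (not the authors' code): merging overlapping
# d-dimensional hyper-rectangles (β-clusters) into final clusters.

def boxes_overlap(a, b):
    """a, b: lists of (lo, hi) intervals, one per axis.
    Hyper-rectangles overlap iff their intervals overlap on EVERY axis."""
    return all(lo1 <= hi2 and lo2 <= hi1
               for (lo1, hi1), (lo2, hi2) in zip(a, b))

def merge_beta_clusters(boxes):
    """Group overlapping boxes via union-find; returns a label per box."""
    parent = list(range(len(boxes)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if boxes_overlap(boxes[i], boxes[j]):
                parent[find(i)] = find(j)

    return [find(i) for i in range(len(boxes))]

# Three 2-d β-clusters: the first two overlap, the third stands apart.
boxes = [[(0, 2), (0, 2)], [(1, 3), (1, 3)], [(5, 6), (5, 6)]]
labels = merge_beta_clusters(boxes)
print(labels[0] == labels[1], labels[0] == labels[2])  # -> True False
```

Note how cheap this step is: it only touches the two interval endpoints per β-cluster per axis, which is why the final merging runs on a single machine at negligible cost.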
Specifically, we describe a novel, adaptive algorithm named BoW that is a hybrid between two strategies presented in this section: (1) the ParC method, which does data partitioning and merges the results; and (2) the SnI method, which does some sampling first, to reduce the communication cost at the expense of higher I/O cost. There is no universal winner between ParC and SnI, since the winner depends on the environment used and also on the dataset characteristics (see the upcoming Sect. 5.5 for details). BoW automatically picks the best option, and good parameters for it. The reason for the success of BoW is its upcoming cost-estimation formulas (Eqs. 5.4 and 5.5), which help BoW to pick the best alternative and to set proper parameters for the chosen environment, while requiring nimble computational effort. Next, we describe the methods ParC and SnI in detail.

Fig. 5.1 Which one is best? Parallel run overview for ParC (left, phases P1–P5, with costs costM(m, Fs), costS(r, Fs), costR(r, Fs) and negligible final steps) and SnI (right, with sampling, phases S1–S9, with costs costM(m, Fs), costS(1, Fs · Sr), costR(1, Fs · Sr), costM(m, Fs), costS(r, Fs · Rr), costR(r, Fs · Rr)). ParC executes the map (P1), shuffle (P2) and reduce (P3) phases once, on the full dataset. SnI uses sampling (phases S1–S4) to get rough cluster estimates and then uses phases S5–S9 to cluster the remaining points (see Sect. 5.2.2 for details). Their clustering accuracies are similar (see the upcoming Sect. 5.5). The winning approach depends on the environment; BoW uses cost-based optimization to automatically pick the best.

… received from phase P4, and not the data elements themselves. The best strategy to follow in this step is highly dependent on the criteria used by the mapper to partition the data. Thus, ParC uses distinct procedures for distinct data partitioning criteria.
The procedures used for each of the partitioning strategies studied are detailed in the upcoming Sect. 5.4.

5.2.2 Sample and Ignore: SnI

The initial implementation for parallel clustering, the ParC algorithm, reads the dataset once, aiming at minimizing disk accesses, which is the most common strategy used by serial algorithms to shrink computational costs. However, this strategy does not address the issue of minimizing the network traffic: in the shuffle phase of the ParC algorithm (phase P2 of Fig. 5.1a), almost all of the records have to be shipped over the network to the appropriate reducer. It may become a considerable bottleneck. How can this network traffic be reduced?

The main idea in this section is to minimize the network traffic for parallel clustering by exploiting the skewed distribution of cluster sizes that typically appears in real data. Most of the elements usually belong to a few large clusters, and these are exactly the elements that we try to avoid processing. Thus, we describe SnI, a parallel clustering algorithm that consists of: (a) a novel sample-and-ignore preprocessing step; and (b) the ParC algorithm from Sect. 5.2.1. The sample-and-ignore step works on a small dataset sample, spots the major clusters and ignores these clusters' members in the follow-up steps. It significantly reduces the amount of data moved in the shuffling phases of SnI, with consequent savings for the network traffic, as well as the I/O cost for the intermediate results and the processing cost for the receiving reduce tasks. Notice that this sample-and-ignore idea is an alternative, general strategy that can improve many clustering methods, and not only ParC. The SnI method is defined in Algorithm 7 and the process is illustrated in Fig. 5.1b.
Fig. 5.2 Overview of the multi-phase sample-and-ignore (SnI) method. Phase-I finds clusters on a sample of the input data (panels: (a) input dataset; (b) clusters in sample). Phase-II looks for the clusters not found in the sample, ignoring elements that fall within a previously found cluster and finding clusters using the remaining elements only (panels: (a) input dataset, excluding the space of clusters from Phase I; (b) reducer 1; (c) reducer 2; (d) merging; (e) final clusters)

… finds new clusters, denoted by the points surrounded by dotted boxes. In Phase-II (d), the clusters found by the reducers are merged with the clusters from the sampling phase using the merging/stitching strategies described in the upcoming Sect. 5.4. The global set of clusters, containing three clusters represented in Phase-II (e) by distinct gray levels, is the final output.

The main benefit of the SnI approach is realized in the shuffle/reduce stages. In Phases S2 and S3 of Fig. 5.1b, only a small sample is shuffled and processed by a receiving reducer. In Phases S6 and S7 of Fig. 5.1b, only the non-ignored elements may need to be shuffled through the network to other machines and processed. This means that most elements belonging to the major clusters spotted in the sample are ignored, never being shuffled through the network nor processed by a reducer. Compared to the ParC algorithm, SnI significantly minimizes the network cost and the reducers' processing, at the cost of reading the whole dataset twice. In other words, ParC does a single pass over the data, but almost all of the records have to be shipped over the network (in phase P2 of Fig. 5.1a) to be processed by the appropriate reducer. On the other hand, SnI minimizes the shuffle/reduce cost, at the expense of reading the dataset one extra time. Which approach is best? The answer is given in Sect. 5.3.
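The two-phase flow just described can be sketched as follows. This is an illustrative reconstruction under simplifying assumptions, not Algorithm 7 itself: the plug-in clusterer is an arbitrary callable returning axis-aligned hyper-rectangles, and sampling is plain Bernoulli sampling.

```python
import random

# Illustrative sketch of the SnI idea (not Algorithm 7 itself): cluster a
# sample, then ignore every point falling inside the hyper-rectangles
# found, before clustering the (much smaller) remainder.

def inside(point, box):
    """True if the point falls inside the hyper-rectangle `box`."""
    return all(lo <= x <= hi for x, (lo, hi) in zip(point, box))

def sample_and_ignore(points, plug_in, sample_ratio=0.1, seed=42):
    rng = random.Random(seed)
    # Phase I: cluster a small sample, spotting the major clusters.
    sample = [p for p in points if rng.random() < sample_ratio]
    major_boxes = plug_in(sample)
    # Phase II: keep only the points outside every major cluster and
    # cluster this remainder (far less data is shuffled and reduced).
    remainder = [p for p in points
                 if not any(inside(p, b) for b in major_boxes)]
    minor_boxes = plug_in(remainder)
    return major_boxes + minor_boxes  # later merged/stitched (Sect. 5.4)

# Stand-in plug-in: one bounding box around all the points it receives.
def bounding_box(pts):
    if not pts:
        return []
    return [[(min(xs), max(xs)) for xs in zip(*pts)]]

pts = [(float(i), float(i)) for i in range(100)]
boxes = sample_and_ignore(pts, bounding_box)
```

The sketch makes the trade-off visible: `points` is scanned twice (the two map phases), but only `sample` and `remainder` ever reach the clustering step, mirroring the reduced shuffle traffic of phases S2/S3 and S6/S7.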
5.3 Cost-Based Optimization of BoW

This section presents an adaptive, hybrid method named BoW (Best of both Worlds) that exploits the advantages of the previously described approaches, ParC and SnI, taking the best of them. There is no universal winner, since the winner depends on the environment and on the dataset characteristics. See the upcoming Sect. 5.5 for a complete explanation. Therefore, the main open question is: when should the novel sampling-and-ignore idea be used, and when should it be avoided? ParC executes the map, shuffle and reduce phases only once, on the whole dataset. SnI reduces the amount of data to be shipped to and processed by the reducers, at the expense of a second pass over the input data (in the map phase). This section presents a cost-based optimization that uses simple analytic models to estimate the running time of each clustering strategy. BoW picks the strategy with the lowest estimated cost.

The environmental parameters required by BoW are presented in Table 5.1. They describe the hardware characteristics (i.e., the specs of the available MapReduce cluster), the total amount of data to be processed, and the cost estimate for the plugged-in serial clustering method. Setting the value for Fs is straightforward. Ds, Ns and start_up_cost(t) are inferred by analyzing the logs of the cloud of computers, while plug_in_cost(s) is defined based on the plugged-in method's original time complexity analysis and/or experiments, or measured by the user in a simple experiment. Notice: each machine in the cloud may run many MapReduce tasks (mappers and/or reducers) in parallel, sharing the machine's disks and network connection. Therefore, Ns and Ds are expected to be smaller than the effective network bandwidth and disk transfer rate, respectively. Two other parameters are used, shown in Table 5.2, and reasonable default values are provided for them based on empirical evaluation.
Notice one important remark: as is common knowledge in database query optimization, at the cross-over point of two strategies the wall-clock-time performances usually create flat plateaus, being not very sensitive to parameter variations. This occurs in BoW's setting, and the results in the upcoming Figs. 5.9a, 5.10a and d exemplify it (notice the log-log scale). Thus, tuning exact values for these parameters barely affects BoW's results, and the suggested values are expected to work well in most cases.

The following lemmas and proofs define the equations of the cost-based optimization. First, we describe the expected costs for complete map, shuffle and reduce phases relative to the number of mappers and/or reducers available and to the amount of data involved. Then, we show BoW's inferred costs for: ParC, which minimizes disk accesses; and SnI, which aims at shrinking the network cost. For clarity, consider again Fig. 5.1, which provides a graphical overview of the parallel execution of both methods, including their expected cost equations.

Table 5.1 Environmental parameters

- Fs (data file size, bytes): size of the dataset to be clustered.
- Ds (disk speed, bytes/sec): average number of bytes per second that a MapReduce task (mapper or reducer) is able to read from local disks, i.e., the average disk transfer rate per MapReduce task.
- Ns (network speed, bytes/sec): average number of bytes per second that a MapReduce task (mapper or reducer) is able to read from other computers in the cloud, i.e., the average network transfer rate per MapReduce task.
- start_up_cost(t) (start-up cost, seconds): average time to start up t MapReduce tasks (mappers or reducers).
- plug_in_cost(s) (plug-in cost, seconds): average time to run the plugged-in serial method over s data bytes on a standard computer in the cloud.
Lemma 5.1 Map Cost—the expected cost for the map phase of the parallel clustering approaches is a function of the number of mappers m used and the involved data size s, given by:

  costM(m, s) = start_up_cost(m) + (s / m) · (1 / Ds)    (5.1)

Proof In the map phase, m mappers are started up at the cost of start_up_cost(m). Additionally, the majority of the time spent is related to reading the input dataset from disk. s bytes of data will be read in parallel by m mappers, which are able to read Ds bytes per second each. Thus, the total reading time is given by: (s / m) · (1 / Ds).

Table 5.2 Other parameters
  Dr (dispersion ratio, default 0.5): ratio of data transferred in the shuffle phase through the network (distinct machines) relative to the total amount of data processed.
  Rr (reduction ratio, default 0.1): ratio of data that do not belong to the major clusters found in the sampling phase of SnI relative to the full data size Fs.

Lemma 5.2 Shuffle Cost—the expected shuffle cost of the parallel clustering approach is a function of the number of reducers r to receive the data and the amount of data to be shuffled s, which is given by:

  costS(r, s) = (s · Dr / r) · (1 / Ns)    (5.2)

Proof The majority of the shuffling cost is related to shipping data between distinct machines through the network. Whenever possible, MapReduce minimizes this cost by assigning reduce tasks to the machines that already have the required data in local disks. Dr is the ratio of data actually shipped between distinct machines relative to the total amount of data processed. Thus, the total amount of data to be shipped is s · Dr bytes. The data will be received in parallel by r reducers, each one receiving on average Ns bytes per second. Thus, the total cost is given by: (s · Dr / r) · (1 / Ns).
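Under the assumptions of Lemmas 5.1 and 5.2, the two costs can be sketched as follows. This is only an illustration: the function and parameter names are ours, the linear start-up cost model is an assumption, and the example numbers are hypothetical, not measured.

```python
# Sketch of Eqs. 5.1 and 5.2. Parameter values come from Tables 5.1/5.2 in
# practice; the ones used here are illustrative.

def start_up_cost(t, per_task=0.1):
    """Average time (seconds) to start up t MapReduce tasks (assumed linear)."""
    return per_task * t

def cost_m(m, s, ds):
    """Eq. 5.1: map cost for m mappers over s bytes, disk speed ds (bytes/s)."""
    return start_up_cost(m) + (s / m) * (1.0 / ds)

def cost_s(r, s, ns, dr=0.5):
    """Eq. 5.2: shuffle cost for r reducers, s bytes shuffled, network speed
    ns (bytes/s), dispersion ratio dr (default 0.5, Table 5.2)."""
    return (s * dr / r) * (1.0 / ns)

# Example: shuffling 14 GB to 16 reducers at 20 MB/s per task.
print(cost_s(16, 14e9, 20e6))  # ≈ 21.9 seconds
```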
Lemma 5.3 Reduce Cost—the expected cost for the reduce phase is a function of the number of reducers r used for parallel processing and the size s of the data involved, which is:

  costR(r, s) = start_up_cost(r) + (s / r) · (1 / Ds) + plug_in_cost(s / r)    (5.3)

Proof In the reduce phase, r reducers are started up at cost start_up_cost(r). Then, the reducers read from disk s bytes in parallel at the individual cost of Ds bytes per second. Thus, the total reading time is (s / r) · (1 / Ds). Finally, the plugged-in serial clustering method is executed in parallel over partitions of the data, whose average sizes are s / r. Therefore, the approximate clustering cost is plug_in_cost(s / r).

Lemma 5.4 ParC Cost—the expected cost of the ParC algorithm is given by:

  costC = costM(m, Fs) + costS(r, Fs) + costR(r, Fs)    (5.4)

Proof The parallel processing for ParC is as follows: (1) Fs bytes of data are processed in the map phase by m mappers; (2) Fs bytes of data are shuffled to r reducers in the shuffle phase; (3) Fs bytes of data are processed in the reduce phase by r reducers; and (4) a single machine merges all the β-clusters found. The last step has a negligible cost, since it performs simple computations over data amounting to two float values per β-cluster, per dimension. Thus, summing the costs of the three initial phases leads to the expected cost for ParC.

Lemma 5.5 SnI Cost—the expected cost for the SnI algorithm is given by:

  costCs = 2 · costM(m, Fs) + costS(1, Fs · Sr) + costR(1, Fs · Sr) + costS(r, Fs · Rr) + costR(r, Fs · Rr)    (5.5)

Proof SnI runs two complete map, shuffle and reduce phases. In both map phases, the full dataset is processed by m mappers, at combined cost: 2 · costM(m, Fs). In the first shuffle phase, a data sample of size Fs · Sr bytes is shuffled to a single reducer, at cost costS(1, Fs · Sr). The reduce cost to process this sample is: costR(1, Fs · Sr).
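Putting Eqs. 5.1–5.5 together gives the two totals that BoW compares (the decision itself is Algorithm 8, presented shortly). A self-contained sketch; the function names are ours, and the constants mirror the example M45 values reported later in Table 5.4 rather than any general setting:

```python
# Sketch of Eqs. 5.3-5.5 and the ParC-vs-SnI comparison.

def start_up_cost(t):          # Table 5.1 (assumed linear: 0.1 s per task)
    return 0.1 * t

def plug_in_cost(s):           # Table 5.1 (assumed linear in the data size s)
    return 1.4e-7 * s

DS, NS, DR, RR = 40e6, 20e6, 0.5, 0.1   # example values (Tables 5.2 and 5.4)

def cost_m(m, s):              # Eq. 5.1: map cost
    return start_up_cost(m) + (s / m) / DS

def cost_s(r, s):              # Eq. 5.2: shuffle cost
    return (s * DR / r) / NS

def cost_r(r, s):              # Eq. 5.3: reduce cost, incl. plugged-in method
    return start_up_cost(r) + (s / r) / DS + plug_in_cost(s / r)

def cost_c(m, r, fs):          # Eq. 5.4: ParC (one full pass)
    return cost_m(m, fs) + cost_s(r, fs) + cost_r(r, fs)

def cost_cs(m, r, fs, sr):     # Eq. 5.5: SnI (two passes, sampling ratio sr)
    return (2 * cost_m(m, fs)
            + cost_s(1, fs * sr) + cost_r(1, fs * sr)
            + cost_s(r, fs * RR) + cost_r(r, fs * RR))

def bow_choice(m, r, fs, sr):
    """Pick the cheaper strategy before running anything."""
    return "SnI" if cost_c(m, r, fs) > cost_cs(m, r, fs, sr) else "ParC"
```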
Rr is the ratio of data that does not belong to the major clusters, the ones found in the sampling phase, relative to Fs. That is, Fs · (1 − Rr) bytes are ignored in the Second Phase of SnI, while Fs · Rr bytes of data are not ignored, being processed after clustering the sample. Both the second shuffle and the second reduce phases involve r reducers. Thus, their combined costs are: costS(r, Fs · Rr) + costR(r, Fs · Rr). The cost for shipping and processing β-cluster descriptions is negligible, since the involved amount of data and processing is tiny. Remark: when the parallel clustering algorithms are executed, the number of distinct key values to be sorted by the MapReduce framework is extremely small; it is always equal to the number r of reducers used. Each reducer handles a single key, so it does not need to do sorting. Thus, the sorting cost is negligible for these approaches. The I/O and network costs are the real bottlenecks. The wall-clock time results measured in all experiments performed (see the upcoming Sect. 5.5) confirm this assertion. Algorithm 8 describes the main steps of BoW. In summary, ParC executes the map, shuffle and reduce phases once, involving the full dataset. SnI runs these phases twice, but involving less data. What is the fastest approach? It depends on the environment used. BoW takes the environment description as input and applies cost-based optimization to automatically choose the fastest, prior to the real execution. Provided that the clustering accuracies are similar for both approaches (see the upcoming Sect. 5.5 for a complete explanation), BoW actually picks the 'Best of both Worlds'.

Algorithm 8: The Best of both Worlds – BoW Method.
Input: dataset dS, environmental parameters (Table 5.1), other parameters (Table 5.2), number of reducers r, number of mappers m, sampling ratio Sr
Output: clusters
1: compute costC from Equation 5.4;
2: compute costCs from Equation 5.5;
3: if costC > costCs then
4:   // use the sampling-and-ignore idea
     clusters = result of SnI over dS;
5: else
6:   // no sampling
     clusters = result of ParC over dS;
7: end if
8: return clusters

5.4 Finishing Touches: Data Partitioning and Cluster Stitching

This section describes three reasonable approaches studied for data partitioning and the consequent merging and/or stitching of the clusters found in each partition. Notice that BoW works with any of the three partitioning approaches described and, potentially, with any user-defined partitioning strategy.

5.4.1 Random-Based Data Partition

The first alternative is the Random-Based Data Partition. Mappers randomly assign data elements to reducers, striving for load balance. Each reducer receives a random sample of the dataset, looks for β-clusters on it, and reports the β-clusters it finds in terms of their MBRs (Minimum Bounding Rectangles). The final step merges every pair of β-clusters that overlap in the data space. Notice that, to spot an overlap, only the descriptions of the β-clusters (MBRs) are needed, and not the elements themselves. Two clusters overlap if they overlap in every axis j. Let uij and lij represent respectively the upper and lower bounds of cluster i at axis j. Similarly, let ui′j and li′j represent the bounds of cluster i′ at axis j. Two β-clusters i and i′ overlap if uij ≥ li′j ∧ lij ≤ ui′j holds for every axis j. Figure 5.3I illustrates a simulation of this process assuming that we have r = 2 reducers. The first reducer gets the points indicated as 'white-circles' and the second one gets the 'black-circles'; both reducers run a typical clustering algorithm, returning the MBRs of the β-clusters they discover (Fig. 5.3I, b and c).
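The per-axis overlap test described above (uij ≥ li′j ∧ lij ≤ ui′j for every axis j) can be sketched as follows. The representation of an MBR as a list of per-axis (lower, upper) bounds and the function name are our choices for illustration:

```python
# Sketch of the MBR overlap test used by the merge step.

def overlap(mbr_a, mbr_b):
    """True iff two MBRs overlap; each MBR is a list of (lower, upper)
    bounds, one pair per axis. Overlap must hold on every axis."""
    return all(ua >= lb and la <= ub
               for (la, ua), (lb, ub) in zip(mbr_a, mbr_b))

# Two 2-dimensional beta-clusters:
print(overlap([(0, 2), (0, 2)], [(1, 3), (1, 3)]))  # True: overlap on x and y
print(overlap([(0, 2), (0, 2)], [(3, 4), (1, 3)]))  # False: disjoint on x
```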
Then, BoW merges the overlapping β-clusters (Fig. 5.3I, d), and returns the results, indicated as the shaded areas of Fig. 5.3I, e. Notice that some data points may be left as outliers, which is a possibility for all the parallel methods that we describe, as well as for most serial clustering algorithms.

5.4.2 Location-Based Data Partition

The second alternative is the Location-Based Data Partition. The idea here is to divide the address space, trading off load balance to achieve better merging of the β-clusters. Specifically, BoW partitions the address space into r disjoint regions (say, hyper-rectangles, by bisecting some coordinate axes), where r is the number of reducers. The mappers are given the boundaries of every region, and direct each element accordingly. In the current implementation, BoW requires r to be a power of two, since the partitions are created by dividing each dimension in half as needed. Figure 5.3II illustrates a simulation of the process, using the same toy dataset of Fig. 5.3I that we used to illustrate the previous approach. The data elements are assigned to reducers according to their location (vertical dashed line). Again, each of the two reducers generates MBRs of the β-clusters it finds. Then BoW (a) merges

Fig. 5.3 'File-based' wins. Clustering examples for the three data partitioning approaches: (I) Random-Based, (II) Location-Based and (III) File-Based Data Partition; in each, panels (a)–(e) show the example dataset, reducer 1, reducer 2, the merge input and the final result. We assume exactly the same 2-dimensional input dataset, with r = 2 reducers. I–left: assigns elements at random to reducers, and merges the resulting β-clusters that overlap.
II–middle: divides the address space into disjoint regions, assigns each region to a reducer, and then either merges or stitches the appropriate β-clusters (see Sect. 5.4.2 for details). III–right: assigns elements to reducers according to their position in the data file, and hopes that, due to locality, the resulting β-clusters will have little overlap. As shown in the upcoming Sect. 5.5, the 'file-based' strategy outperforms the first two alternatives.

those that overlap and (b) stitches those that touch, like the two β-clusters on the top of Fig. 5.3II, d. The stitching step requires a careful design. It intends to stitch together the clusters that touch in partitioned positions with respect to one or more axes, and have "enough touching area" with regard to all other axes. In our running example, Fig. 5.4 shows the input for this step. The β-clusters i and i′ touch in a partitioned position of axis x. BoW proposes to stitch two β-clusters if the area that they jointly touch is larger than the disjoint areas. In more detail, for the example of Fig. 5.4, the "touching area" of the β-clusters i and i′ regarding axis y is hi∩i′. As in the illustration, let hi and hi′ be the individual 1-dimensional heights with regard to axis y of the clusters i and i′, respectively. BoW considers this "touching area" as being "large enough" for stitching if the common part is larger than the union of the non-common parts, for each axis that does not touch in a partitioned position. It is defined by the following equation:

  hi∩i′ > (hi − hi∩i′) + (hi′ − hi∩i′)    (5.6)

Notice that the "large enough" criterion is parameter-free. Algorithm 9 gives the full pseudo-code. In our running example, Fig. 5.3II, e shows the final output for the merging/stitching procedure, assuming that the upper two β-clusters were stitched. The intermediate set of six β-clusters is summarized into three clusters, represented in three distinct gray levels in the illustration.
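For one axis that does not touch at a partitioned position, the parameter-free test of Eq. 5.6 can be sketched as follows. The interval representation and the function name are ours; only the inequality itself comes from the text:

```python
# Sketch of the "large enough" touching-area test of Eq. 5.6, for one axis.
# Each cluster's extent along the axis is an interval (lower, upper).

def touching_area_large_enough(iv_i, iv_ip):
    """True iff the common part of the two 1-d extents exceeds the union
    of the non-common parts: h_common > (h_i - h_common) + (h_ip - h_common)."""
    h_i = iv_i[1] - iv_i[0]
    h_ip = iv_ip[1] - iv_ip[0]
    h_common = max(0.0, min(iv_i[1], iv_ip[1]) - max(iv_i[0], iv_ip[0]))
    return h_common > (h_i - h_common) + (h_ip - h_common)

# Heights 10 and 9 with a common part of 8: 8 > (10-8) + (9-8) = 3, so stitch.
print(touching_area_large_enough((0, 10), (2, 11)))  # True
print(touching_area_large_enough((0, 10), (8, 20)))  # False: 2 > 8 + 10 fails
```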
Algorithm 9: Stitching β-clusters i and i′.
Input: uij and lij, upper and lower bounds of β-cluster i in each axis j;
       ui′j and li′j, upper and lower bounds of β-cluster i′ in each axis j
Output: merge
1: merge = true;
2: for each axis j do
3:   if (not (uij ≥ li′j ∧ lij ≤ ui′j)) ∧ (not (axis j was partitioned ∧ (uij = li′j = partitioned_position ∨ lij = ui′j = partitioned_position))) then
4:     // neither overlap nor touch in a partitioned position in j
5:     merge = false;
6:   end if
7: end for
8: if merge then
9:   for each axis j do
10:    if (uij ≥ li′j ∧ lij ≤ ui′j) then
11:      compute hi, hi′ and hi∩i′ wrt j;
12:      if hi∩i′ ≤ (hi − hi∩i′) + (hi′ − hi∩i′) then
13:        merge = false; // not "enough touching area" in j
14:      end if
15:    end if
16:  end for
17: end if
18: return merge

Fig. 5.4 Merging and Stitching for the Location-based approach. Merging: the three lower-right β-clusters are merged, since they overlap. Stitching: the β-clusters i and i′ are stitched to form a bigger cluster, since the height of the "touching area" is large enough compared to the heights of the β-clusters.

5.4.3 File-Based Data Partition

The third approach is the File-Based Data Partition. This approach has perfect load balance, assigning the first 1/r portion of the records to the first reducer, the second 1/r portion to the second one, and so on. The rationale is that it may also facilitate the merging of the β-clusters, because data elements that are stored consecutively on the disk may be nearby in the address space too, due to locality. The specific steps are as follows: BoW divides the input file into r pieces of nearly equal sizes, whose elements are sequentially stored in the file. The MapReduce mappers receive the total number of elements η and the total number of reducers r available for parallel processing as input parameters.
When an element is received, a mapper takes into account the physical order o of the element in the input file to define its appropriate key. The key is computed by the following equation: floor(o / ceil((η + 1) / r)), assuring an even number of elements for each partition. Thus, each reducer receives a set of elements sequentially stored in the input file, and then looks for β-clusters on it. The final step of the computation is identical to the random-based data partitioning approach: BoW merges every pair of β-clusters that overlap in the address space. Figure 5.3III illustrates a simulation of the process assuming that we have r = 2 reducers. It follows the same process as in the random-based approach, except for the first step, where the data elements are assigned to reducers according to their location in the file. Assuming locality, we expect most of the black circles to be close in space, and similarly for the white circles. Each reducer reports its MBRs, and then the β-clusters with overlapping MBRs are merged. The hope is that, due to locality, there will be many fewer pairs of overlapping β-clusters than in the random case, while enjoying even better load balancing.

5.5 Experimental Results

In this section, we describe the experiments performed to test the algorithms presented in the chapter. The experiments aimed at answering the following questions:
Q1 Among the reasonable choices described in Sect. 5.4, what is the best data partitioning approach?
Q2 How much (if at all) does the parallelism affect the clustering quality?
Q3 How does the parallel clustering method scale up?
Q4 How accurate are the equations used in the cost-based optimization?
All experiments used the Hadoop^2 implementation for the MapReduce framework, on two Hadoop clusters: the M45 by Yahoo! and the DISC/Cloud by the Parallel Data Lab at Carnegie Mellon University.
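Returning to the file-based key computation of Sect. 5.4.3 above, a minimal sketch: only the formula floor(o / ceil((η + 1) / r)) comes from the text; the function name is hypothetical.

```python
import math

# Sketch of the file-based partitioning key: the element with physical order
# o (0-based) in a file of eta elements is routed to one of r reducer keys.

def partition_key(o, eta, r):
    """floor(o / ceil((eta + 1) / r)): consecutive file positions share a key,
    so each reducer gets a contiguous, nearly equal-sized chunk of the file."""
    return o // math.ceil((eta + 1) / r)

# 10 elements over r = 2 reducers: ceil(11/2) = 6, so orders 0-5 go to
# key 0 and orders 6-9 go to key 1.
print([partition_key(o, 10, 2) for o in range(10)])
```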
The M45 is one of the top 50 supercomputers in the world, totaling 400 machines (3,200 cores), 1.5 PB of storage and 3.5 TB of main memory. The DISC/Cloud has 512 cores, distributed over 64 machines, 1 TB of RAM and 256 TB of raw disk storage. The algorithm Halite was used as the plugged-in serial clustering method in all experiments.

2 www.hadoop.com

The methods were tested over the real and synthetic datasets listed in Table 5.3, which are detailed as follows.

Table 5.3 Summary of datasets (TB = Terabytes, GB = Gigabytes)
  YahooEig: 1.4 billion points, 6 axes, 0.2 TB
  TwitterEig: 62 million points, 10 axes, 14 GB
  Synthetic: up to 100 million points, 15 axes, up to 14 GB

• YahooEig: the top 6 eigenvectors from the adjacency matrix of one of the largest web graphs. The web graph was crawled by Yahoo!^3 in 2002 and contains 1.4 billion nodes and 6.6 billion edges. The eigenvectors amount to 0.2 TB.
• TwitterEig: the top 10 eigenvectors from the adjacency matrix of the Twitter^4 graph, which represents 62 million users and their relationships. The eigenvectors amount to 14 GB.
• Synthetic: a group of datasets with sizes varying from 100 thousand up to 100 million 15-dimensional points, containing 10 clusters each, and no noise. Clusters in subspaces of the original 15-dimensional space were created following standard procedures used by most of the clustering algorithms described in Chap. 3, including the plugged-in serial clustering method used in the experiments. Specifically, Algorithm 6 was used again to generate the synthetic data. Axes-aligned clusters were created. Remember that the clusters generated by Algorithm 6 follow normal distributions with random mean and random standard deviation in at least 50% of the axes (relevant axes), spreading through at most 15% of the axes' domains. In the other axes, the irrelevant ones, all clusters follow the uniform distribution, spreading through the whole axes' domains.
Notice one remark: to evaluate how much (if at all) parallelism affects the serial clustering quality, the ideal strategy is to use as ground truth the clustering results obtained by running the plugged-in algorithm serially on any dataset, synthetic or real, and to compare these results to the ones obtained with parallel processing. However, for most of the large datasets analyzed in the experiments, running a serial algorithm (Halite or, potentially, any other serial clustering method for moderate-to-high dimensionality data) is an impractical task—it would require impractical amounts of main memory and/or take a very long time. Thus, in practice, the Synthetic datasets are the only ones for which clustering ground truth is available, and they were used to evaluate the quality of all tested techniques in all experiments performed. For a fair comparison with the plugged-in serial algorithm, the quality is computed following the same procedure used in Chap. 4. That is, the quality is computed by comparing the results provided by each technique to the ground truth, based on the averaged precision and recall of all clusters.

3 www.yahoo.com
4 http://twitter.com/

Table 5.4 Environmental parameters for M45
  Fs: data file size
  Ds: 40 MB/s
  Ns: 20 MB/s
  start_up_cost(t): 0.1 t
  plug_in_cost(s): 1.4E−7 s

The File-based Data Partitioning strategy may provide distinct quality results according to the order in which the input data is physically stored. Obviously, the best results appear when the data is totally ordered, i.e., the points of each cluster are sequentially stored in the data file. On the other hand, when the points are randomly distributed in the file, the quality tends to be similar to that obtained by the approaches that use the Random-based Data Partitioning.
For a fair analysis, each dataset from the Synthetic group was created considering an average case, i.e., 50% of the elements from the totally ordered case were randomly repositioned throughout the data file. All experiments involving BoW were performed on M45. The parameters used are presented in Table 5.4. Fs refers to the data file size. Ds, Ns and start_up_cost(t) were inferred by analyzing the logs of the M45 machines, while plug_in_cost(s) was defined based on the time complexity analysis and experiments of the plugged-in method, which were previously presented in Chap. 4. The results on quality and wall-clock time reported for all experiments are the average of 10 distinct runs. A sample size of nearly one million elements (i.e., Sr = 1 million / η) was used in all experiments. Also, in every experiment the number of mappers m used was automatically defined by Hadoop.

5.5.1 Comparing the Data Partitioning Strategies

This section presents experiments that compare the data partitioning approaches. The experiments intend to answer question Q1: among the reasonable choices in Sect. 5.4, what is the best data partitioning approach? In order to answer it, ParC was used, the most straightforward parallel algorithm described. This decision aims at avoiding that algorithmic characteristics influence the results, which should be related only to the pros and cons of the data partitioning strategies. That is, the intention is to avoid that the algorithm used leads to biased results. Figures 5.5a and b show the quality of ParC using distinct data partitioning approaches (ParC-F—file-based, ParC-R—random-based and ParC-L—location-based) versus the wall-clock time over the Synthetic datasets, varying η and r respectively. The data sizes vary from 100 thousand to 100 million elements, while the number of reducers r starts at 2 and goes up to 16. The glyph sizes reflect the dataset size (Fig. 5.5a) or the number of reducers (Fig. 5.5b).
Obviously, the ideal elements are in the top left of both plots, which represent 100% quality obtained in zero time. Thus, the strategies were compared by analyzing how close they are to these elements in all cases.

Fig. 5.5 File-based wins. Quality versus run time for ParC using distinct data partitioning approaches (ParC-F—file-based, yellow triangles; ParC-R—random-based, blue squares; and ParC-L—location-based, orange diamonds). Left: 64 reducers, varying the data size η = 100K, 1M, 10M, 100M. Right: 10 million elements dataset, varying the number of reducers r = 2, 4, 8, 16. The glyph sizes reflect the dataset size (a) or the number of reducers (b). Top-left is the ideal element—notice that ParC-F is consistently closer to it than the others. Thus, the file-based data partitioning approach is the one recommended to be used with BoW.

Notice that the three strategies present good quality, with a few exceptions for the Location- and the Random-based ones. However, the File-based strategy consistently outperformed the others, presenting top quality and being the fastest one in all cases. The other Synthetic datasets generated very similar results. Thus, the File-based Data Partition is the partitioning strategy recommended to be used with BoW. The experiments presented in the rest of this chapter always employ this strategy. We believe that the main reason for the success of the 'file-based' approach is that, when using this approach, the reducers process contiguous pieces of the input file. This helps Hadoop to assign reduce tasks to the machines that already have the required data in local disks, turning the 'file-based' approach into the fastest one.
Also, all experiments confirmed the hope that, due to locality, there would be many fewer pairs of overlapping β-clusters than in the random case, while enjoying even better load balancing.

5.5.2 Quality of Results

This section presents experiments that aim at answering question Q2: how much (if at all) does the parallelism affect the clustering quality? Figure 5.6 presents the quality results obtained by ParC, SnI and BoW over the Synthetic dataset with 10 million elements. All methods presented top quality, even for large numbers of reducers, like 1,024. Notice that the serial execution quality of the plugged-in clustering method is the one obtained when using a single reducer (r = 1, extreme left elements in the plot). Similar results were observed with all Synthetic datasets.

Fig. 5.6 All variants give high quality results. 10 million dataset; quality versus number r of reducers for ParC, SnI and BoW. All methods match the quality of the serial clustering method (top left), for all values of r, up to 1,024. The default, 'file-based' partitioning was used for all cases.

An interesting observation is that the quality may decrease for small datasets when using a large number of reducers. The obvious reason is that, in those cases, the method is partitioning a small amount of data over a large number of reducers, which actually receive too little data, not enough to represent the patterns existing in the dataset. This fact was confirmed in all experiments, leading to the recommendation of using at least ∼150k points per reducer on average, that is, to set r ≤ η / 150k. According to the experiments, the answer to question Q2 is: as long as you have enough data, the clustering quality is barely affected by the parallelism, even for extremely large numbers of reducers, such as 1,024.
BoW obtained top quality clusters in very little time for all of the very large datasets analyzed.

5.5.3 Scale-Up Results

This section presents experiments that aim at answering question Q3: how does the parallel clustering method scale up? Scale-up results with different numbers of reducers are in Fig. 5.7. Here the TwitterEig eigenvectors and the Synthetic dataset with 100 million points were used. The plots show the number of reducers r on the X-axes, and on the Y-axes the relative performance with n reducers compared to using 1 reducer (TwitterEig) or 4 reducers (Synthetic). A fixed number of mappers m = ∼700 was used. The results reported are the average of 10 distinct runs. 4 reducers were picked for the Synthetic dataset, as the running time using just one reducer was impractical. Note that the parallel clustering method exhibits the expected behavior: it starts with near-linear scale-up, and then flattens. Similar scale-up results were obtained for all other datasets. The scale-up results with different data sizes are in Fig. 5.8. The YahooEig dataset is used. Random samples of the data with increasing sizes, up to the full dataset (1.4 billion elements), were generated to perform this experiment. Wall-clock time versus data size is shown. The wall-clock time reported is the average time for 10

Fig. 5.7 Near-linear scale-up. Scale-up results regarding the number of reducers r. The parallel clustering method exhibits the expected behavior: it starts with near-linear scale-up, and then flattens. Numbers are the average of 10 runs for real and synthetic data. 100 million dataset (left); TwitterEig (right).
The X-axes show the number of reducers r, and the Y-axes the relative performance with r reducers compared to using 1 reducer (right) or 4 reducers (left), in lin-lin scales. Using one reducer in the left case requires prohibitively long time. Number of mappers: m = ∼700. The default, 'file-based' partitioning was used for all cases.

Fig. 5.8 Scale-up results: the parallel clustering method is linear on the dataset size. Wall-clock time (average of 10 runs) versus data size in lin-lin scales. Random samples from YahooEig, up to the full dataset (1.4 billion elements). Fixed number of reducers and mappers (r = 128 and m = ∼700). The default, 'file-based' partitioning was used.

distinct runs. Fixed numbers of reducers and mappers (r = 128 and m = ∼700) were used. As shown, the parallel clustering method has the expected scalability, scaling up linearly with the data size. It took only ∼8 minutes to cluster the full dataset, which amounts to 200 GB. To provide some context for this result, the time taken at different stages of the process is characterized: (a) the mappers took 47 seconds to read the data from disks; (b) 65 seconds were taken to shuffle the data; and (c) the reduce stage took 330 seconds. To estimate the time taken by the serial method in item (c), a random sample of the YahooEig dataset, of size Fs / r = 0.2 TB / 128, was clustered by running the plug-in on a single machine (one core), similar to the ones of the used cloud of computers. The serial clustering time was 192 seconds. This indicates that the plug-in took ∼43% of the total time. Similar results were obtained for all other datasets.

Fig. 5.9 BoW wins. Results for real data from Twitter. Wall-clock time versus number of reducers in log–log scale. ∼700 MapReduce mappers were used for all runs. Left: ParC (yellow down-triangles) and SnI (green butterflies).
The latter uses the novel sampling-and-ignore idea. Right: the same, including the method BoW (in red up-triangles). BoW achieves the best of both worlds, using cost-based optimization to pick the winning strategy and good parameters for it, and thus practically over-writes the corresponding curve on the graph.

5.5.4 Accuracy of the Cost Equations

Here we present experiments that illustrate the accuracy of the cost formulas shown in Eqs. 5.4 and 5.5 from Sect. 5.3, and the ability of BoW to choose the correct alternative. Figure 5.9 shows an example of BoW's results on the TwitterEig dataset. It plots the wall-clock time (average of 10 runs) versus the number of reducers, in log–log scales. Figure 5.9a shows the results for ParC, in yellow down-triangles, and SnI, in green 'butterfly' glyphs. The latter uses the novel sampling-and-ignore idea. Notice that there is no universal winner, with a cross-over point at about 30 machines for this setting. Figure 5.9b shows exactly the same results, this time including the wall-clock time of the method BoW, in red up-triangles. Notice that BoW locks onto the best of the two alternatives. The reason for its success is the cost-estimation formulas from Eqs. 5.4 and 5.5, which help BoW to pick the best alternative and to set good parameters for the chosen environment, while requiring nimble computational effort. Furthermore, notice that the two curves shown in log–log scale in Fig. 5.9a intersect at a narrow angle, which means that the optimal curve has a smooth plateau, and thus the cost is rather robust with respect to small variations of the environment parameters. Figure 5.10 details the results for the Twitter data and also reports results for the Synthetic (100 million points) dataset, in the left and right columns respectively. The six plots give the wall-clock times (average of 10 runs) versus the number of reducers r, in log–log scales.
The top row, (a) and (d), shows that BoW, in red up-triangles, consistently picks the winning strategy among the two options: ParC (yellow down-triangles) and SnI (dark-green butterflies). For both datasets BoW gives results so close to the winner that its curve practically overwrites the winner's curve; the only overhead of BoW is the CPU time required to run the cost equations, which is obviously negligible.

Fig. 5.10 BoW indeed achieves the Best of both Worlds. BoW's results on the TwitterEig (left) and on the Synthetic 100 million (right) datasets. Time (average of 10 runs) versus number of reducers in log–log scale. m is ∼700 for all runs. Top line: illustration of BoW's ability to pick the winner. Results for ParC (yellow down-triangles), for SnI (green butterflies) and for BoW (red up-triangles). Notice that BoW achieves the best of both worlds, consistently choosing the winning strategy, and thus practically over-writing the corresponding curve on those graphs. Bottom two rows: illustration of the accuracy of Eqs. 5.4 and 5.5 for ParC and SnI, respectively.
In all cases, the green hour-glass shapes stand for the formulas; notice how close they are to the actual measurements (yellow triangles and dark-green butterfly shapes, respectively).

The next two rows of Fig. 5.10 illustrate the accuracy of the cost formulas. Light-green hour-glasses indicate the theoretical prediction; yellow triangles stand for ParC in the middle row, and dark-green butterflies stand for SnI in the bottom row. Notice that the theory and the measurements usually agree very well. All other datasets provided similar results.

5.6 Conclusions

Given a very large, moderate-to-high dimensionality dataset, how could one cluster its points? For data that do not fit even on a single disk, parallelism is mandatory. In this chapter we described BoW, an algorithm that explores parallelism using MapReduce to cluster huge datasets. The main contributions of BoW are:

1. Algorithm design and analysis: the method BoW includes carefully derived cost functions that allow it to perform the automatic, dynamic trade-off between disk delay and network delay;
2. Effectiveness, scalability and generality: BoW has many desirable features. It can use almost any serial method as a plug-in (the only requirement: clusters described by hyper-rectangles), it uses no user-defined parameters (due to its defaults), it matches the clustering quality of the serial algorithm, and it has near-linear scale-up;
3. Experiments: Experiments on both real and synthetic data including billions of points, and using up to 1,024 cores in parallel, were performed. To the best of our knowledge, the Yahoo! web is the largest real dataset ever reported in the database clustering literature for moderate-to-high dimensionality data. BoW clustered its 200 GB in only 8 minutes, using 128 cores. Also, the experiments used up to 1,024 cores, which is again the highest such number published in the clustering literature for moderate-to-high dimensionality data.
This chapter presented a second algorithm that focuses on data mining in large sets of complex data. The next chapter describes in detail a third and last algorithm aimed at tackling this hard problem.

References

1. Agrawal, R., Gehrke, J., Gunopulos, D., Raghavan, P.: Automatic subspace clustering of high dimensional data for data mining applications. SIGMOD Rec. 27(2), 94–105 (1998). http://doi.acm.org/10.1145/276305.276314
2. Agrawal, R., Gehrke, J., Gunopulos, D., Raghavan, P.: Automatic subspace clustering of high dimensional data. Data Min. Knowl. Discov. 11(1), 5–33 (2005). http://dx.doi.org/10.1007/s10618-005-1396-1
3. Cordeiro, R.L.F., Traina, A.J.M., Faloutsos, C., Traina Jr., C.: Finding clusters in subspaces of very large, multi-dimensional datasets. In: F. Li, M.M. Moro, S. Ghandeharizadeh, J.R. Haritsa, G. Weikum, M.J. Carey, F. Casati, E.Y. Chang, I. Manolescu, S. Mehrotra, U. Dayal, V.J. Tsotras (eds.) ICDE, pp. 625–636. IEEE (2010).
4. Cordeiro, R.L.F., Traina, A.J.M., Faloutsos, C., Traina Jr., C.: Halite: Fast and scalable multi-resolution local-correlation clustering. IEEE Transactions on Knowledge and Data Engineering 99(PrePrints) (2011). http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.176. 16 pages
5. Cordeiro, R.L.F., Traina Jr., C., Traina, A.J.M., López, J., Kang, U., Faloutsos, C.: Clustering very large multi-dimensional datasets with mapreduce. In: C. Apté, J. Ghosh, P. Smyth (eds.) KDD, pp. 690–698. ACM (2011).
6. Fayyad, U.: A data miner's story - getting to know the grand challenges. In: Invited Innovation Talk, KDD (2007). Slide 61. Available at: http://videolectures.net/kdd07_fayyad_dms/
7. Moise, G., Sander, J.: Finding non-redundant, statistically significant regions in high dimensional data: a novel approach to projected and subspace clustering. In: KDD, pp. 533–541 (2008).
8. Moise, G., Sander, J., Ester, M.: P3C: A robust projected clustering algorithm. In: ICDM, pp. 414–425. IEEE Computer Society (2006).
9. Moise, G., Sander, J., Ester, M.: Robust projected clustering. Knowl. Inf. Syst. 14(3), 273–298 (2008). http://dx.doi.org/10.1007/s10115-007-0090-6
10. Yiu, M.L., Mamoulis, N.: Iterative projected clustering by subspace mining. TKDE 17(2), 176–189 (2005). doi:10.1109/TKDE.2005.29

Chapter 6
QMAS

Abstract This chapter describes a work that uses the background knowledge of the clustering algorithms previously presented in the book to focus on two distinct data mining tasks: the tasks of labeling and summarizing large sets of complex data. Given a large collection of complex objects, very few of which have labels, how can we guess the labels of the remaining majority, and how can we spot those objects that may need brand new labels, different from the existing ones? The work presented here provides answers to these questions. Specifically, this chapter describes in detail QMAS [2], a third algorithm that focuses on data mining in large sets of complex data, which is a fast and scalable solution to the problem of automatically analyzing, labeling and understanding this kind of data.

Keywords Complex data · Low-labor labeling · Summarization · Attention routing · Correlation clustering · Random walks with restarts · Satellite imagery analysis

6.1 Introduction

The problem of automatically analyzing, labeling and understanding large collections of complex objects appears in numerous fields. One example application refers to satellite imagery, involving a scenario in which a topographer wants to analyze the terrains in a collection of satellite images. Let us assume that each image is divided into tiles (say, 16 × 16 pixels). Such a user would like to label a small number of tiles ('Water', 'Concrete', etc.), and then the ideal system would automatically find labels for all the rest.
The user would also like to know what strange pieces of land exist in the analyzed regions, since they may indicate anomalies (e.g., de-forested areas, potential environmental hazards, etc.), or errors in the data collection. Finally, the user would like to have a few tiles that best represent each kind of terrain.

R. L. F. Cordeiro et al., Data Mining in Large Sets of Complex Data, SpringerBriefs in Computer Science, DOI: 10.1007/978-1-4471-4890-6_6, © The Author(s) 2013

Fig. 6.1 One example satellite image of Annapolis (MD, USA), divided into 1,024 (32 × 32) tiles, only 4 of which are labeled with keywords, "City" (red), "Water" (cyan), "Urban trees" (green) or "Forest" (black). IKONOS/GeoEye-1 Satellite image courtesy of GeoEye

Figure 6.1 illustrates the problem on the example application of satellite images. It shows an example image from the city of Annapolis, MD, USA, decomposed into 1,024 (32 × 32) tiles, very few (only four) of which were manually labeled as "City" (red), "Water" (cyan), "Urban Trees" (green) or "Forest" (black). With this input set, we want a system that is able to automatically assign the most appropriate labels to the unlabeled tiles, and provide a summarized description of the data by finding clusters of tiles, the NR best representatives for the data patterns, and the top-NO outlier tiles. Similar requirements appear in several other settings that involve distinct types of complex data, such as social networks and medical or biological image applications. In a social network, one user wants to find other users that share similar interests with himself/herself or with his/her contacts, while the network administrator wants to spot a few example users that best represent both the most typical and the most strange types of users.
In medicine, physicians want to find tomographies or x-rays similar to their patients' images, as well as a few examples that best represent both the most typical and the most strange image patterns. In biology, given a collection of fly embryos [10] or protein localization patterns [7] or cat retina images [1] and their labels, biologists want a system to answer similar questions.

The goals of this chapter are summarized in two research problems:

Problem 6.1 low-labor labeling (LLL) – Given an input set I = {I1, I2, I3, ..., I_NI} of NI complex objects, very few of which are labeled with keywords, find the most appropriate labels for the remaining ones.

Problem 6.2 mining and attention routing – Given an input set I = {I1, I2, I3, ..., I_NI} of NI partially labeled complex objects, find clusters, the NR objects from I that best represent the data patterns, and the top-NO outlier objects.

This chapter describes in detail QMAS: Querying, Mining And Summarizing Multi-dimensional Databases [2]. The method is a fast (O(N)) solution to the aforementioned problems. Its main contributions, supported by experiments on real satellite images spanning up to more than 2.25 GB, are summarized as follows:

1. Speed: QMAS is fast and it scales linearly on the database size, being up to 40 times faster than top related works on the same subject;
2. Quality: It can do low-labor labeling (LLL), providing results with better or equal quality when compared to top related works;
3. Non-labor intensive: It works even when it is given very few labels - it can still extrapolate from tiny sets of pre-labeled data;
4. Functionality: Contrasting with other methods, QMAS encompasses extra mining tasks such as clustering, outlier and representative detection, as well as summarization.
It also spots data objects that potentially require new labels.

6.2 Presented Method

This section describes QMAS, a fast and scalable solution to the problems of low-labor labeling (LLL) (Problem 6.1) and mining and attention routing (Problem 6.2). It assumes that feature extraction is first applied over the input set of complex objects I, turning the set into a multi-dimensional dataset. Next, we detail the method.

6.2.1 Mining and Attention Routing

In this section we describe the solution to the problem of mining and attention routing (Problem 6.2). The general idea is as follows: First, QMAS does clustering on the input set of complex objects I; then, it finds (a) the subset of objects R = {R1, R2, R3, ..., R_NR} ⊆ I that best represent I, and (b) the array with the top-NO outlier objects O = (O1, O2, O3, ..., O_NO), Oo ∈ I, ∀ 1 ≤ o ≤ NO, sorted according to the confidence degree of each object Oo being an outlier. Algorithm 10 provides a general view of the solution to Problem 6.2. The details are given as follows.

Algorithm 10: QMAS-mining.
Input: input set of complex objects I; number of representatives NR; number of top outliers NO.
Output: clustering result C; set of representatives R = {R1, R2, R3, ..., R_NR} ⊆ I; top-NO outliers O = (O1, O2, O3, ..., O_NO), Oo ∈ I, ∀ 1 ≤ o ≤ NO, in sorted order.
1: do clustering on I, let the result be C;
2: R = random NR complex objects from I;
3: error = E_QMAS(I, R); // from Equation 6.2
4: repeat
5:   improve the representatives in R;
6:   old_error = error;
7:   error = E_QMAS(I, R); // from Equation 6.2
8: until error == old_error
9: O = the NO complex objects from I worst represented by R, sorted according to the confidence degree of each object Oo in O being an outlier;
10: return C, R and O;

6.2.1.1 Clustering

The clustering step over the input set of objects I is performed using the algorithm Halite [3, 4]. As described in Chap.
4, Halite is a fast and scalable clustering algorithm well-suited to spot clusters in large collections of complex data. QMAS takes advantage of the soft clustering process of Halite to allow a single object to belong to one or more clusters with equal probabilities. This configuration of Halite is used by QMAS to find clusters in the objects of I.

6.2.1.2 Finding Representatives

Now we focus on the problem of selecting a set R = {R1, R2, R3, ..., R_NR} ⊆ I of objects with cardinality NR = |R| to represent the input set of complex objects I. First, we discuss the desirable properties for a set of representatives, then we describe two possible approaches to actually find the representatives. An appropriate set of representatives R for the objects in I must have the following property: there is a large similarity between every object Ii ∈ I and its most similar representative Rr. Obviously, the set of representatives that best represents I is the full set of complex objects, NR = NI ⇒ R = I. In this case, the similarity is maximal between each object Ii ∈ I and its most similar representative Rr, which is the complex object itself, Ii = Rr. However, for NR < NI, how should one evaluate the quality of a given set of representatives?

A simple way to evaluate the quality of a collection of representatives is to sum the squared distances between each object Ii and its closest representative Rr. This gives us an error function that should be minimized in order to achieve the best set of representatives R for the input set of complex objects I. It is not a coincidence that this is the same error function minimized by the classic clustering algorithm K-Means [8, 9, 11], which is formally defined by the following equation.

E_KM(I, R) = Σ_{Ii ∈ I} MIN{ ‖Ii − Rr‖² | Rr ∈ R }   (6.1)

In the equation, ‖Ii − Rr‖ is the distance between the objects Ii and Rr, and MIN is a function that returns the minimum value within its input set of values.
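Equation 6.1 is straightforward to compute; a minimal sketch (the helper names are ours, points and representatives are plain coordinate tuples, and the distance is Euclidean as the text assumes):

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two coordinate tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def e_km(points, reps):
    """Eq. 6.1: sum, over all objects, of the squared distance to the
    closest representative (the K-Means error function)."""
    return sum(min(sq_dist(p, r) for r in reps) for p in points)
```

For example, `e_km([(0.0, 0.0), (1.0, 0.0), (4.0, 0.0)], [(0.0, 0.0), (3.0, 0.0)])` sums the per-object minimums 0, 1, and 1, giving 2.0.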
Without loss of generality, the Euclidean distance L2 is considered here. Based on this idea, when we ask K-Means for NR clusters, the centroids of the clusters are good indicators of the data space positions where we should look for representatives. Then, we obtain a set of representatives for K-Means by: (1) finding, for each centroid, the data object Ii from I that is the closest one to the respective centroid, and (2) defining R to be the complete set of objects found. Figure 6.2a shows a synthetic dataset containing three clusters. The clusters and their sizes follow skewed distributions. The sizes are 30,000, 3,000 and 1,000 for the clusters in the bottom left, bottom right and top of the data space respectively. 500 points are uniformly distributed through the data space to represent noise. Figure 6.2b shows the representatives selected for the toy dataset by using K-Means and considering NR as 10 (top) and 20 (bottom). The results presented are the best ones over 50 runs, i.e., the ones with the smallest error computed by Eq. 6.1. Notice that, in all cases, the representatives selected are excessively concentrated in the bottom left cluster, the biggest one, while the other two clusters are poorly represented, having only a few representatives each. These results indicate that K-Means is sensitive to the data distribution, commonly presenting unsatisfactory results for representative picking, especially for skewed data distributions. QMAS proposes to use the traditional, well-known K-Harmonic Means clustering algorithm [12] for representative picking, since it is almost insensitive to skewed distributions, data imbalance, and bad seed initialization. Thus, it is a robust way to look for representatives, again by asking for NR clusters and, for each cluster, picking as a representative the object Ii from I that is the closest object to the respective cluster's centroid. The minimization error function is presented as follows.
E_QMAS(I, R) = Σ_{Ii ∈ I} HAR{ ‖Ii − Rr‖² | Rr ∈ R } = Σ_{Ii ∈ I} ( NR / Σ_{Rr ∈ R} (1 / ‖Ii − Rr‖²) )   (6.2)

In the equation, ‖Ii − Rr‖ is the distance between the data objects Ii and Rr, and HAR is a function that returns the harmonic mean of its input values. The Euclidean distance L2 is used once more, without loss of generality. Figure 6.2c shows the representatives selected by QMAS for the toy dataset, again considering NR as 10 (top) and 20 (bottom). Once more, the results presented are the best ones over 50 runs, this time considering the error function in Eq. 6.2. Notice that the representatives chosen are now well distributed among the three clusters, providing to the user a summary that better describes the data patterns.

6.2.1.3 Finding the Top-NO Outliers

The final task related to the problem of mining and attention routing is to find the top-NO outliers O = (O1, O2, O3, ..., O_NO), Oo ∈ I, ∀ 1 ≤ o ≤ NO, for the input set of complex objects I. In other words, O contains the NO objects of I that diverge the most from the main data patterns. The outliers must be sorted in such a way that we identify the top 1st outlier, the top 2nd outlier and so on, according to the confidence degree of each one being an outlier. To achieve this goal, QMAS takes the representatives found in the previous section as a base for the outlier definition. Assuming that a set of representatives R is a good summary for I, the NO objects from I that are the worst represented by R are said to be the top-NO outliers. Consider again the error function in Eq. 6.2. Notice that the minimized error is the summation of the individual errors for each object Ii ∈ I, where the individual error for Ii is given by the following equation.

IE_QMAS(Ii, R) = NR / Σ_{Rr ∈ R} (1 / ‖Ii − Rr‖²)   (6.3)

This equation is the harmonic mean of the squared distances between one object Ii and each one of the representative objects in R.
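Equations 6.2 and 6.3, and the outlier ranking they induce, can be sketched as follows (plain Python with our own helper names; the convention that an object coinciding with a representative gets zero error is an assumption of this sketch, not something the book states):

```python
def _sq_dist(a, b):
    """Squared Euclidean distance between two coordinate tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def ie_qmas(obj, reps):
    """Eq. 6.3: harmonic mean of the squared distances between one
    object and every representative (the object's individual error).
    Assumption (ours): an object coinciding with a representative has
    zero error, avoiding a division by zero."""
    dists = [_sq_dist(obj, r) for r in reps]
    if any(d == 0.0 for d in dists):
        return 0.0
    return len(reps) / sum(1.0 / d for d in dists)

def e_qmas(points, reps):
    """Eq. 6.2: the total error is the sum of the individual errors."""
    return sum(ie_qmas(p, reps) for p in points)

def top_outliers(points, reps, n_o):
    """Top-N_O outliers: the objects worst represented by R, i.e. those
    with the largest individual error, in worst-first order."""
    return sorted(points, key=lambda p: ie_qmas(p, reps), reverse=True)[:n_o]
```

Note how the harmonic mean penalizes an object only when it is far from every representative: a single nearby representative drives the individual error down, which is exactly the behavior the outlier ranking relies on.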
The object Ii ∈ I with the greatest individual error is the one that is worst represented by R, which is the object considered to be the top 1st outlier of I. The top 2nd outlier is the object with the second greatest individual error, and so on. In this way, QMAS defines the array O containing the top-NO outliers, in sorted order. Figure 6.3 shows the top-10 outliers that QMAS found for the example dataset in Fig. 6.2a, considering NO = 10 and NR = 10. As we can see, the top outliers are actually the most extreme cases for this data.

Fig. 6.3 Top-10 outliers for the toy dataset in Fig. 6.2a, using the QMAS representatives from Fig. 6.2c (top). As we can see, the top outliers are actually the most extreme cases for this data

6.2.2 Low-Labor Labeling (LLL)

In this section we discuss the solution to the problem of low-labor labeling (LLL) (Problem 6.1). That is, given the input set I = {I1, I2, I3, ..., I_NI} of NI complex objects, very few of which are labeled with keywords, how to find the most appropriate labels for the remaining ones. In order to tackle this problem, QMAS first represents the input complex objects and labels as a graph G, named the Knowledge Graph. Then, random walks with restarts over G allow QMAS to find the most suitable labels for each unlabeled object. Algorithm 11 provides a general view of the solution to Problem 6.1. The details are given as follows. G is a tri-partite graph composed of a set of vertexes V and a set of edges X, i.e., G = (V, X). To build the graph, the input sets of complex objects I and known labels L are used, as well as the clustering results obtained in Sect. 6.2.1. V consists of one vertex for each data object, for each cluster, and for each label. The edges link complex objects to their respective clusters and labels. Let V(Ii) and V(Ll) represent the vertexes of G related to object Ii and to label Ll respectively.
Provided the clustering results for the objects in I, the process of building G is simple, having linear time and memory complexities on the number of objects, labels and clusters. Figure 6.4 shows the Knowledge Graph G for a small example dataset with seven complex objects, two labels, and three clusters. Data objects, labels, and clusters are represented by nodes with the shape of squares, circles, and triangles, respectively. The graph indicates, for example, that cluster C1 contains the objects I1, I2, and I3. Object I3 also belongs to cluster C2 in this setting. In addition, the graph shows that object I1 has the known label L1, while the objects I4 and I7 have the known label L2.

Algorithm 11: QMAS-labeling.
Input: input collection of complex objects I; collection of known labels L; restart probability c; clustering result C. // from Algorithm 10
Output: full set of labels LF.
1: use I, L and C to build the Knowledge Graph G;
2: for each unlabeled object Ii ∈ I do
3:   do random walks with restarts in G, using c, always restarting the walk from vertex V(Ii);
4:   compute the affinity between each label in the collection L and the object Ii. Let Ll be the label with the biggest affinity to Ii;
5:   set in LF: Ll is the appropriate label for object Ii;
6: end for
7: return LF;

Fig. 6.4 The knowledge graph G for a toy dataset. Squares, circles, and triangles represent data objects, labels, and clusters respectively. The edges link objects to their corresponding clusters and known labels

In order to look for the most suitable label for an unlabeled complex object Ii, QMAS uses random walks with restarts over graph G. This process works as follows: a random walker starts from vertex V(Ii).
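The walk is detailed next; as a concrete companion, here is a hedged, self-contained simulation of the whole scoring step. The adjacency below is a guess at the toy graph of Fig. 6.4, reconstructed from the memberships stated in the text (the full membership of C2 and C3 is our invention); the walker picks neighbors uniformly for simplicity, and the steady state is estimated by sampling rather than by the power iteration method the book's implementation uses.

```python
import random

# Hypothetical reconstruction of Fig. 6.4's tri-partite toy graph:
# objects I1..I7, clusters C1..C3, labels L1..L2, as adjacency lists.
EDGES = {
    "I1": ["C1", "L1"], "I2": ["C1"], "I3": ["C1", "C2"],
    "I4": ["C2", "L2"], "I5": ["C3"], "I6": ["C3"], "I7": ["C3", "L2"],
    "C1": ["I1", "I2", "I3"], "C2": ["I3", "I4"], "C3": ["I5", "I6", "I7"],
    "L1": ["I1"], "L2": ["I4", "I7"],
}

def rwr_scores(start, c=0.15, steps=200_000, seed=0):
    """Estimate the steady-state probability of each vertex for a walk
    that restarts from `start` with probability c at every step."""
    rng = random.Random(seed)
    visits = {v: 0 for v in EDGES}
    node = start
    for _ in range(steps):
        if rng.random() < c:
            node = start                    # restart
        else:
            node = rng.choice(EDGES[node])  # hop to a random neighbor
        visits[node] += 1
    return {v: n / steps for v, n in visits.items()}

def best_label(obj, labels=("L1", "L2")):
    """The label with the largest affinity (steady-state probability)."""
    scores = rwr_scores(obj)
    return max(labels, key=lambda l: scores[l])
```

With this (invented) adjacency, `best_label("I2")` returns "L1", matching the discussion of Fig. 6.4: L1 is reachable from I2 by a much shorter path than L2, so the restarting walker visits it far more often.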
At each step, the walker either goes back to the initial vertex V(Ii), with probability c, or moves to a randomly chosen vertex that shares an edge with the current vertex, with probability 1 − c. The value of c is user-defined, and may be determined by cross validation. The probability of choosing a neighboring vertex is inversely proportional to the degree of the current vertex, i.e., the walker favors smaller clusters and more specific labels in this process. The affinity between Ii and a label Ll is given by the steady-state probability that the random walker will find itself at vertex V(Ll), always restarting from V(Ii). Finally, the label Ll with the largest affinity with object Ii is taken as the most suitable label for Ii.

The intuition behind this procedure is that the steady-state probability that a random walker will find itself in vertex V(Ll), always restarting the walk from vertex V(Ii), is a way to measure the closeness between Ii and Ll. If the computed probability is high, the vertexes are probably linked by short paths. On the other hand, if the probability is low, it is likely that no short path links them. This idea can be better understood through the example in Fig. 6.4. Let us assume that we want to find the most appropriate label for object I2. There is a high probability that a random walker will reach V(L1), always restarting the walk from V(I2), mainly because there exists a three-step path linking V(I2) to V(L1). On the other hand, there is a lower probability that the walker will find itself at V(L2), always restarting the walk from V(I2), especially because the shortest path between V(I2) and V(L2) has seven steps. This leads us to conclude that the most appropriate label for I2 is L1.

6.3 Experimental Results

This section discusses the experiments performed to test the QMAS algorithm. Large collections of satellite images were analyzed to validate the method. First, results on the initial example from the introductory Sect.
6.1 are reported. Then, we discuss the experiments performed to support the contributions of QMAS stated in that section, regarding its Speed, Quality, Non-labor intensive capability, and Functionality. Three real satellite image sets were analyzed. They are described as follows:

• GeoEye¹—this public dataset contains 14 high-quality satellite images in the jpeg format extracted from famous cities around the world, such as the city of Annapolis (MD, USA), illustrated in Fig. 6.1. The total data size is about 17 MB. Each image was divided into equal-sized rectangular tiles and the entire dataset contains 14,336 tiles, from which Haar wavelets features in 2 resolution levels were extracted, plus the mean value of each band of the tiles;
• SAT1.5GB—this proprietary dataset has 3 large satellite images of around 500 MB each in the GeoTIFF lossless data format. The total data size is about 1.5 GB. Each image was divided into equal-sized rectangular tiles. The 3 images combined form a set of 721,408 tiles, from which Haar wavelets features in 2 resolution levels were extracted, plus the mean value of each band of the tiles;
• SATLARGE—this proprietary dataset contains a pan QuickBird image of size 1.8 GB, and its matching 4-band multispectral image of size 450 MB. These images were combined and 2,570,055 hexagonal tiles generated, from which mean, variance, moments and GBT texture features [5] were extracted. The final feature set of a tile comprises a 30-dimensional vector.

¹ The data is publicly available at: 'geoeye.com'.

The experimental environment is a server with Fedora Core 7 (Red Hat, Inc.), a 2.8 GHz core and 4 GB of RAM. QMAS was compared with one of its best competitors: the GCap method that we described in Sect. 2.4. GCap was implemented in two versions with different nearest-neighbor finding algorithms: one version uses the basic quadratic algorithm (GCap) and another version spots approximate nearest
neighbors (GCap-ANN), using the ANN Library.² The number of nearest neighbors is set to seven in all experiments. All three approaches share the same implementation of the Random Walks with Restarts algorithm using the power iteration method [6], with the restart parameter set as c = 0.15.

6.3.1 Results on the Initial Example

This section discusses the results obtained for the example satellite image from Fig. 6.1, presented in the introductory Sect. 6.1. The image, also shown in Fig. 6.5a, refers to the city of Annapolis, MD, USA. As in the introductory example, it was decomposed into 1,024 (32 × 32) tiles, only four of which were manually labeled as "City" (red), "Water" (cyan), "Urban Trees" (green) or "Forest" (black). From each tile Haar wavelets features in 2 resolution levels were extracted, plus the mean value of each band of the tile. Figure 6.5b shows the solution proposed by QMAS to the problem of low-labor labeling (LLL) (Problem 6.1) on the example satellite image. Two remarks: (a) the vast majority of the tiles were correctly labeled, and (b) there are a few outlier tiles marked in yellow that QMAS judges as too different from the labeled ones (i.e., there is no path linking the image and one label in the Knowledge Graph), and thus are returned to the user as outliers that potentially deserve a new label of their own. Closer inspection shows that the outlier tiles tend to be on the border of, say, "Water" and "City" (because they contain a bridge). The solution to the problem of mining and attention routing (Problem 6.2) on the example image is presented in Fig. 6.5c and d. QMAS pointed out the 3 tiles that best represent the data patterns and the top-2 outliers. Notice that the representatives actually cover the 3 major keywords ("City", "Urban Trees", and "Water"), while the top outliers are hybrid tiles, like the bottom right which is a bridge (both "Water" and "City").
Note that QMAS goes even further by summarizing the results: besides representatives and top outliers, QMAS found data clusters, ignoring the user-provided labels. This has two advantages. The first is that it indicates to the user what, if any, changes have to be done to the labels: new labels may need to be created (to handle some clusters or outliers), and/or labels may need to be merged (e.g., "Forest" and "Urban trees"), and/or labels that are too general may need to be divided in two or more ("Shallow Water" and "Deep Sea", instead of just "Water"). The second advantage is that these results can also be used for group labeling, since the user can decide to assign labels to entire clusters rather than labeling tiles one at a time.

² http://www.cs.umd.edu/~mount/ANN/

Fig. 6.5 Solution to the problems of low-labor labeling and mining and attention routing on an example satellite image. Top left: the input image of Annapolis (MD, USA), divided into 1,024 (32 × 32) tiles, only 4 of which are labeled with keywords ("City" in red, etc). Top right: the labels that QMAS proposes; yellow indicates outliers. Bottom left: the 3 tiles that best represent the data, which actually cover the 3 major keywords. Bottom right: the top-2 outlier tiles, where appropriate labels do not exist (hybrid tiles, like the bottom right which is a bridge = both "Water" and "City"). IKONOS/GeoEye-1 Satellite image courtesy of GeoEye

6.3.2 Speed

This section discusses experiments that support the following claim: QMAS is a fast solution to the problems investigated, scaling linearly on the data size, and being several times faster than top related works. Figure 6.6 shows how the tested methods scale with increasing data sizes. Random samples from the SAT1.5GB dataset were used.
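The scalability comparison that follows rests on reading slopes off log–log timing curves; the fit itself is a one-liner worth making explicit. A minimal sketch with made-up sizes and timings (not measurements from the chapter):

```python
import math

def loglog_slope(sizes, times):
    """Least-squares slope of log(time) versus log(size); a slope of
    about 1 indicates linear scale-up, about 2 quadratic, and so on."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(t) for t in times]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Made-up sample sizes (number of tiles) and timings, for illustration only.
sizes = [1e4, 1e5, 1e6]
linear_times = [2.0 * s for s in sizes]            # slope 1 in log-log space
quadratic_times = [1e-4 * s ** 2 for s in sizes]   # slope 2 in log-log space
```

Fitting the slope, rather than eyeballing the plot, is how one substantiates claims like "slope equal to one" versus "2.1" and "1.5" for the competitors.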
As can be seen, the log–log curve for QMAS has a slope equal to one, so QMAS scales linearly with the input data size, while the slopes of the log–log curves are 2.1 and 1.5 for GCap and GCap-ANN, respectively. For the full SAT1.5GB dataset, QMAS is 40 times faster than GCap-ANN, while running GCap would take hours (not shown in the figure).

Fig. 6.6 Time versus number of tiles for random samples of the SAT1.5GB dataset. QMAS: red circles; GCap: blue crosses; GCap-ANN: green diamonds. Wall-clock time results are averaged over 10 runs; error bars are too small to be shown

Notice one important remark: as stated in Sect. 2.4, most previous works, including GCap, search for nearest neighbors in the feature space. This operation is super-linear even with the use of approximate nearest-neighbor finding algorithms. On the other hand, QMAS avoids the nearest-neighbor searches by using clusters to link similar image nodes in the Knowledge Graph. This allows QMAS to scale linearly on the data size, being up to 40 times faster than the top competitors.

6.3.3 Quality and Non-labor Intensive

This section discusses experiments that support the ability of QMAS to return high-quality results and to be non-labor intensive. For the experiments, 256 tiles in the SAT1.5GB dataset were labeled via manual curation. A few ground-truth labels were randomly selected from each class as the input labels and the remaining ones were used for the quality test. Figure 6.7 illustrates the labeling accuracy for the GCap-ANN and QMAS approaches in box plots obtained from 10 repeated runs. As can be seen, QMAS does not sacrifice quality for speed compared with GCap-ANN, and it performs even better when the pre-labeled data size is limited.
Note that the accuracy of QMAS is barely affected by the number of pre-labeled examples in each label class once more than 2 examples per class are given, while the quality of GCap-ANN was considerably worse with small sets of pre-labeled examples. The fact that QMAS can still extrapolate from tiny sets of pre-labeled data ensures its non-labor-intensive capability.

Fig. 6.7 Comparison of approaches in box plots: quality versus size of the pre-labeled data. Top left is the ideal point. QMAS: red circles; GCap-ANN: green diamonds. Accuracy values of QMAS are barely affected by the size of the pre-labeled data. Results are obtained over 10 runs.

6.3.4 Functionality

This section discusses experiments that support the following claim: in contrast to the related works, QMAS includes other mining tasks such as clustering, detection of top outliers and representatives, besides summarization. In other words, QMAS tackles both the problem of low-labor labeling (LLL) (Problem 6.1) and the problem of mining and attention routing (Problem 6.2), while the related works address only the former. To evaluate this claim, the functionality of QMAS was analyzed regarding its ability to spot clusters, representatives and top outliers. The GeoEye dataset was used in all experiments of this section.

Figure 6.8 shows some screenshots of the clustering results obtained with QMAS. Yellow tiles represent outliers. Closer inspection shows that these outlier tiles tend to be on the border of areas like “Water” and “City” (because they contain a bridge). The remaining tiles are colored according to their clusters. As expected, a few tiles belong to more than one cluster, since QMAS does soft clustering. These tiles were colored according to their first assigned clusters. Notice that the clustering results reported indeed represent the main patterns apparent in the analyzed images.

Fig. 6.8 Clustering results provided by QMAS for the GeoEye dataset. Top: the real satellite images; bottom: the corresponding results, shown by coloring each tile after its cluster. Yellow tiles represent outliers. Notice that the clusters actually represent the main data patterns. IKONOS/GeoEye-1 satellite image courtesy of GeoEye.

Figure 6.9 illustrates the results obtained by QMAS for representatives (left) and top outliers (right). NR = 6 representatives are shown, colored according to their clusters. Note that these few representatives cover the main clusters previously presented in Fig. 6.8. Also, these 6 representatives were used as a basis for the detection of the top-3 outliers shown in the figure. The outlier tiles tend to be on the border of areas like “Water” and “City” (because they contain a bridge). By comparing these results with the clusters shown in Fig. 6.8, one can easily notice that the 3 outliers spotted, together with the 6 representatives found (only 9 tiles in total), properly summarize the GeoEye dataset, which has more than 14,000 tiles.

Fig. 6.9 Left: NR = 6 representatives found by QMAS for the GeoEye dataset, colored after their clusters. Right: top-3 outliers for the same dataset, found using the representatives shown. Notice that the 3 outliers together with the 6 representatives found, only 9 tiles in total, nicely summarize the GeoEye dataset, which contains more than 14 thousand tiles. IKONOS/GeoEye-1 satellite image courtesy of GeoEye.

6.3.5 Experiments on the SATLARGE Dataset

Here we discuss results for the SATLARGE dataset, related to query-by-example experiments; i.e., given a small set of tiles (examples), manually labeled with one keyword, query the unlabeled tiles to find the ones most likely related to that keyword. Figure 6.10 illustrates the results obtained for several categories (“Water”, “Houses”, “Trees”, “Docks”, “Boats” and “Roads”) to show that QMAS returns high-quality results, being almost insensitive to the kind of tile given as input. Also, notice in the experiments related to “Docks” and “Boats” that the results are correct even for tiny sets of pre-labeled data.

Fig. 6.10 Examples with “Water”, “Houses”, “Trees”, “Docks”, “Boats” and “Roads”: labeled data and the results of queries aimed at spotting other tiles of these types. IKONOS/GeoEye-1 satellite image courtesy of GeoEye.

The number of examples provided varies from as many as ∼50 examples to as few as two examples. Varying the amount of labeled data allows one to observe how the system responds to these changes. In general, labeling only a small number of examples (even fewer than five) still leads to pretty accurate results. Finally, notice that correct results often look very different from the given examples, i.e., QMAS is able to extrapolate from the given examples to other, correct tiles that do not have significant resemblance to the pre-labeled set.

6.4 Conclusions

In this chapter we described QMAS: Querying, Mining And Summarizing Multi-dimensional Databases, a third algorithm that focuses on data mining in large sets of complex data. Specifically, QMAS is a fast solution to the problem of automatically analyzing, labeling and understanding this kind of data. Its main contributions, supported by experiments on real satellite images spanning up to 2.25 GB, are presented as follows:

1. Speed: QMAS is a fast solution to the problems presented, and it scales linearly on the database size. It is up to 40 times faster than the top related work (GCap) on the same subject;
2. Quality: It can do low-labor labeling (LLL), providing results with accuracy better than or equal to the accuracy of the related works;
3. Non-labor intensive: It works even when it is given very few labels; it can still extrapolate from tiny sets of pre-labeled data;
4.
Functionality: In contrast to the other methods, QMAS spots data objects that potentially require new labels, and encompasses other mining tasks such as clustering, outlier and representatives detection, as well as summarization.

The next chapter presents the conclusions of this book.

Chapter 7 Conclusion

Abstract This book was motivated by the increasing amount and complexity of the data collected by digital systems in several areas, which makes knowledge discovery an essential step in businesses’ strategic decisions. The mining techniques used in the process usually have high computational costs and force the analyst to make complex choices. The complexity stems from the diversity of tasks that may be used in the analysis and from the large number of alternatives to execute each task. The most common data mining tasks include data classification, labeling and clustering, outlier detection and missing data prediction. The large computational cost comes from the need to explore several alternative solutions, in different combinations, to obtain the desired information. Although the same tasks applied to traditional data are also necessary for more complex data, such as images, graphs, audio and long texts, the complexity and the computational costs associated with handling large amounts of these complex data increase considerably, making the traditional techniques impractical. Therefore, special data mining techniques for this kind of data need to be developed.
We discussed new data mining techniques for large sets of complex data, especially for the clustering task, tightly associated with other mining tasks performed together. Specifically, this book described in detail three novel data mining algorithms well suited to analyze large sets of complex data: the method Halite for correlation clustering [11, 13]; the method BoW for clustering Terabyte-scale datasets [14]; and the method QMAS for labeling and summarization [12].

Keywords Big data · Complex data · Correlation clustering · Low-labor labeling · Summarization · Attention routing · Linear or quasi-linear complexity · Terabyte-scale data analysis

R. L. F. Cordeiro et al., Data Mining in Large Sets of Complex Data, SpringerBriefs in Computer Science, DOI: 10.1007/978-1-4471-4890-6_7, © The Author(s) 2013

7.1 Main Contributions

Three data mining techniques were described in detail in this book. These techniques were evaluated on real, very large datasets with up to billions of complex elements, and they always presented highly accurate results, being at least one order of magnitude faster than the fastest related works in almost all cases. The real-life datasets used come from the following applications: automatic breast cancer diagnosis, satellite imagery analysis, and graph mining on a large web graph crawled by Yahoo!1 and also on the graph with all users and their connections from the Twitter2 social network. The three techniques are briefly discussed as follows.

1. The Method Halite for Correlation Clustering: the algorithm Halite [11, 13] is a fast and scalable density-based clustering algorithm for data of medium dimensionality, able to analyze large collections of complex data elements. It creates a multi-dimensional grid all over the data space and counts the number of points lying in each hyper-cubic cell provided by the grid. A hyper-quad-tree-like structure, called the Counting-tree, is used to store the counts.
The tree is thereafter submitted to a filtering process able to identify regions that are, in a statistical sense, denser than their neighboring regions regarding at least one dimension, which leads to the final clustering result. Halite is fast and has linear or quasi-linear time and space complexity on both data size and dimensionality.

2. The Method BoW for Clustering Terabyte-scale Data: the method BoW [14] focuses on clustering Terabytes of moderate-to-high dimensionality data, such as features extracted from billions of complex data elements. In these cases, a serial processing strategy is usually impractical. Just to read a single Terabyte of data (at 5 GB/min on a single modern eSATA disk) takes more than 3 hours. BoW explores parallelism through MapReduce and can treat as a plug-in almost any of the serial clustering methods, including the algorithm Halite. The major research challenges addressed are (a) how to minimize the I/O cost, taking into account the already existing data partition (e.g., on disks), and (b) how to minimize the network cost among processing nodes. Either of them may be the bottleneck. BoW automatically spots the bottleneck and chooses a good strategy; one of its strategies uses a novel sampling-and-ignore idea to reduce the network traffic.

3. The Method QMAS for Labeling and Summarization: QMAS [12] is a fast and scalable solution to the following problems: (a) Low-labor labeling: given a large set of complex objects, very few of which are labeled with keywords, find the most suitable labels for the remaining ones; and (b) Mining and attention routing: in the same setting, find clusters, the top-NO outlier objects, and the top-NR representative objects. The algorithm is fast and it scales linearly with the data size, besides working even with tiny initial label sets.
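As a concrete, heavily simplified illustration of the grid counting behind Halite's Counting-tree (item 1 above), the sketch below stores per-cell point counts as key/value pairs, in the spirit of the table-based storage of Sect. 4.4: the key is the cell's absolute position (level plus integer coordinates), the value its count n. The function name, the unit-hypercube normalization, and the omission of the half-space counts P[] and the usedCell flags are simplifications for illustration, not the actual Halite implementation.

```python
from collections import defaultdict

def grid_counts(points, levels):
    """Count points per hyper-cubic cell at each resolution level.

    Key/value layout inspired by Sect. 4.4: key = (level, cell coords),
    value = number of points n in that cell. The half-space counts P[]
    and the usedCell flag are omitted for brevity. Points are assumed
    normalized to the unit hypercube [0, 1)^d.
    """
    counts = defaultdict(int)
    for p in points:
        for level in range(1, levels + 1):
            side = 2 ** level  # cells per axis double at each level
            cell = tuple(min(int(x * side), side - 1) for x in p)
            counts[(level, cell)] += 1
    return counts

# Toy 2-d dataset: two points share the lower-left level-1 cell.
pts = [(0.1, 0.2), (0.15, 0.3), (0.8, 0.9)]
tree = grid_counts(pts, levels=2)
print(tree[(1, (0, 0))])  # -> 2
```

Because each entry is an independent key/value pair, this layout maps directly onto a disk-backed store such as the Berkeley DB hash tables mentioned in Sect. 4.4, which is what lets the tree grow beyond main memory.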
1 www.yahoo.com
2 twitter.com

7.2 Discussion

In Chap. 3 we briefly described representative methods from the literature aimed at spotting clusters in moderate-to-high dimensionality data, and concluded the chapter by summarizing in Table 3.1 relevant methods with regard to the main desirable properties that any clustering technique designed to analyze such kind of data should have. Here, we reprint in Table 7.1 the same table presented in Chap. 3, this time including the new methods described in the book. Notice that Table 7.1 was inspired by one table found in [17].3

Remember that the initial analysis of the literature provided in Chap. 3 led us to one main conclusion. In spite of the several qualities found in the existing works, to the best of our knowledge, there is no method published in the literature, and designed to look for clusters in subspaces, that has any of the following desirable properties: (1) to scale linearly or quasi-linearly in terms of memory requirement and execution time with regard to increasing numbers of points and axes, besides increasing clusters’ dimensionalities; and (2) to be able to handle data of Terabyte-scale in feasible time.

One central goal of the work described in this book is to overcome these two limitations. Specifically, in Chap. 4, we focused on the former problem: linear or quasi-linear complexity. We described in detail the method Halite, a novel correlation clustering algorithm for multi-dimensional data, whose main strengths are that it is fast and it has linear or quasi-linear scalability in time and space with regard to increasing numbers of objects and axes, besides increasing clusters’ dimensionalities. Therefore, the method Halite tackles the problem of linear or quasi-linear complexity. A theoretical study on the time and space complexity of Halite, discussed in Sect. 4.3, as well as an extensive experimental evaluation performed over synthetic and real data spanning up to 1 million elements and comparing Halite with seven representative works, corroborate this claim.

In Chap. 5, we focused on the latter problem: Terabyte-scale data analysis. We described in detail the method BoW, a novel, adaptive clustering method that explores parallelism using MapReduce for clustering huge datasets. It combines (a) potentially any serial algorithm used as a plug-in and (b) makes the plug-in run efficiently in parallel, by adaptively balancing the cost for disk accesses and network accesses, which allows BoW to achieve a very good tradeoff between these two possible bottlenecks. Therefore, BoW tackles the problem of Terabyte-scale data analysis. Experiments performed on both real and synthetic data with billions of points, and using up to 1,024 cores in parallel, corroborate this claim.

Finally, notice in Table 7.1 that the use of the Halite algorithm as a plug-in for the BoW method creates one powerful tool for clustering moderate-to-high dimensionality data of large scale: this configuration has both of the desirable properties sought, linear or quasi-linear complexity and Terabyte-scale data analysis, still having most of the other properties that any clustering technique designed to analyze moderate-to-high dimensionality data should have.

3 Table 7.1 includes a summary of one table found in [17], i.e., a selection of the most relevant desirable properties and the most closely related works from the original table. Table 7.1 also includes two novel desirable properties not found in [17]: linear or quasi-linear complexity and Terabyte-scale data analysis.

Table 7.1 Properties of methods aimed at clustering moderate-to-high dimensionality data, including the new methods Halite and BoW (using Halite as plug-in). Properties compared: arbitrarily oriented clusters; not relying on the locality assumption; adaptive density threshold; independence of the order of the attributes; independence of the order of the objects; determinism; arbitrary number of clusters; overlapping clusters (soft clustering); arbitrary subspace dimensionality; avoidance of complete enumeration; robustness to noise; linear or quasi-linear complexity; Terabyte-scale data analysis. Axes-parallel clustering methods compared: CLIQUE [6, 7], ENCLUS [10], SUBCLU [18], PROCLUS [5], PreDeCon [8], P3C [19, 20], COSA [15], DOC/FASTDOC [21] and FIRES [16]. Correlation clustering methods compared: ORCLUS [3, 4], 4C [9], COPAC [2], CASH [1], Halite [11, 13] and BoW [14]. Notice that this table was inspired by one table found in [17]. na: not applicable.

In summary, the work described in this book takes steps forward from traditional data mining (especially for clustering) by considering large, complex datasets. Note that, usually, current works focus on one aspect, either data size or complexity. The work described here considers both: it enables mining complex data from high-impact applications, such as breast cancer diagnosis, region classification in satellite images, assistance to climate change forecast, and recommendation systems for the Web and social networks; the data are large, in the Terabyte-scale, not in Gigabytes as usual; and very accurate results are found in just minutes.
Thus, it provides a crucial and well-timed contribution for allowing the creation of real-time applications that deal with Big Data of high complexity, in which mining on the fly can make an immeasurable difference, like to support cancer diagnosis or deforestation detection.

References

1. Achtert, E., Böhm, C., David, J., Kröger, P., Zimek, A.: Global correlation clustering based on the hough transform. Stat. Anal. Data Min. 1, 111–127 (2008). doi:10.1002/sam.v1:3
2. Achtert, E., Böhm, C., Kriegel, H.P., Kröger, P., Zimek, A.: Robust, complete, and efficient correlation clustering. SDM, USA (2007)
3. Aggarwal, C.C., Yu, P.S.: Finding generalized projected clusters in high dimensional spaces. SIGMOD Rec. 29(2), 70–81 (2000). doi:10.1145/335191.335383
4. Aggarwal, C., Yu, P.: Redefining clustering for high-dimensional applications. IEEE TKDE 14(2), 210–225 (2002). doi:10.1109/69.991713
5. Aggarwal, C.C., Wolf, J.L., Yu, P.S., Procopiuc, C., Park, J.S.: Fast algorithms for projected clustering. SIGMOD Rec. 28(2), 61–72 (1999). doi:10.1145/304181.304188
6. Agrawal, R., Gehrke, J., Gunopulos, D., Raghavan, P.: Automatic subspace clustering of high dimensional data for data mining applications. SIGMOD Rec. 27(2), 94–105 (1998). doi:10.1145/276305.276314
7. Agrawal, R., Gehrke, J., Gunopulos, D., Raghavan, P.: Automatic subspace clustering of high dimensional data. Data Min. Knowl. Discov. 11(1), 5–33 (2005). doi:10.1007/s10618-005-1396-1
8. Böhm, C., Kailing, K., Kriegel, H.P., Kröger, P.: Density connected clustering with local subspace preferences. In: ICDM ’04: Proceedings of the Fourth IEEE International Conference on Data Mining, pp. 27–34. IEEE Computer Society, USA (2004)
9. Böhm, C., Kailing, K., Kröger, P., Zimek, A.: Computing clusters of correlation connected objects. In: SIGMOD, pp. 455–466. USA (2004). http://doi.acm.org/10.1145/1007568.1007620
10.
Cheng, C.H., Fu, A.W., Zhang, Y.: Entropy-based subspace clustering for mining numerical data. In: KDD, pp. 84–93. NY, USA (1999). http://doi.acm.org/10.1145/312129.312199
11. Cordeiro, R.L.F., Traina, A.J.M., Faloutsos, C., Traina Jr., C.: Finding clusters in subspaces of very large, multi-dimensional datasets. In: Li, F., Moro, M.M., Ghandeharizadeh, S., Haritsa, J.R., Weikum, G., Carey, M.J., Casati, F., Chang, E.Y., Manolescu, I., Mehrotra, S., Dayal, U., Tsotras, V.J. (eds.) ICDE, pp. 625–636. IEEE (2010)
12. Cordeiro, R.L.F., Guo, F., Haverkamp, D.S., Horne, J.H., Hughes, E.K., Kim, G., Traina, A.J.M., Traina Jr., C., Faloutsos, C.: QMAS: querying, mining and summarization of multi-modal databases. In: Webb, G.I., Liu, B., Zhang, C., Gunopulos, D., Wu, X. (eds.) ICDM, pp. 785–790. IEEE Computer Society (2010)
13. Cordeiro, R.L.F., Traina, A.J.M., Faloutsos, C., Traina Jr., C.: Halite: fast and scalable multi-resolution local-correlation clustering. IEEE Trans. Knowl. Data Eng. 99(PrePrints) (2011). doi:10.1109/TKDE.2011.176
14. Cordeiro, R.L.F., Traina Jr., C., Traina, A.J.M., López, J., Kang, U., Faloutsos, C.: Clustering very large multi-dimensional datasets with mapreduce. In: Apté, C., Ghosh, J., Smyth, P. (eds.) KDD, pp. 690–698. ACM (2011)
15. Friedman, J.H., Meulman, J.J.: Clustering objects on subsets of attributes (with discussion). J. Roy. Stat. Soc. Ser. B 66(4), 815–849 (2004)
16. Kriegel, H.P., Kröger, P., Renz, M., Wurst, S.: A generic framework for efficient subspace clustering of high-dimensional data. In: ICDM, pp. 250–257. Washington, USA (2005). http://dx.doi.org/10.1109/ICDM.2005.5
17. Kriegel, H.P., Kröger, P., Zimek, A.: Clustering high-dimensional data: a survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM TKDD 3(1), 1–58 (2009). doi:10.1145/1497577.1497578
18.
Kröger, P., Kriegel, H.P., Kailing, K.: Density-connected subspace clustering for high-dimensional data. SDM, USA (2004)
19. Moise, G., Sander, J., Ester, M.: P3C: a robust projected clustering algorithm. In: ICDM, pp. 414–425. IEEE Computer Society (2006)
20. Moise, G., Sander, J., Ester, M.: Robust projected clustering. Knowl. Inf. Syst. 14(3), 273–298 (2008). doi:10.1007/s10115-007-0090-6
21. Procopiuc, C.M., Jones, M., Agarwal, P.K., Murali, T.M.: A monte carlo algorithm for fast projective clustering. In: SIGMOD, pp. 418–427. USA (2002). http://doi.acm.org/10.1145/564691.564739
