P({a, b}). Furthermore, c({a}−→{b})=P({a, b}) P({a}) = P({b}) − P({a, b}) 1 − P({a}) i. Therefore, we may write c({a}−→{b}) − c({a}−→{b})=P({b}) − P({a, b}) 1 − P({a}) − P({a, b}) P({a}) = P({a})P({b}) − P({a, b}) P({a})(1 − P({a})) which is positive because P({a})P({b}) >P({a, b}). ii. We can also show that c({a}−→{b}) − s({b})=P({b}) − P({a, b}) 1 − P({a}) − P({b}) = P({a})P({b}) − P({a, b}) 1 − P({a}) is always positive because P({a})P({b}) >P({a, b}). 18. Table 6.5 shows a 2 × 2 × 2 contingency table for the binary variables A and B at diﬀerent values of the control variable C. (a) Compute the φ coeﬃcient for A and B when C =0,C =1,andC =0 or 1. Note that φ({A, B})= P (A,B)−P (A)P (B)√ P (A)P (B)(1−P (A))(1−P (B)) . Answer: i. When C =0,φ(A, B)=−1/3. ii. When C =1,φ(A, B)=1. iii. When C =0orC =1,φ =0. (b) What conclusions can you draw from the above result? Answer: The result shows that some interesting relationships may disappear if the confounding factors are not taken into account. 93 Table 6.5. A Contingency Table. A C = 0 C = 1 B B 1 1 0 0 0 5 1 15 0 15 0 0 30 15 Table 6.6. Contingency tables for Exercise 19. B BBB A 9 1 A 89 1 A 1 89 A 1 9 (a) Table I. (b) Table II. 19. Consider the contingency tables shown in Table 6.6. (a) For table I, compute support, the interest measure, and the φ correla- tion coeﬃcient for the association pattern {A, B}. Also, compute the conﬁdence of rules A → B and B → A. Answer: s(A)=0.1, s(B)=0.9, s(A, B)=0.09. I(A, B)=9,φ(A, B)=0.89. c(A −→ B)=0.9, c(B −→ A)=0.9. (b) For table II, compute support, the interest measure, and the φ correla- tion coeﬃcient for the association pattern {A, B}. Also, compute the conﬁdence of rules A → B and B → A. Answer: s(A)=0.9, s(B)=0.9, s(A, B)=0.89. I(A, B)=1.09, φ(A, B)=0.89. c(A −→ B)=0.98, c(B −→ A)=0.98. (c) What conclusions can you draw from the results of (a) and (b)? Answer: Interest, support, and conﬁdence are non-invariant while the φ-coeﬃcient is invariant under the inversion operation. This is because φ-coeﬃcient 94 Chapter 6 Association Analysis takes into account the absence as well as the presence of an item in a transaction. 20. Consider the relationship between customers who buy high-deﬁnition televi- sions and exercise machines as shown in Tables 6.19 and 6.20. (a) Compute the odds ratios for both tables. Answer: For Table 6.19, odds ratio = 1.4938. For Table 6.20, the odds ratios are 0.8333 and 0.98. (b) Compute the φ-coeﬃcient for both tables. Answer: For table 6.19, φ =0.098. For Table 6.20, the φ-coeﬃcients are -0.0233 and -0.0047. (c) Compute the interest factor for both tables. Answer: For Table 6.19, I =1.0784. For Table 6.20, the interest factors are 0.88 and 0.9971. For each of the measures given above, describe how the direction of associa- tion changes when data is pooled together instead of being stratiﬁed. Answer: The direction of association changes sign (from negative to positive corre- lated) when the data is pooled together. 7 Association Analysis: Advanced Concepts 1. Consider the traﬃc accident data set shown in Table 7.1. Table 7.1. Trafﬁc accident data set. Weather Driver’s Traﬃc Seat Belt Crash Condition Condition Violation Severity Good Alcohol-impaired Exceed speed limit No Major Bad Sober None Yes Minor Good Sober Disobey stop sign Yes Minor Good Sober Exceed speed limit Yes Major Bad Sober Disobey traﬃc signal No Major Good Alcohol-impaired Disobey stop sign Yes Minor Bad Alcohol-impaired None Yes Major Good Sober Disobey traﬃc signal Yes Major Good Alcohol-impaired None No Major Bad Sober Disobey traﬃc signal No Major Good Alcohol-impaired Exceed speed limit Yes Major Bad Sober Disobey stop sign Yes Minor (a) Show a binarized version of the data set. Answer: See Table 7.2. (b) What is the maximum width of each transaction in the binarized data? Answer: 5 (c) Assuming that support threshold is 30%, how many candidate and fre- quent itemsets will be generated? 96 Chapter 7 Association Analysis: Advanced Concepts Table 7.2. Trafﬁc accident data set. Good Bad Alcohol Sober Exceed None Disobey Disobey Belt Belt Major Minor speed stop traﬃc =No =Yes 1 0 1 0 1 0 0 0 1 0 1 0 0 1 0 1 0 1 0 0 0 1 0 1 1 0 0 1 0 0 1 0 0 1 0 1 1 0 0 1 1 0 0 0 0 1 1 0 0 1 0 1 0 0 0 1 1 0 1 0 1 0 1 0 0 0 1 0 0 1 0 1 0 1 1 0 0 1 0 0 0 1 1 0 1 0 0 1 0 0 0 1 0 1 1 0 1 0 1 0 0 1 0 0 1 0 1 0 0 1 0 1 0 0 0 1 1 0 1 0 1 0 1 0 1 0 0 0 0 1 1 0 0 1 0 1 0 0 1 0 0 1 0 1 Answer: 5 The number of candidate itemsets from size 1 to size 3 is 10+28+3 = 41. The number of frequent itemsets from size 1 to size 3 is 8 + 10 + 0 = 18. (d) Create a data set that contains only the following asymmetric binary at- tributes: (Weather = Bad, Driver’s condition = Alcohol-impaired, Traffic violation = Yes, Seat Belt = No, Crash Severity = Major). For Traffic violation, only None has a value of 0. The rest of the attribute values are assigned to 1. Assuming that support threshold is 30%, how many candidate and frequent itemsets will be generated? Answer: The binarized data is shown in Table 7.3. Table 7.3. Trafﬁc accident data set. Bad Alcohol Traﬃc Belt Major Impaired violation =No 0 1 1 1 1 1 0 0 0 0 0 0 1 0 0 0 0 1 0 1 1 0 1 1 1 0 1 1 0 0 1 1 0 0 1 0 0 1 0 1 0 1 0 1 1 1 0 1 1 1 0 1 1 0 1 1 0 1 0 0 The number of candidate itemsets from size 1 to size 3 is 5+10+0 = 15. 97 The number of frequent itemsets from size 1 to size 3 is 5 + 3 + 0 = 8. (e) Compare the number of candidate and frequent itemsets generated in parts (c) and (d). Answer: The second method produces less number of candidate and frequent itemsets. 2. (a) Consider the data set shown in Table 7.4. Suppose we apply the fol- lowing discretization strategies to the continuous attributes of the data set. D1: Partition the range of each continuous attribute into 3 equal-sized bins. D2: Partition the range of each continuous attribute into 3 bins; where each bin contains an equal number of transactions For each strategy, answer the following questions: i. Construct a binarized version of the data set. ii. Derive all the frequent itemsets having support ≥ 30%. Table 7.4. Data set for Exercise 2. TID Temperature Pressure Alarm 1 Alarm 2 Alarm 3 1 95 1105 0 0 1 2 85 1040 1 1 0 3 103 1090 1 1 1 4 97 1084 1 0 0 5 80 1038 0 1 1 6 100 1080 1 1 0 7 83 1025 1 0 1 8 86 1030 1 0 0 9 101 1100 1 1 1 Answer: Table 7.5 shows the discretized data using D1, where the discretized intervals are: • X1: Temperature between 80 and 87, • X2: Temperature between 88 and 95, • X3: Temperature between 96 and 103, • Y1: Pressure between 1025 and 1051, • Y2: Pressure between 1052 and 1078, • Y3: Pressure between 1079 and 1105. 98 Chapter 7 Association Analysis: Advanced Concepts Table 7.5. Discretized data using D1. TID X1 X2 X3 Y1 Y2 Y3 Alarm1 Alarm2 Alarm3 1 0 1 0 0 0 1 0 0 1 2 1 0 0 1 0 0 1 1 0 3 0 0 1 0 0 1 1 1 1 4 0 0 1 0 0 1 1 0 0 5 1 0 0 1 0 0 0 1 1 6 0 0 1 0 0 1 1 1 0 7 1 0 0 1 0 0 1 0 1 8 1 0 0 1 0 0 1 0 0 9 0 0 1 0 0 1 1 1 1 Table 7.6. Discretized data using D2. TID X1 X2 X3 Y1 Y2 Y3 Alarm1 Alarm2 Alarm3 1 0 1 0 0 0 1 0 0 1 2 1 0 0 0 1 0 1 1 0 3 0 0 1 0 0 1 1 1 1 4 0 1 0 0 1 0 1 0 0 5 1 0 0 1 0 0 0 1 1 6 0 0 1 0 1 0 1 1 0 7 1 0 0 1 0 0 1 0 1 8 0 1 0 1 0 0 1 0 0 9 0 0 1 0 0 1 1 1 1 Table 7.6 shows the discretized data using D1, where the discretized intervals are: • X1: Temperature between 80 and 85, • X2: Temperature between 86 and 97, • X3: Temperature between 100 and 103, • Y1: Pressure between 1025 and 1038, • Y2: Pressure between 1039 and 1084, • Y3: Pressure between 1085 and 1105. For D1, there are 7 frequent 1-itemset, 12 frequent 2-itemset, and 5 frequent 3-itemset. For D2, there are 9 frequent 1-itemset, 7 frequent 2-itemset, and 1 frequent 3-itemset. (b) The continuous attribute can also be discretized using a clustering ap- proach. i. Plot a graph of temperature versus pressure for the data points shown in Table 7.4. 99 Answer: The graph of Temperature and Pressure is shown below. Pressure vs Temperature 1020 1030 1040 1050 1060 1070 1080 1090 1100 1110 75 80 85 90 95 100 105 Temperature Pressure C1 C2 Figure 7.1. Temperature versus Pressure. ii. How many natural clusters do you observe from the graph? Assign a label (C1, C2, etc.) to each cluster in the graph. Answer: There are two natural clusters in the data. iii. What type of clustering algorithm do you think can be used to identify the clusters? State your reasons clearly. Answer: K-means algorithm. iv. Replace the temperature and pressure attributes in Table 7.4 with asymmetric binary attributes C1, C2, etc. Construct a transaction matrix using the new attributes (along with attributes Alarm1, Alarm2, and Alarm3). Answer: Table 7.7. Example of numeric data set. TID C1 C2 Alarm1 Alarm2 Alarm3 1 0 1 0 0 1 2 1 0 1 1 0 3 0 1 1 1 1 4 0 1 1 0 0 5 1 0 0 1 1 6 0 1 1 1 0 7 1 0 1 0 1 8 1 0 1 0 0 9 0 1 1 1 1 100 Chapter 7 Association Analysis: Advanced Concepts v. Derive all the frequent itemsets having support ≥ 30% from the binarized data. Answer: There are 5 frequent 1-itemset, 7 frequent 2-itemset, and 1 frequent 3-itemset. 3. Consider the data set shown in Table 7.8. The ﬁrst attribute is continuous, while the remaining two attributes are asymmetric binary. A rule is consid- ered to be strong if its support exceeds 15% and its conﬁdence exceeds 60%. The data given in Table 7.8 supports the following two strong rules: (i) {(1 ≤ A ≤ 2),B =1}→{C =1} (ii) {(5 ≤ A ≤ 8),B =1}→{C =1} Table 7.8. Data set for Exercise 3. A B C 1 1 1 2 1 1 3 1 0 4 1 0 5 1 1 6 0 1 7 0 0 8 1 1 9 0 0 10 0 0 11 0 0 12 0 1 (a) Compute the support and conﬁdence for both rules. Answer: s({(1 ≤ A ≤ 2),B =1}→{C =1})=1/6 c({(1 ≤ A ≤ 2),B =1}→{C =1})=1 s({(5 ≤ A ≤ 8),B =1}→{C =1})=1/6 c({(5 ≤ A ≤ 8),B =1}→{C =1})=1 (b) To ﬁnd the rules using the traditional Apriori algorithm, we need to discretize the continuous attribute A. Suppose we apply the equal width binning approach to discretize the data, with bin-width =2, 3, 4. For each bin-width, state whether the above two rules are discovered by the Apriori algorithm. (Note that the rules may not be in the same exact form as before because it may contain wider or narrower intervals 101 for A.) For each rule that corresponds to one of the above two rules, compute its support and conﬁdence. Answer: When bin − width =2: Table 7.9. A Synthetic Data set A1 A2 A3 A4 A5 A6 B C 1 0 0 0 0 0 1 1 1 0 0 0 0 0 1 1 0 1 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 1 1 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 1 Where A1=1≤ A ≤ 2; A2=3≤ A ≤ 4; A3=5≤ A ≤ 6; A4=7≤ A ≤ 8; A5=9≤ A ≤ 10; A6=11≤ A ≤ 12; For the ﬁrst rule, there is one corresponding rule: {A1=1,B =1}→{C =1} s(A1=1,B =1}→{C =1})=1/6 c(A1=1,B =1}→{C =1})=1 Since the support and conﬁdence are greater than the thresholds, the rule can be discovered. For the second rule, there are two corresponding rules: {A3=1,B =1}→{C =1} {A4=1,B =1}→{C =1} For both rules, the support is 1/12 and the conﬁdence is 1. Since the support is less than the threshold (15%), these rules cannot be generated. 102 Chapter 7 Association Analysis: Advanced Concepts When bin − width =3: Table 7.10. A Synthetic Data set A1 A2 A3 A4 B C 1 0 0 0 1 1 1 0 0 0 1 1 1 0 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 1 0 1 0 0 0 1 0 0 1 0 0 0 0 0 1 0 1 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 1 Where A1=1≤ A ≤ 3; A2=4≤ A ≤ 6; A3=7≤ A ≤ 9; A4=10≤ A ≤ 12; For the ﬁrst rule, there is one corresponding rule: {A1=1,B =1}→{C =1} s(A1=1,B =1}→{C =1})=1/6 c(A1=1,B =1}→{C =1})=2/3 Since the support and conﬁdence are greater than the thresholds, the rule can be discovered. The discovered rule is in general form than the original rule. For the second rule, there are two corresponding rules: {A2=1,B =1}→{C =1} {A3=1,B =1}→{C =1} For both rules, the support is 1/12 and the conﬁdence is 1. Since the support is less than the threshold (15%), these rules cannot be generated. 103 When bin − width =4: Table 7.11. A Synthetic Data set A1 A2 A3 B C 1 0 0 1 1 1 0 0 1 1 1 0 0 1 0 1 0 0 1 0 0 1 0 1 1 0 1 0 0 1 0 1 0 0 0 0 1 0 1 1 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 1 Where A1=1≤ A ≤ 4; A2=5≤ A ≤ 8; A3=9≤ A ≤ 12; For the ﬁrst rule, there is one correspomding rule: {A1=1,B =1}→{C =1} s(A1=1,B =1}→{C =1})=1/6 c(A1=1,B =1}→{C =1})=1/2 Since the conﬁdence is less than the threshold (60%), then the rule cannot be generated. For the second rule, there is one corresponding rule: {A2=1,B =1}→{C =1} s(A2=1,B =1}→{C =1})=1/6 c(A2=1,B =1}→{C =1})=1 Since the support and threshold are greater than thresholds, the the rule can be discovered. (c) Comment on the eﬀectiveness of using the equal width approach for classifying the above data set. Is there a bin-width that allows you to 104 Chapter 7 Association Analysis: Advanced Concepts ﬁnd both rules satisfactorily? If not, what alternative approach can you take to ensure that you will ﬁnd both rules? Answer: None of the discretization methods can eﬀectively ﬁnd both rules. One approach to ensure that you can ﬁnd both rules is to start with bin width equals to 2 and consider all possible mergings of the adjacent intervals. For example, the discrete intervals are: 1 <= A<=2,3<= A<=4,5<= A<=6,·,11<= A<=12 1 <= A<=4,5<= A<=8,9<= A<=12 4. Consider the data set shown in Table 7.12. Table 7.12. Data set for Exercise 4. Age Number of Hours Online per Week (B) (A) 0–5 5–10 10 – 20 20 – 30 30 – 40 10 – 15 2 3 5 3 2 15 – 25 2 5 10 10 3 25 – 35 10 15 5 3 2 35 – 50 4 6 5 3 2 (a) For each combination of rules given below, specify the rule that has the highest conﬁdence. i. 15 minsup, which of the following itemsets are guaranteed to be frequent? (i) s({ˆp, q}), (ii) s({p, ˆq}), and (iii) s({ˆp, ˆq}). Answer: All three itemsets are guaranteed to be frequent. (c) Consider the association rule {p}−→{q}. Suppose the conﬁdence of the rule exceeds minconf. Which of the following rules are guaranteed to have conﬁdence higher than minconf? (i) {p}−→{ˆq}, (ii) {ˆp}−→ {q}, and (iii) {ˆp}−→{ˆq}. Answer: Only {p}−→{ˆq} is guaranteed to have conﬁdence higher than minconf. 9. (a) List all the 4-subsequences contained in the following data sequence: < {1, 3}{2}{2, 3}{4} >, assuming no timing constraints. Answer: < {1, 3}{2}{2} ><{1, 3}{2}{3} > < {1, 3}{2}{4} ><{1, 3}{2, 3} > < {1, 3}{3}{4} ><{1}{2}{2, 3} > < {1}{2}{2}{4} ><{1}{2}{3}{4} > < {1}{2, 3}{4} ><{3}{2}{2, 3} > < {3}{2}{2}{4} ><{3}{2}{3}{4} > < {3}{2, 3}{4} ><{2}{2, 3}{4} > (b) List all the 3-element subsequences contained in the data sequence for part (a) assuming that no timing constraints are imposed. Answer: < {1, 3}{2}{2, 3} ><{1, 3}{2}{4} > < {1, 3}{3}{4} ><{1, 3}{2}{2} > < {1, 3}{2}{3} ><{1, 3}{2, 3}{4} > < {1}{2}{2, 3} ><{1}{2}{4} > < {1}{3}{4} ><{1}{2}{2} > < {1}{2}{3} ><{1}{2, 3}{4} > < {3}{2}{2, 3} ><{3}{2}{4} > < {3}{3}{4} ><{3}{2}{2} > < {3}{2}{3} ><{3}{2, 3}{4} > 111 (c) List all the 4-subsequences contained in the data sequence for part (a) (assuming the timing constraints are ﬂexible). Answer: This will include all the subsequences in part (a) as well as the following: < {1, 2, 3, 4} ><{1, 2, 3}{2} > < {1, 2, 3}{3} ><{1, 2, 3}{4} > < {1, 3}{2, 4} ><{1, 3}{3, 4} > < {1}{2}{2, 4} ><{1}{2}{3, 4} > < {3}{2}{2, 4} ><{3}{2}{3, 4} > < {1, 2}{2, 3} ><{1, 2}{2, 4} > < {1, 2}{3, 4} ><{1, 2}{2}{4} > < {1, 2}{3}{4} ><{2, 3}{2, 3} > < {2, 3}{2, 4} ><{2, 3}{3, 4} > < {2, 3}{2}{4} ><{2, 3}{3}{4} > < {1}{2, 3, 4} ><{1}{2}{2, 4} > < {1}{2}{3, 4} ><{3}{2, 3, 4} > < {3}{2}{2, 4} ><{3}{2}{3, 4} > < {2}{2, 3, 4} > (d) List all the 3-element subsequences contained in the data sequence for part (a) (assuming the timing constraints are ﬂexible). Answer: This will include all the subsequences in part (b) as well as the following: < {1, 2, 3}{2}{4} ><{1, 2, 3}{3}{4} > < {1, 2, 3}{2, 3}{4} ><{1, 2}{2}{4} > < {1, 2}{3}{4} ><{1, 2}{2, 3}{4} > < {2, 3}{2}{4} ><{2, 3}{3}{4} > < {2, 3}{2, 3}{4} ><{1}{2}{2, 4} > < {1}{2}{3, 4} ><{1}{2}{2, 3, 4} > < {3}{2}{2, 4} ><{3}{2}{3, 4} > < {3}{2}{2, 3, 4} ><{1, 3}{2}{2, 4} > < {1, 3}{2}{3, 4} ><{1, 3}{2}{2, 3, 4} > 10. Find all the frequent subsequences with support ≥ 50% given the sequence database shown in Table 7.15. Assume that there are no timing constraints imposed on the sequences. Answer: < {A} >, < {B} >, < {C} >, < {D} >, < {E} > < {A}{C} >, < {A}{D} >, < {A}{E} >, < {B}{C} >, < {B}{D} >, < {B}{E} >, < {C}{D} >, < {C}{E} >, < {D, E} > 11. (a) For each of the sequences w =

贡献于2015-05-19