Privacy-Preserving Data Mining: Models and Algorithms


Privacy-Preserving Data Mining: Models and Algorithms

ADVANCES IN DATABASE SYSTEMS, Volume 34

Series Editors: Ahmed K. Elmagarmid (Purdue University, West Lafayette, IN 47907) and Amit P. Sheth (Wright State University, Dayton, Ohio 45435)

Other books in the Series:
SEQUENCE DATA MINING, Guozhu Dong, Jian Pei; ISBN: 978-0-387-69936-3
DATA STREAMS: Models and Algorithms, edited by Charu C. Aggarwal; ISBN: 978-0-387-28759-1
SIMILARITY SEARCH: The Metric Space Approach, P. Zezula, G. Amato, V. Dohnal, M. Batko; ISBN: 0-387-29146-6
STREAM DATA MANAGEMENT, Nauman Chaudhry, Kevin Shaw, Mahdi Abdelguerfi; ISBN: 0-387-24393-3
FUZZY DATABASE MODELING WITH XML, Zongmin Ma; ISBN: 0-387-24248-1
MINING SEQUENTIAL PATTERNS FROM LARGE DATA SETS, Wei Wang and Jiong Yang; ISBN: 0-387-24246-5
ADVANCED SIGNATURE INDEXING FOR MULTIMEDIA AND WEB APPLICATIONS, Yannis Manolopoulos, Alexandros Nanopoulos, Eleni Tousidou; ISBN: 1-4020-7425-5
ADVANCES IN DIGITAL GOVERNMENT: Technology, Human Factors, and Policy, edited by William J. McIver, Jr. and Ahmed K. Elmagarmid; ISBN: 1-4020-7067-5
INFORMATION AND DATABASE QUALITY, Mario Piattini, Coral Calero and Marcela Genero; ISBN: 0-7923-7599-8
DATA QUALITY, Richard Y. Wang, Mostapha Ziad, Yang W. Lee; ISBN: 0-7923-7215-8
THE FRACTAL STRUCTURE OF DATA REFERENCE: Applications to the Memory Hierarchy, Bruce McNutt; ISBN: 0-7923-7945-4
SEMANTIC MODELS FOR MULTIMEDIA DATABASE SEARCHING AND BROWSING, Shu-Ching Chen, R.L. Kashyap, and Arif Ghafoor; ISBN: 0-7923-7888-1
INFORMATION BROKERING ACROSS HETEROGENEOUS DIGITAL DATA: A Metadata-based Approach, Vipul Kashyap, Amit Sheth; ISBN: 0-7923-7883-0
DATA DISSEMINATION IN WIRELESS COMPUTING ENVIRONMENTS, Kian-Lee Tan and Beng Chin Ooi; ISBN: 0-7923-7866-0
MIDDLEWARE NETWORKS: Concept, Design and Deployment of Internet Infrastructure, Michah Lerner, George Vanecek, Nino Vidovic, Dad Vrsalovic; ISBN: 0-7923-7840-7
ADVANCED DATABASE INDEXING, Yannis Manolopoulos, Yannis Theodoridis, Vassilis J. Tsotras; ISBN: 0-7923-7716-8
MULTILEVEL SECURE TRANSACTION PROCESSING, Vijay Atluri, Sushil Jajodia, Binto George; ISBN: 0-7923-7702-8
FUZZY LOGIC IN DATA MODELING, Guoqing Chen; ISBN: 0-7923-8253-6
PRIVACY-PRESERVING DATA MINING: Models and Algorithms, edited by Charu C. Aggarwal and Philip S. Yu; ISBN: 0-387-70991-8

Privacy-Preserving Data Mining: Models and Algorithms

Edited by

Charu C. Aggarwal
IBM Thomas J. Watson Research Center
19 Skyline Drive, Hawthorne, NY 10532
charu@us.ibm.com

Philip S. Yu
Department of Computer Science, University of Illinois at Chicago
854 South Morgan Street, Chicago, IL 60607-7053
psyu@cs.uic.edu

Series Editors:
Ahmed K. Elmagarmid, Purdue University, West Lafayette, IN 47907
Amit P. Sheth, Wright State University, Dayton, Ohio 45435

ISBN 978-0-387-70991-8    e-ISBN 978-0-387-70992-5
DOI 10.1007/978-0-387-70992-5
Library of Congress Control Number: 2007943463

© 2008 Springer Science+Business Media, LLC. All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

Printed on acid-free paper.

springer.com

Preface

In recent years, advances in hardware technology have led to an increase in the capability to store and record personal data about consumers and individuals. This has led to concerns that the personal data may be misused for a variety of purposes. In order to alleviate these concerns, a number of techniques have recently been proposed in order to perform the data mining tasks in a privacy-preserving way. These techniques for performing privacy-preserving data mining are drawn from a wide array of related topics such as data mining, cryptography and information hiding. The material in this book draws on these different topics so as to provide a good overview of the important themes in the field.

While a large number of research papers are now available in this field, many of the topics have been studied by different communities with different styles. At this stage, it becomes important to organize the topics in such a way that the relative importance of different research areas is recognized. Furthermore, the field of privacy-preserving data mining has been explored independently by the cryptography, database and statistical disclosure control communities. In some cases, the parallel lines of work are quite similar, but the communities are not sufficiently integrated to provide a broader perspective. This book will contain chapters from researchers of all three communities and will therefore try to provide a balanced perspective of the work done in this field.

This book will be structured as an edited book from prominent researchers in the field. Each chapter will contain a survey which covers the key research content on the topic, and the future directions of research in the field. Emphasis will be placed on making each chapter self-sufficient. While the chapters will be written by different researchers, the topics and content are organized in such a way as to present the most important models, algorithms, and applications in the privacy field in a structured and concise way. In addition, attention is paid to drawing chapters from researchers working in different areas in order to provide different points of view. Given the lack of structurally organized information on the topic of privacy, the book will provide insights which are not easily accessible otherwise. A few chapters in the book are not surveys, since the corresponding topics fall in the emerging category, and enough material is not available to create a survey. In such cases, the individual results have been included to give a flavor of the emerging research in the field. It is expected that the book will be a great help to researchers and graduate students interested in the topic. While the privacy field clearly falls in the emerging category because of its recency, it is now beginning to reach a point of maturation and popularity where the development of an overview book on the topic becomes both possible and necessary. It is hoped that this book will provide a reference to students, researchers and practitioners in both introducing the topic of privacy-preserving data mining and understanding the practical and algorithmic aspects of the area.
Contents

Preface v
List of Figures xvii
List of Tables xxi

1 An Introduction to Privacy-Preserving Data Mining 1
Charu C. Aggarwal, Philip S. Yu
1.1. Introduction 1  1.2. Privacy-Preserving Data Mining Algorithms 3  1.3. Conclusions and Summary 7  References 8

2 A General Survey of Privacy-Preserving Data Mining Models and Algorithms 11
Charu C. Aggarwal, Philip S. Yu
2.1. Introduction 11  2.2. The Randomization Method 13  2.2.1 Privacy Quantification 15  2.2.2 Adversarial Attacks on Randomization 18  2.2.3 Randomization Methods for Data Streams 18  2.2.4 Multiplicative Perturbations 19  2.2.5 Data Swapping 19  2.3. Group Based Anonymization 20  2.3.1 The k-Anonymity Framework 20  2.3.2 Personalized Privacy-Preservation 24  2.3.3 Utility Based Privacy Preservation 24  2.3.4 Sequential Releases 25  2.3.5 The l-diversity Method 26  2.3.6 The t-closeness Model 27  2.3.7 Models for Text, Binary and String Data 27  2.4. Distributed Privacy-Preserving Data Mining 28  2.4.1 Distributed Algorithms over Horizontally Partitioned Data Sets 30  2.4.2 Distributed Algorithms over Vertically Partitioned Data 31  2.4.3 Distributed Algorithms for k-Anonymity 32  2.5. Privacy-Preservation of Application Results 32  2.5.1 Association Rule Hiding 33  2.5.2 Downgrading Classifier Effectiveness 34  2.5.3 Query Auditing and Inference Control 34  2.6. Limitations of Privacy: The Curse of Dimensionality 37  2.7. Applications of Privacy-Preserving Data Mining 38  2.7.1 Medical Databases: The Scrub and Datafly Systems 39  2.7.2 Bioterrorism Applications 40  2.7.3 Homeland Security Applications 40  2.7.4 Genomic Privacy 42  2.8. Summary 43  References 43

3 A Survey of Inference Control Methods for Privacy-Preserving Data Mining 53
Josep Domingo-Ferrer
3.1. Introduction 54  3.2. A Classification of Microdata Protection Methods 55  3.3. Perturbative Masking Methods 58  3.3.1 Additive Noise 58  3.3.2 Microaggregation 59  3.3.3 Data Swapping and Rank Swapping 61  3.3.4 Rounding 62  3.3.5 Resampling 62  3.3.6 PRAM 62  3.3.7 MASSC 63  3.4. Non-perturbative Masking Methods 63  3.4.1 Sampling 64  3.4.2 Global Recoding 64  3.4.3 Top and Bottom Coding 65  3.4.4 Local Suppression 65  3.5. Synthetic Microdata Generation 65  3.5.1 Synthetic Data by Multiple Imputation 65  3.5.2 Synthetic Data by Bootstrap 66  3.5.3 Synthetic Data by Latin Hypercube Sampling 66  3.5.4 Partially Synthetic Data by Cholesky Decomposition 67  3.5.5 Other Partially Synthetic and Hybrid Microdata Approaches 67  3.5.6 Pros and Cons of Synthetic Microdata 68  3.6. Trading off Information Loss and Disclosure Risk 69  3.6.1 Score Construction 69  3.6.2 R-U Maps 71  3.6.3 k-anonymity 71  3.7. Conclusions and Research Directions 72  References 73

4 Measures of Anonymity 81
Suresh Venkatasubramanian
4.1. Introduction 81  4.1.1 What is Privacy? 82  4.1.2 Data Anonymization Methods 83  4.1.3 A Classification of Methods 84  4.2. Statistical Measures of Anonymity 85  4.2.1 Query Restriction 85  4.2.2 Anonymity via Variance 85  4.2.3 Anonymity via Multiplicity 86  4.3. Probabilistic Measures of Anonymity 87  4.3.1 Measures Based on Random Perturbation 87  4.3.2 Measures Based on Generalization 90  4.3.3 Utility vs Privacy 94  4.4. Computational Measures of Anonymity 94  4.4.1 Anonymity via Isolation 97  4.5. Conclusions and New Directions 97  4.5.1 New Directions 98  References 99

5 k-Anonymous Data Mining: A Survey 105
V. Ciriani, S. De Capitani di Vimercati, S. Foresti, and P. Samarati
5.1. Introduction 105  5.2. k-Anonymity 107  5.3. Algorithms for Enforcing k-Anonymity 110
5.4. k-Anonymity Threats from Data Mining 117  5.4.1 Association Rules 118  5.4.2 Classification Mining 118  5.5. k-Anonymity in Data Mining 120  5.6. Anonymize-and-Mine 123  5.7. Mine-and-Anonymize 126  5.7.1 Enforcing k-Anonymity on Association Rules 126  5.7.2 Enforcing k-Anonymity on Decision Trees 130  5.8. Conclusions 133  Acknowledgments 133  References 134

6 A Survey of Randomization Methods for Privacy-Preserving Data Mining 137
Charu C. Aggarwal, Philip S. Yu
6.1. Introduction 137  6.2. Reconstruction Methods for Randomization 139  6.2.1 The Bayes Reconstruction Method 139  6.2.2 The EM Reconstruction Method 141  6.2.3 Utility and Optimality of Randomization Models 143  6.3. Applications of Randomization 144  6.3.1 Privacy-Preserving Classification with Randomization 144  6.3.2 Privacy-Preserving OLAP 145  6.3.3 Collaborative Filtering 145  6.4. The Privacy-Information Loss Tradeoff 146  6.5. Vulnerabilities of the Randomization Method 149  6.6. Randomization of Time Series Data Streams 151  6.7. Multiplicative Noise for Randomization 152  6.7.1 Vulnerabilities of Multiplicative Randomization 153  6.7.2 Sketch Based Randomization 153  6.8. Conclusions and Summary 154  References 154

7 A Survey of Multiplicative Perturbation for Privacy-Preserving Data Mining 157
Keke Chen and Ling Liu
7.1. Introduction 158  7.1.1 Data Privacy vs. Data Utility 159  7.1.2 Outline 160  7.2. Definition of Multiplicative Perturbation 161  7.2.1 Notations 161  7.2.2 Rotation Perturbation 161  7.2.3 Projection Perturbation 162  7.2.4 Sketch-based Approach 164  7.2.5 Geometric Perturbation 164  7.3. Transformation Invariant Data Mining Models 165  7.3.1 Definition of Transformation Invariant Models 166  7.3.2 Transformation-Invariant Classification Models 166  7.3.3 Transformation-Invariant Clustering Models 167  7.4. Privacy Evaluation for Multiplicative Perturbation 168  7.4.1 A Conceptual Multidimensional Privacy Evaluation Model 168  7.4.2 Variance of Difference as Column Privacy Metric 169  7.4.3 Incorporating Attack Evaluation 170  7.4.4 Other Metrics 171  7.5. Attack Resilient Multiplicative Perturbations 171  7.5.1 Naive Estimation to Rotation Perturbation 171  7.5.2 ICA-Based Attacks 173  7.5.3 Distance-Inference Attacks 174  7.5.4 Attacks with More Prior Knowledge 176  7.5.5 Finding Attack-Resilient Perturbations 177  7.6. Conclusion 177  Acknowledgment 178  References 179

8 A Survey of Quantification of Privacy Preserving Data Mining Algorithms 183
Elisa Bertino, Dan Lin and Wei Jiang
8.1. Introduction 184  8.2. Metrics for Quantifying Privacy Level 186  8.2.1 Data Privacy 186  8.2.2 Result Privacy 191  8.3. Metrics for Quantifying Hiding Failure 192  8.4. Metrics for Quantifying Data Quality 193  8.4.1 Quality of the Data Resulting from the PPDM Process 193  8.4.2 Quality of the Data Mining Results 198  8.5. Complexity Metrics 200  8.6. How to Select a Proper Metric 201  8.7. Conclusion and Research Directions 202  References 202

9 A Survey of Utility-based Privacy-Preserving Data Transformation Methods 207
Ming Hua and Jian Pei
9.1. Introduction 208  9.1.1 What is Utility-based Privacy Preservation? 209  9.2. Types of Utility-based Privacy Preservation Methods 210  9.2.1 Privacy Models 210  9.2.2 Utility Measures 212  9.2.3 Summary of the Utility-Based Privacy Preserving Methods 214  9.3. Utility-Based Anonymization Using Local Recoding 214  9.3.1 Global Recoding and Local Recoding 215  9.3.2 Utility Measure 216  9.3.3 Anonymization Methods 217  9.3.4 Summary and Discussion 219
9.4. The Utility-based Privacy Preserving Methods in Classification Problems 219  9.4.1 The Top-Down Specialization Method 220  9.4.2 The Progressive Disclosure Algorithm 224  9.4.3 Summary and Discussion 228  9.5. Anonymized Marginal: Injecting Utility into Anonymized Data Sets 228  9.5.1 Anonymized Marginal 229  9.5.2 Utility Measure 230  9.5.3 Injecting Utility Using Anonymized Marginals 231  9.5.4 Summary and Discussion 233  9.6. Summary 234  Acknowledgments 234  References 234

10 Mining Association Rules under Privacy Constraints 239
Jayant R. Haritsa
10.1. Introduction 239  10.2. Problem Framework 240  10.2.1 Database Model 240  10.2.2 Mining Objective 241  10.2.3 Privacy Mechanisms 241  10.2.4 Privacy Metric 243  10.2.5 Accuracy Metric 245  10.3. Evolution of the Literature 246  10.4. The FRAPP Framework 251  10.4.1 Reconstruction Model 252  10.4.2 Estimation Error 253  10.4.3 Randomizing the Perturbation Matrix 256  10.4.4 Efficient Perturbation 256  10.4.5 Integration with Association Rule Mining 258  10.5. Sample Results 259  10.6. Closing Remarks 263  Acknowledgments 263  References 263

11 A Survey of Association Rule Hiding Methods for Privacy 267
Vassilios S. Verykios and Aris Gkoulalas-Divanis
11.1. Introduction 267  11.2. Terminology and Preliminaries 269  11.3. Taxonomy of Association Rule Hiding Algorithms 270  11.4. Classes of Association Rule Algorithms 271  11.4.1 Heuristic Approaches 272  11.4.2 Border-based Approaches 277  11.4.3 Exact Approaches 278  11.5. Other Hiding Approaches 279  11.6. Metrics and Performance Analysis 281  11.7. Discussion and Future Trends 284  11.8. Conclusions 285  References 286

12 A Survey of Statistical Approaches to Preserving Confidentiality of Contingency Table Entries 291
Stephen E. Fienberg and Aleksandra B. Slavkovic
12.1. Introduction 291  12.2. The Statistical Approach to Privacy Protection 292  12.3. Datamining Algorithms, Association Rules, and Disclosure Limitation 294  12.4. Estimation and Disclosure Limitation for Multi-way Contingency Tables 295  12.5. Two Illustrative Examples 301  12.5.1 Example 1: Data from a Randomized Clinical Trial 301  12.5.2 Example 2: Data from the 1993 U.S. Current Population Survey 305  12.6. Conclusions 308  Acknowledgments 309  References 309

13 A Survey of Privacy-Preserving Methods Across Horizontally Partitioned Data 313
Murat Kantarcioglu
13.1. Introduction 313  13.2. Basic Cryptographic Techniques for Privacy-Preserving Distributed Data Mining 315  13.3. Common Secure Sub-protocols Used in Privacy-Preserving Distributed Data Mining 318  13.4. Privacy-preserving Distributed Data Mining on Horizontally Partitioned Data 323  13.5. Comparison to Vertically Partitioned Data Model 326  13.6. Extension to Malicious Parties 327  13.7. Limitations of the Cryptographic Techniques Used in Privacy-Preserving Distributed Data Mining 329  13.8. Privacy Issues Related to Data Mining Results 330  13.9. Conclusion 332  References 332

14 A Survey of Privacy-Preserving Methods Across Vertically Partitioned Data 337
Jaideep Vaidya
14.1. Introduction 337  14.2. Classification 341  14.2.1 Naïve Bayes Classification 342  14.2.2 Bayesian Network Structure Learning 343  14.2.3 Decision Tree Classification 344  14.3. Clustering 346  14.4. Association Rule Mining 347  14.5. Outlier detection 349  14.5.1 Algorithm 351  14.5.2 Security Analysis 352  14.5.3 Computation and Communication Analysis 354
14.6. Challenges and Research Directions 355  References 356

15 A Survey of Attack Techniques on Privacy-Preserving Data Perturbation Methods 359
Kun Liu, Chris Giannella, and Hillol Kargupta
15.1. Introduction 360  15.2. Definitions and Notation 360  15.3. Attacking Additive Data Perturbation 361  15.3.1 Eigen-Analysis and PCA Preliminaries 362  15.3.2 Spectral Filtering 363  15.3.3 SVD Filtering 364  15.3.4 PCA Filtering 365  15.3.5 MAP Estimation Attack 366  15.3.6 Distribution Analysis Attack 367  15.3.7 Summary 367  15.4. Attacking Matrix Multiplicative Data Perturbation 369  15.4.1 Known I/O Attacks 370  15.4.2 Known Sample Attack 373  15.4.3 Other Attacks Based on ICA 374  15.4.4 Summary 375  15.5. Attacking k-Anonymization 376  15.6. Conclusion 376  Acknowledgments 377  References 377

16 Private Data Analysis via Output Perturbation 383
Kobbi Nissim
16.1. Introduction 383  16.2. The Abstract Model – Statistical Databases, Queries, and Sanitizers 385  16.3. Privacy 388  16.3.1 Interpreting the Privacy Definition 390  16.4. The Basic Technique: Calibrating Noise to Sensitivity 394  16.4.1 Applications: Functions with Low Global Sensitivity 396  16.5. Constructing Sanitizers for Complex Functionalities 400  16.5.1 k-Means Clustering 401  16.5.2 SVD and PCA 403  16.5.3 Learning in the Statistical Queries Model 404  16.6. Beyond the Basics 405  16.6.1 Instance Based Noise and Smooth Sensitivity 406  16.6.2 The Sample-Aggregate Framework 408  16.6.3 A General Sanitization Mechanism 409  16.7. Related Work and Bibliographic Notes 409  Acknowledgments 411  References 411

17 A Survey of Query Auditing Techniques for Data Privacy 415
Shubha U. Nabar, Krishnaram Kenthapadi, Nina Mishra and Rajeev Motwani
17.1. Introduction 415  17.2. Auditing Aggregate Queries 416  17.2.1 Offline Auditing 417  17.2.2 Online Auditing 418  17.3. Auditing Select-Project-Join Queries 426  17.4. Challenges in Auditing 427  17.5. Reading 429  References 430

18 Privacy and the Dimensionality Curse 433
Charu C. Aggarwal
18.1. Introduction 433  18.2. The Dimensionality Curse and the k-anonymity Method 435  18.3. The Dimensionality Curse and Condensation 441  18.4. The Dimensionality Curse and the Randomization Method 446  18.4.1 Effects of Public Information 446  18.4.2 Effects of High Dimensionality 450  18.4.3 Gaussian Perturbing Distribution 450  18.4.4 Uniform Perturbing Distribution 455  18.5. The Dimensionality Curse and l-diversity 458  18.6. Conclusions and Research Directions 459  References 460

19 Personalized Privacy Preservation 461
Yufei Tao and Xiaokui Xiao
19.1. Introduction 461  19.2. Formalization of Personalized Anonymity 463  19.2.1 Personal Privacy Requirements 464  19.2.2 Generalization 465  19.3. Combinatorial Process of Privacy Attack 467  19.3.1 Primary Case 468  19.3.2 Non-primary Case 469  19.4. Theoretical Foundation 470  19.4.1 Notations and Basic Properties 471  19.4.2 Derivation of the Breach Probability 472  19.5. Generalization Algorithm 473  19.5.1 The Greedy Framework 474  19.5.2 Optimal SA-generalization 476  19.6. Alternative Forms of Personalized Privacy Preservation 478  19.6.1 Extension of k-anonymity 479  19.6.2 Personalization in Location Privacy Protection 480  19.7. Summary and Future Work 482  References 485

20 Privacy-Preserving Data Stream Classification 487
Yabo Xu, Ke Wang, Ada Wai-Chee Fu, Rong She, and Jian Pei
20.1. Introduction 487  20.1.1 Motivating Example 488  20.1.2 Contributions and Paper Outline 490  20.2. Related Works 491
20.3. Problem Statement 493  20.3.1 Secure Join Stream Classification 493  20.3.2 Naive Bayesian Classifiers 494  20.4. Our Approach 495  20.4.1 Initialization 495  20.4.2 Bottom-Up Propagation 496  20.4.3 Top-Down Propagation 497  20.4.4 Using NBC 499  20.4.5 Algorithm Analysis 500  20.5. Empirical Studies 501  20.5.1 Real-life Datasets 502  20.5.2 Synthetic Datasets 504  20.5.3 Discussion 506  20.6. Conclusions 507  References 508

Index 511

List of Figures

5.1 Simplified representation of a private table 108
5.2 An example of domain and value generalization hierarchies 109
5.3 Classification of k-anonymity techniques [11] 110
5.4 Generalization hierarchy for QI = {Marital status, Sex} 111
5.5 Index assignment to attributes Marital status and Sex 112
5.6 An example of set enumeration tree over set I = {1, 2, 3} of indexes 113
5.7 Sub-hierarchies computed by Incognito for the table in Figure 5.1 114
5.8 Spatial representation (a) and possible partitioning (b)-(d) of the table in Figure 5.1 116
5.9 An example of decision tree 119
5.10 Different approaches for combining k-anonymity and data mining 120
5.11 An example of top-down anonymization for the private table in Figure 5.1 124
5.12 Frequent itemsets extracted from the table in Figure 5.1 127
5.13 An example of binary table 128
5.14 Itemsets extracted from the table in Figure 5.13(b) 128
5.15 Itemsets with support at least equal to 40 (a) and corresponding anonymized itemsets (b) 129
5.16 3-anonymous version of the tree of Figure 5.9 131
5.17 Suppression of occurrences in non-leaf nodes in the tree in Figure 5.9 132
5.18 Table inferred from the decision tree in Figure 5.17 132
5.19 11-anonymous version of the tree in Figure 5.17 132
5.20 Table inferred from the decision tree in Figure 5.19 133
6.1 Illustration of the Information Loss Metric 149
7.1 Using known points and distance relationship to infer the rotation matrix 175
9.1 A taxonomy tree on categorical attribute Education 221
9.2 A taxonomy tree on continuous attribute Age 221
9.3 Interactive graph 232
9.4 A decomposition 232
10.1 CENSUS (γ = 19) 261
10.2 Perturbation Matrix Condition Numbers (γ = 19) 262
13.1 Relationship between Secure Sub-protocols and Privacy Preserving Distributed Data Mining on Horizontally Partitioned Data 323
14.1 Two dimensional problem that cannot be decomposed into two one-dimensional problems 340
15.1 Wigner's semi-circle law: a histogram of the eigenvalues of (A + Aᵀ)/(2√(2p)) for a large, randomly generated A 363
17.1 Skeleton of a simulatable private randomized auditor 423
18.1 Some Examples of Generalization for 2-Anonymity 435
18.2 Upper Bound of 2-anonymity Probability in a Non-Empty Grid Cell 439
18.3 Fraction of Data Points Preserving 2-Anonymity with Data Dimensionality (Gaussian Clusters) 440
18.4 Minimum Information Loss for 2-Anonymity (Gaussian Clusters) 445
18.5 Randomization Level with Increasing Dimensionality, Perturbation level = 8·σ_o (UniDis) 457
19.1 Microdata and generalization 462
19.2 The taxonomy of attribute Disease 463
19.3 A possible result of our generalization scheme 466
19.4 The voter registration list 468
19.5 Algorithm for computing personalized generalization 474
19.6 Algorithm for finding the optimal SA-generalization 478
19.7 Personalized k-anonymous generalization 480
20.1 Related streams / tables 489
20.2 The join stream 489
20.3 Example with 3 streams at initialization 496
20.4 After bottom-up propagations 498
20.5 After top-down propagations 499
20.6 UK road accident data (2001) 502
20.7 Classifier accuracy 503
20.8 Time per input tuple 503
20.9 Classifier accuracy vs. window size 505
20.10 Classifier accuracy vs. concept drifting interval 505
20.11 Time per input tuple vs. window size 506
20.12 Time per input tuple vs. blow-up ratio 506
20.13 Time per input tuple vs. number of streams 507

List of Tables

3.1 Perturbative methods vs data types. "X" denotes applicable and "(X)" denotes applicable with some adaptation 58
3.2 Example of rank swapping. Left, original file; right, rank-swapped file 62
3.3 Non-perturbative methods vs data types 64
9.1a The original table 209
9.2b A 2-anonymized table with better utility 209
9.3c A 2-anonymized table with poorer utility 209
9.4 Summary of utility-based privacy preserving methods 214
9.5a 3-anonymous table by global recoding 215
9.6b 3-anonymous table by local recoding 215
9.7a The original table 223
9.8b The anonymized table 223
9.9a The original table 225
9.10b The suppressed table 225
9.11a The original table 229
9.12b The anonymized table 229
9.13a Age Marginal 229
9.14b (Education, AnnualIncome) Marginal 229
10.1 CENSUS Dataset 260
10.2 Frequent Itemsets for sup_min = 0.02 260
12.1 Results of clinical trial for the effectiveness of an analgesic drug 302
12.2 Second panel has LP relaxation bounds, and third panel has sharp IP bounds for cell entries in Table 12.1 given [R|CST] conditional probability values 303
12.3 Sharp upper and lower bounds for cell entries in Table 12.1 given the [CSR] margin, and LP relaxation bounds given [R|CS] conditional probability values 304
12.4 Description of variables in CPS data extract 305
12.5 Marginal table [ACDGH] from 8-way CPS table 306
12.6 Summary of difference between upper and lower bounds for small cell counts in the full 8-way CPS table under Model 1 and under Model 2 307
14.1 The Weather Dataset 338
14.2 Arbitrary partitioning of data between 2 sites 339
14.3 Vertical partitioning of data between 2 sites 340
15.1 Summarization of Attacks on Additive Perturbation 368
15.2 Summarization of Attacks on Matrix Multiplicative Perturbation 375
18.1 Notations and Definitions 441

Chapter 1
An Introduction to Privacy-Preserving Data Mining

Charu C. Aggarwal
IBM T. J. Watson Research Center, Hawthorne, NY 10532
charu@us.ibm.com

Philip S. Yu
University of Illinois at Chicago, Chicago, IL 60607
psyu@cs.uic.edu

Abstract: The field of privacy has seen rapid advances in recent years because of the increases in the ability to store data. In particular, recent advances in the data mining field have led to increased concerns about privacy. While the topic of privacy has been traditionally studied in the context of cryptography and information-hiding, the recent emphasis on data mining has led to renewed interest in the field. In this chapter, we will introduce the topic of privacy-preserving data mining and provide an overview of the different topics covered in this book.

Keywords: Privacy-preserving data mining, privacy, randomization, k-anonymity.

1.1 Introduction

The problem of privacy-preserving data mining has become more important in recent years because of the increasing ability to store personal data about users, and the increasing sophistication of data mining algorithms that leverage this information. A number of techniques such as randomization and k-anonymity [1, 4, 16] have been suggested in recent years in order to perform privacy-preserving data mining.
Furthermore, the problem has been discussed in multiple communities such as the database community, the statistical disclosure control community and the cryptography community. In some cases, the different communities have explored parallel lines of work which are quite similar. This book will try to explore different topics from the perspective of different communities, and will try to give a unified view of the work done in these communities.

The key directions in the field of privacy-preserving data mining are as follows:

- Privacy-Preserving Data Publishing: These techniques study different transformation methods associated with privacy. They include methods such as randomization [1], k-anonymity [16, 7], and l-diversity [11]. Another related issue is how the perturbed data can be used in conjunction with classical data mining methods such as association rule mining [15]. Other related problems include that of determining privacy-preserving methods which keep the underlying data useful (utility-based methods), or the problem of studying the different definitions of privacy, and how they compare in terms of effectiveness in different scenarios.

- Changing the results of Data Mining Applications to preserve privacy: In many cases, the results of data mining applications such as association rule or classification rule mining can compromise the privacy of the data. This has spawned a field of privacy in which the results of data mining algorithms such as association rule mining are modified in order to preserve the privacy of the data. A classic example of such techniques is association rule hiding, in which some of the association rules are suppressed in order to preserve privacy.

- Query Auditing: Such methods are akin to the previous case of modifying the results of data mining algorithms. Here, we either modify or restrict the results of queries. Methods for perturbing the output of queries are discussed in [8], whereas techniques for restricting queries are discussed in [9, 13].

- Cryptographic Methods for Distributed Privacy: In many cases, the data may be distributed across multiple sites, and the owners of the data across these different sites may wish to compute a common function. In such cases, a variety of cryptographic protocols may be used in order to communicate among the different sites, so that secure function computation is possible without revealing sensitive information. A survey of such methods may be found in [14].

- Theoretical Challenges in High Dimensionality: Real data sets are usually extremely high dimensional, and this makes the process of privacy-preservation extremely difficult both from a computational and an effectiveness point of view. In [12], it has been shown that optimal k-anonymization is NP-hard. Furthermore, the technique is not even effective with increasing dimensionality, since the data can typically be combined with either public or background information to reveal the identity of the underlying record owners. A variety of methods for adversarial attacks in the high dimensional case are discussed in [5, 6].

This book will attempt to cover the different topics from the point of view of the different communities in the field. This chapter will provide an overview of the different privacy-preserving algorithms covered in this book.
We will discuss the challenges associated with each kind of problem, and provide an overview of the material in the corresponding chapter.

1.2 Privacy-Preserving Data Mining Algorithms

In this section, we will discuss the key privacy-preserving data mining problems and the challenges associated with each of them. We will also provide an overview of the material covered in each chapter of this book. The broad topics covered in this book are as follows:

General Survey. In chapter 2, we provide a broad survey of privacy-preserving data mining methods. We provide an overview of the different techniques and how they relate to one another. The individual topics will be covered in sufficient detail to provide the reader with a good reference point. The idea is to provide an overview of the field for a new reader from the perspective of the data mining community. However, more detailed discussions are deferred to later chapters which contain descriptions of different data mining algorithms.

Statistical Methods for Disclosure Control. The topic of privacy-preserving data mining has often been studied extensively by the data mining community without sufficient attention to the conventional work done by the statistical disclosure control community. In chapter 3, detailed methods for statistical disclosure control are presented along with some of the relationships to the parallel work done in the database and data mining community. This includes methods such as k-anonymity, swapping, randomization, micro-aggregation and synthetic data generation. The idea is to give the readers an overview of the common themes in privacy-preserving data mining across the different communities.

Measures of Anonymity. There are a very large number of definitions of anonymity in the privacy-preserving data mining field. This is partially because of the varying goals of different privacy-preserving data mining algorithms. For example, methods such as k-anonymity, l-diversity and t-closeness are all designed to prevent identification, though the final goal is to preserve the underlying sensitive information. Each of these methods is designed to prevent disclosure of sensitive information in a different way. Chapter 4 is a survey of different measures of anonymity. The chapter tries to define privacy from the perspective of anonymity measures and classifies such measures. The chapter also compares and contrasts different measures, and discusses their relative advantages. This chapter thus provides an overview and perspective of the different ways in which privacy could be defined, and what the relative advantages of each method might be.

The k-anonymity Method. An important method for privacy de-identification is the method of k-anonymity [16]. The motivating factor behind the k-anonymity technique is that many attributes in the data can often be considered pseudo-identifiers which can be used in conjunction with public records in order to uniquely identify the records. For example, if the identifications are removed from the records, attributes such as the birth date and zip-code can be used in order to uniquely identify the underlying records. The idea in k-anonymity is to reduce the granularity of representation of the data in such a way that a given record cannot be distinguished from at least (k − 1) other records. In chapter 5, the k-anonymity method is discussed in detail; a small illustrative sketch of the generalization idea follows below.
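As a concrete illustration, the following minimal Python sketch (our own illustration, not one of the algorithms surveyed in chapter 5; the attribute names and data are hypothetical) coarsens the birth date and zip-code quasi-identifiers and then checks whether every record has become indistinguishable from at least k − 1 others.

```python
from collections import Counter

def generalize(record):
    """Coarsen the quasi-identifiers: keep only the birth year and a 3-digit zip prefix."""
    return (record["birth_date"][:4], record["zip"][:3])

def is_k_anonymous(records, k):
    """True if every generalized quasi-identifier combination occurs at least k times."""
    counts = Counter(generalize(r) for r in records)
    return all(c >= k for c in counts.values())

records = [
    {"birth_date": "1975-03-12", "zip": "10532", "disease": "flu"},
    {"birth_date": "1975-07-01", "zip": "10598", "disease": "cold"},
    {"birth_date": "1962-11-23", "zip": "60607", "disease": "asthma"},
    {"birth_date": "1962-01-05", "zip": "60612", "disease": "flu"},
]

print(is_k_anonymous(records, k=2))  # True: each generalized tuple occurs twice
```

In practice, the generalization levels are chosen adaptively so that as little information as possible is lost while the k-anonymity condition is still met.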
A number of important algorithms for k-anonymity are discussed in the same chapter.

The Randomization Method. The randomization technique uses data distortion methods in order to create private representations of the records [1, 4]. In most cases, the individual records cannot be recovered, but only aggregate distributions can be recovered. These aggregate distributions can be used for data mining purposes. Two kinds of perturbation are possible with the randomization method:

- Additive Perturbation: In this case, randomized noise is added to the data records. The overall data distributions can be recovered from the randomized records. Data mining and management algorithms are designed to work with these data distributions. A detailed discussion of these methods is provided in chapter 6.

- Multiplicative Perturbation: In this case, random projection or random rotation techniques are used in order to perturb the records. A detailed discussion of these methods is provided in chapter 7.

In addition, these chapters deal with the issue of adversarial attacks and vulnerabilities of these methods.

Quantification of Privacy. A key issue in measuring the security of different privacy-preservation methods is the way in which the underlying privacy is quantified. The idea in privacy quantification is to measure the risk of disclosure for a given level of perturbation. In chapter 8, the issue of quantification of privacy is closely examined. The chapter also examines the issue of utility, and its natural tradeoff with privacy quantification. A discussion of the relative advantages of different kinds of methods is presented.

Utility Based Privacy-Preserving Data Mining. Most privacy-preserving data mining methods apply a transformation which reduces the effectiveness of the underlying data when it is applied to data mining methods or algorithms. In fact, there is a natural tradeoff between privacy and accuracy, though this tradeoff is affected by the particular algorithm which is used for privacy-preservation. A key issue is to maintain maximum utility of the data without compromising the underlying privacy constraints. In chapter 9, a broad overview of the different utility based methods for privacy-preserving data mining is presented. The issue of designing utility based algorithms to work effectively with certain kinds of data mining problems is addressed.

Mining Association Rules under Privacy Constraints. Since association rule mining is one of the important problems in data mining, we have devoted a number of chapters to this problem. There are two aspects to the privacy-preserving association rule mining problem:

- When the input to the data mining process is perturbed, it is a challenging problem to accurately determine the association rules on the perturbed data. Chapter 10 discusses the problem of association rule mining on the perturbed data.

- A different issue is that of output association rule privacy. In this case, we try to ensure that none of the association rules in the output result in leakage of sensitive data. This problem is referred to as association rule hiding [17] by the database community, and as contingency table privacy-preservation by the statistical community.

The problem of output association rule privacy is briefly discussed in chapter 10; a toy sketch of the rule hiding idea is given below.
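The sketch below is a toy illustration of the support-lowering idea only (not one of the specific algorithms surveyed in chapter 11; the transactions and threshold are hypothetical): a sensitive itemset is hidden by removing one of its items from supporting transactions until its support falls below the miner's threshold.

```python
# Toy transaction database; the itemset {"beer", "diapers"} is treated as sensitive.
transactions = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"beer", "diapers", "bread"},
    {"milk", "bread"},
    {"beer", "chips"},
    {"milk", "diapers"},
]
sensitive = frozenset({"beer", "diapers"})
min_support = 3                      # absolute support threshold used by the miner

def support(itemset, db):
    """Number of transactions containing every item of the itemset."""
    return sum(1 for t in db if itemset <= t)

def hide_itemset(db, itemset, threshold):
    """Greedily drop one item of the sensitive itemset from supporting transactions
    until its support falls below the mining threshold."""
    db = [set(t) for t in db]        # sanitize a copy, leaving the original data intact
    for t in db:
        if support(itemset, db) < threshold:
            break
        if itemset <= t:
            t.discard(next(iter(itemset)))   # remove an (arbitrary) item of the sensitive set
    return db

sanitized = hide_itemset(transactions, sensitive, min_support)
print("support before:", support(sensitive, transactions))   # 3 -> the rule would be reported
print("support after: ", support(sensitive, sanitized))      # below the threshold, so it is hidden
```

Realistic hiding algorithms choose which transactions and items to modify so as to minimize the side effects on the non-sensitive rules.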
A detailed survey of association rule hiding methods from the perspective of the database community is provided in chapter 11, and a discussion from the perspective of the statistical community is provided in chapter 12.

Cryptographic Methods for Information Sharing and Privacy. In many cases, multiple parties may wish to share aggregate private data, without leaking any sensitive information at their end [14]. For example, different superstores with sensitive sales data may wish to coordinate among themselves in knowing aggregate trends without leaking the trends of their individual stores. This requires secure and cryptographic protocols for sharing the information across the different parties. The data may be distributed in two ways across different sites:

- Horizontal Partitioning: In this case, the different sites may have different sets of records containing the same attributes.

- Vertical Partitioning: In this case, the different sites may have different attributes of the same sets of records.

Clearly, the challenges for the horizontal and vertical partitioning cases are quite different. In chapters 13 and 14, a variety of cryptographic protocols for horizontally and vertically partitioned data are discussed. The different kinds of cryptographic methods are introduced in chapter 13. Methods for horizontally partitioned data are discussed in chapter 13, whereas methods for vertically partitioned data are discussed in chapter 14.

Privacy Attacks. It is useful to examine the different ways in which one can make adversarial attacks on privacy-transformed data. This helps in designing more effective privacy-transformation methods. Some examples of methods which can be used in order to attack the privacy of the underlying data include SVD-based methods, spectral filtering methods and background knowledge attacks. In chapter 15, a detailed description of different kinds of attacks on data perturbation methods is provided.

Query Auditing and Inference Control. Many private databases are open to querying. This can compromise the security of the results, when the adversary can use different kinds of queries in order to undermine the security of the data. For example, a combination of range queries can be used in order to narrow down the possible values of a target record. Therefore, the results over multiple queries can be combined in order to uniquely identify a record, or at least reduce the uncertainty in identifying it. There are two primary methods for preventing this kind of attack:

- Query Output Perturbation: In this case, we add noise to the output of the query result in order to preserve privacy [8]. A detailed description of such methods is provided in chapter 16.

- Query Auditing: In this case, we choose to deny a subset of the queries, so that the particular combination of queries cannot be used in order to violate the privacy [9, 13]. A detailed survey of query auditing methods is provided in chapter 17.

Privacy and the Dimensionality Curse. In recent years, it has been observed that many privacy-preservation methods such as k-anonymity and randomization are not very effective in the high dimensional case [5, 6]. In chapter 18, we provide a detailed description of the effects of the dimensionality curse on different kinds of privacy-preserving data mining algorithms.
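To give a rough feel for why high dimensionality is so damaging, the following small simulation (our own illustration with hypothetical parameters) measures how quickly records become unique as more quasi-identifier attributes become available to an adversary; once most records are unique, any k-anonymization must generalize very aggressively and lose a great deal of information.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical data set: each attribute takes one of 5 equally likely values.
n_records = 10000
n_attributes = 20
values_per_attribute = 5
data = rng.integers(0, values_per_attribute, size=(n_records, n_attributes))

for d in (2, 4, 6, 8, 10):
    projected = [tuple(row[:d]) for row in data]          # attacker sees the first d attributes
    counts = {}
    for key in projected:
        counts[key] = counts.get(key, 0) + 1
    unique_fraction = sum(1 for key in projected if counts[key] == 1) / n_records
    print(f"{d:2d} quasi-identifier attributes -> {unique_fraction:5.1%} of records are unique")
```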
It is clear from the discussion in the chapter that most privacy methods are not very effective in the high dimensional case.

Personalized Privacy Preservation. In many applications, different subjects have different requirements for privacy. For example, a brokerage customer with a very large account would likely require a much higher level of privacy protection than a customer with a smaller account. In such cases, it is necessary to personalize the privacy-protection algorithm. In personalized privacy-preservation, we construct anonymizations of the data such that different records have a different level of privacy. Two examples of personalized privacy-preservation methods are discussed in [3, 18]. The method in [3] uses a condensation approach for personalized anonymization, while the method in [18] uses a more conventional generalization approach for anonymization. In chapter 19, a number of algorithms for personalized anonymity are examined.

Privacy-Preservation of Data Streams. A new topic in the area of privacy-preserving data mining is that of data streams, in which data grows rapidly at an unlimited rate. In such cases, the problem of privacy-preservation is quite challenging since the data is being released incrementally. In addition, the fast nature of data streams makes it impractical to revisit the entire past history of the data. We note that both the topics of data streams and privacy-preserving data mining are relatively new, and there has not been much work on combining the two topics. Some work has been done on performing randomization of data streams [10], and other work deals with the issue of condensation based anonymization of data streams [2]. Both of these methods are discussed in chapters 2 and 6, which are surveys on privacy and randomization respectively. Nevertheless, the literature on the stream topic remains sparse. Therefore, we have added chapter 20, which specifically deals with the issue of privacy-preserving classification of data streams. While this chapter is unlike other chapters in the sense that it is not a survey, we have included it in order to provide a flavor of the emerging techniques in this important area of research.

1.3 Conclusions and Summary

In this chapter, we introduced the problem of privacy-preserving data mining and discussed the broad areas of research in the field. The broad areas of privacy are as follows:

- Privacy-preserving data publishing: This corresponds to sanitizing the data, so that its privacy remains preserved.

- Privacy-Preserving Applications: This corresponds to designing data management and mining algorithms in such a way that the privacy remains preserved. Some examples include association rule mining, classification, and query processing.

- Utility Issues: Since the perturbed data may often be used for mining and management purposes, its utility needs to be preserved. Therefore, the data mining and privacy transformation techniques need to be designed effectively, so as to preserve the utility of the results.

- Distributed Privacy, cryptography and adversarial collaboration: This corresponds to secure communication protocols between trusted parties, so that information can be shared effectively without revealing sensitive information about particular parties. A toy secure-sum example illustrating the flavor of such protocols is sketched below.

We also discussed a broad overview of the different topics discussed in this book.
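As a flavor of the protocols surveyed in chapters 13 and 14, the following toy secure-sum sketch (our own illustration, assuming semi-honest parties that do not collude; values and party count are hypothetical) lets several parties compute an aggregate total without revealing their individual inputs.

```python
import random

# Each party splits its private value into random additive shares modulo a large prime
# and sends one share to every party; only the sum of all shares is ever reconstructed.
PRIME = 2 ** 61 - 1

def make_shares(value, n_parties):
    """Split `value` into n additive shares that sum to it modulo PRIME."""
    shares = [random.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

private_values = [120, 45, 300]               # e.g., per-store sales counts (hypothetical)
n = len(private_values)

# Party i computes shares of its own value and distributes them (share j goes to party j).
all_shares = [make_shares(v, n) for v in private_values]

# Each party adds up the shares it received; these partial sums reveal nothing individually.
partial_sums = [sum(all_shares[i][j] for i in range(n)) % PRIME for j in range(n)]

# Publishing only the partial sums allows everyone to recover the aggregate.
total = sum(partial_sums) % PRIME
print(total == sum(private_values))           # True: 465, with no individual value disclosed
```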
In the remaining chapters, the surveys will provide a comprehensive treatment of the topics in each category.

References

[1] Agrawal R., Srikant R.: Privacy-Preserving Data Mining. ACM SIGMOD Conference, 2000.
[2] Aggarwal C. C., Yu P. S.: A Condensation Approach to Privacy Preserving Data Mining. EDBT Conference, 2004.
[3] Aggarwal C. C., Yu P. S.: On Variable Constraints in Privacy Preserving Data Mining. SIAM Data Mining Conference, 2005.
[4] Agrawal D., Aggarwal C. C.: On the Design and Quantification of Privacy Preserving Data Mining Algorithms. ACM PODS Conference, 2002.
[5] Aggarwal C. C.: On k-anonymity and the Curse of Dimensionality. VLDB Conference, 2004.
[6] Aggarwal C. C.: On Randomization, Public Information, and the Curse of Dimensionality. ICDE Conference, 2007.
[7] Bayardo R. J., Agrawal R.: Data Privacy through Optimal k-Anonymization. ICDE Conference, 2005.
[8] Blum A., Dwork C., McSherry F., Nissim K.: Practical Privacy: The SuLQ Framework. ACM PODS Conference, 2005.
[9] Kenthapadi K., Mishra N., Nissim K.: Simulatable Auditing. ACM PODS Conference, 2005.
[10] Li F., Sun J., Papadimitriou S., Mihaila G., Stanoi I.: Hiding in the Crowd: Privacy Preservation on Evolving Streams through Correlation Tracking. ICDE Conference, 2007.
[11] Machanavajjhala A., Gehrke J., Kifer D.: l-diversity: Privacy beyond k-anonymity. IEEE ICDE Conference, 2006.
[12] Meyerson A., Williams R.: On the Complexity of Optimal k-Anonymity. ACM PODS Conference, 2004.
[13] Nabar S., Marthi B., Kenthapadi K., Mishra N., Motwani R.: Towards Robustness in Query Auditing. VLDB Conference, 2006.
[14] Pinkas B.: Cryptographic Techniques for Privacy-Preserving Data Mining. ACM SIGKDD Explorations, 4(2), 2002.
[15] Rizvi S., Haritsa J.: Maintaining Data Privacy in Association Rule Mining. VLDB Conference, 2002.
[16] Samarati P., Sweeney L.: Protecting Privacy when Disclosing Information: k-Anonymity and its Enforcement Through Generalization and Suppression. IEEE Symposium on Security and Privacy, 1998.
[17] Verykios V. S., Elmagarmid A., Bertino E., Saygin Y., Dasseni E.: Association Rule Hiding. IEEE Transactions on Knowledge and Data Engineering, 16(4), 2004.
[18] Xiao X., Tao Y.: Personalized Privacy Preservation. ACM SIGMOD Conference, 2006.

Chapter 2
A General Survey of Privacy-Preserving Data Mining Models and Algorithms

Charu C. Aggarwal
IBM T. J. Watson Research Center, Hawthorne, NY 10532
charu@us.ibm.com

Philip S. Yu
University of Illinois at Chicago, Chicago, IL 60607
psyu@cs.uic.edu

Abstract: In recent years, privacy-preserving data mining has been studied extensively, because of the wide proliferation of sensitive information on the internet. A number of algorithmic techniques have been designed for privacy-preserving data mining. In this paper, we provide a review of the state-of-the-art methods for privacy. We discuss methods for randomization, k-anonymization, and distributed privacy-preserving data mining. We also discuss cases in which the output of data mining applications needs to be sanitized for privacy-preservation purposes. We discuss the computational and theoretical limits associated with privacy-preservation over high dimensional data sets.

Keywords: Privacy-preserving data mining, randomization, k-anonymity.

2.1 Introduction

In recent years, data mining has been viewed as a threat to privacy because of the widespread proliferation of electronic data maintained by corporations.
This has led to increased concerns about the privacy of the underlying data. In recent years, a number of techniques have been proposed for modifying or transforming the data in such a way as to preserve privacy. A survey of some of the techniques used for privacy-preserving data mining may be found in [123]. In this chapter, we will provide an overview of the state-of-the-art in privacy-preserving data mining.

Privacy-preserving data mining finds numerous applications in surveillance, which are naturally supposed to be "privacy-violating" applications. The key is to design methods [113] which continue to be effective, without compromising security. In [113], a number of techniques have been discussed for bio-surveillance, facial de-identification, and identity theft. More detailed discussions on some of these issues may be found in [96, 114–116].

Most methods for privacy computations use some form of transformation on the data in order to perform the privacy preservation. Typically, such methods reduce the granularity of representation in order to preserve privacy. This reduction in granularity results in some loss of effectiveness of data management or mining algorithms. This is the natural trade-off between information loss and privacy. Some examples of such techniques are as follows:

- The randomization method: The randomization method is a technique for privacy-preserving data mining in which noise is added to the data in order to mask the attribute values of records [2, 5]. The noise added is sufficiently large so that individual record values cannot be recovered. Therefore, techniques are designed to derive aggregate distributions from the perturbed records. Subsequently, data mining techniques can be developed in order to work with these aggregate distributions. We will describe the randomization technique in greater detail in a later section.

- The k-anonymity model and l-diversity: The k-anonymity model was developed because of the possibility of indirect identification of records from public databases. This is because combinations of record attributes can be used to exactly identify individual records. In the k-anonymity method, we reduce the granularity of data representation with the use of techniques such as generalization and suppression. This granularity is reduced sufficiently that any given record becomes indistinguishable from at least k − 1 other records in the data. The l-diversity model was designed to handle some weaknesses in the k-anonymity model, since protecting identities to the level of k individuals is not the same as protecting the corresponding sensitive values, especially when there is homogeneity of sensitive values within a group. To do so, the concept of intra-group diversity of sensitive values is promoted within the anonymization scheme [83].

- Distributed privacy preservation: In many cases, individual entities may wish to derive aggregate results from data sets which are partitioned across these entities. Such partitioning may be horizontal (when the records are distributed across multiple entities) or vertical (when the attributes are distributed across multiple entities). While the individual entities may not desire to share their entire data sets, they may consent to limited information sharing with the use of a variety of protocols.
The overall effect of such methods is to maintain privacy for each individual entity, while deriving aggregate results over the entire data.

- Downgrading Application Effectiveness: In many cases, even though the data may not be available, the output of applications such as association rule mining, classification or query processing may result in violations of privacy. This has led to research on downgrading the effectiveness of applications by either data or application modifications. Some examples of such techniques include association rule hiding [124], classifier downgrading [92], and query auditing [1].

In this paper, we will provide a broad overview of the different techniques for privacy-preserving data mining. We will provide a review of the major algorithms available for each method, and the variations on the different techniques. We will also discuss a number of combinations of different concepts such as k-anonymous mining over vertically- or horizontally-partitioned data. We will also discuss a number of unique challenges associated with privacy-preserving data mining in the high dimensional case.

This paper is organized as follows. In section 2, we will introduce the randomization method for privacy-preserving data mining. In section 3, we will discuss the k-anonymization method along with its different variations. In section 4, we will discuss issues in distributed privacy-preserving data mining. In section 5, we will discuss a number of techniques for privacy which arise in the context of sensitive output of a variety of data mining and data management applications. In section 6, we will discuss some unique challenges associated with privacy in the high dimensional case. A number of applications of privacy-preserving models and algorithms are discussed in section 7. Section 8 contains the conclusions and discussions.

2.2 The Randomization Method

In this section, we will discuss the randomization method for privacy-preserving data mining. The randomization method has been traditionally used in the context of distorting data by a probability distribution, for methods such as surveys which have an evasive answer bias because of privacy concerns [74, 129]. This technique has also been extended to the problem of privacy-preserving data mining [2].

The method of randomization can be described as follows. Consider a set of data records denoted by X = {x_1 ... x_N}. For record x_i ∈ X, we add a noise component which is drawn from the probability distribution f_Y(y). These noise components are drawn independently, and are denoted y_1 ... y_N. Thus, the new set of distorted records is denoted by x_1 + y_1 ... x_N + y_N. We denote this new set of records by z_1 ... z_N. In general, it is assumed that the variance of the added noise is large enough, so that the original record values cannot be easily guessed from the distorted data. Thus, the original records cannot be recovered, but the distribution of the original records can be recovered. Thus, if X is the random variable denoting the data distribution of the original records, Y is the random variable describing the noise distribution, and Z is the random variable denoting the final (perturbed) records, we have:

Z = X + Y
X = Z − Y

Now, we note that N instantiations of the probability distribution Z are known, whereas the distribution Y is known publicly. For a large enough number of values of N, the distribution Z can be approximated closely by using a variety of methods such as kernel density estimation. By subtracting Y from the approximated distribution of Z, it is possible to approximate the original probability distribution X. In practice, one can combine the process of approximation of Z with the subtraction of the distribution Y from Z by using a variety of iterative methods such as those discussed in [2, 5]. Such iterative methods typically have a higher accuracy than the sequential solution of first approximating Z and then subtracting Y from it. In particular, the EM method proposed in [5] shows a number of optimal properties in approximating the distribution of X. A small simulation illustrating this perturb-and-reconstruct process is sketched below.
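The following sketch is our own illustration in the spirit of the iterative reconstruction methods of [2, 5] (not the exact published algorithms; the sample size, grid, and iteration count are arbitrary choices). It perturbs values drawn from a two-component distribution, similar to the one used in Example 2.1 below, with uniform noise, and then iteratively re-estimates the distribution of X on a histogram grid using only the perturbed values and the public noise distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hidden originals: a two-component uniform distribution (as in Example 2.1 below).
N = 20000
x = np.where(rng.random(N) < 0.5, rng.uniform(0, 1, N), rng.uniform(4, 5, N))

# Additive perturbation with publicly known noise Y ~ Uniform[-1, 1]; only z is released.
y = rng.uniform(-1, 1, N)
z = x + y

def noise_density(d):
    """Density of Y ~ Uniform[-1, 1] evaluated at d = z - a (known to everyone)."""
    return np.where(np.abs(d) <= 1.0, 0.5, 0.0)

# Iterative Bayes-style re-estimation of the distribution of X on a histogram grid.
edges = np.linspace(z.min(), z.max(), 81)
centers = 0.5 * (edges[:-1] + edges[1:])
f_x = np.full(centers.size, 1.0 / centers.size)          # start from a uniform guess

for _ in range(200):
    # posterior P(X in bin a | z_i) is proportional to f_Y(z_i - a) * current estimate f_x(a)
    post = noise_density(z[:, None] - centers[None, :]) * f_x[None, :]
    post /= np.maximum(post.sum(axis=1, keepdims=True), 1e-12)
    f_x = post.mean(axis=0)                               # updated histogram estimate

in_support = ((centers >= 0) & (centers <= 1)) | ((centers >= 4) & (centers <= 5))
print(f"estimated probability mass on [0,1] and [4,5]: {f_x[in_support].sum():.2f}")
```

Note that only the aggregate histogram is recovered; the individual values x_i remain hidden behind their noise.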
We note that at the end of the process, we only have a distribution containing the behavior of X. Individual records are not available. Furthermore, the distributions are available only along individual dimensions. Therefore, new data mining algorithms need to be designed to work with the uni-variate distributions rather than the individual records. This can sometimes be a challenge, since many data mining algorithms are inherently dependent on statistics which can only be extracted from either the individual records or the multi-variate probability distributions associated with the records. While the approach can certainly be extended to multi-variate distributions, density estimation becomes inherently more challenging [112] with increasing dimensionality. For even modest dimensionalities such as 7 to 10, the process of density estimation becomes increasingly inaccurate, and falls prey to the curse of dimensionality.

One key advantage of the randomization method is that it is relatively simple, and does not require knowledge of the distribution of other records in the data. This is not true of other methods such as k-anonymity which require the knowledge of other records in the data. Therefore, the randomization method can be implemented at data collection time, and does not require the use of a trusted server containing all the original records in order to perform the anonymization process. While this is a strength of the randomization method, it also leads to some weaknesses, since it treats all records equally irrespective of their local density. Therefore, outlier records are more susceptible to adversarial attacks as compared to records in more dense regions in the data [10]. In order to guard against this, one may need to be needlessly more aggressive in adding noise to all the records in the data. This reduces the utility of the data for mining purposes.

The randomization method has been extended to a variety of data mining problems. In [2], it was discussed how to use the approach for classification. A number of other techniques [143, 145] have also been proposed which seem to work well over a variety of different classifiers. Techniques have also been proposed for privacy-preserving methods of improving the effectiveness of classifiers. For example, the work in [51] proposes methods for privacy-preserving boosting of classifiers. Methods for privacy-preserving mining of association rules have been proposed in [47, 107]. The problem of association rules is especially challenging because of the discrete nature of the attributes corresponding to the presence or absence of items. In order to deal with this issue, the randomization technique needs to be modified slightly. Instead of adding quantitative noise, random items are dropped or included with a certain probability. The perturbed transactions are then used for aggregate association rule mining. This technique has been shown to be extremely effective in [47]. A small sketch of this bit-flipping style of randomization is given below.
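The sketch below is our own illustration with hypothetical data and a hypothetical retention probability; real randomization schemes of this kind additionally estimate the supports of larger itemsets by inverting the distortion process rather than only single-item supports.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy transaction database: each row is a transaction, each column an item (1 = present).
# Item 0 is popular (true support ~0.4), the rest are rare (~0.05).
n_txn, n_items = 50000, 5
true_prob = np.array([0.40, 0.05, 0.05, 0.05, 0.05])
T = (rng.random((n_txn, n_items)) < true_prob).astype(int)

# Item randomization: every bit is kept with probability p and flipped with probability 1 - p.
p = 0.9
flip = rng.random(T.shape) >= p
T_pert = np.where(flip, 1 - T, T)          # only the perturbed table is released

# Support estimation from the perturbed table.
# For a single item, observed support s' = s*p + (1 - s)*(1 - p), so s = (s' - (1 - p)) / (2p - 1).
s_obs = T_pert.mean(axis=0)
s_est = (s_obs - (1 - p)) / (2 * p - 1)

print("true support:      ", np.round(T.mean(axis=0), 3))
print("estimated support: ", np.round(s_est, 3))
```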
In order to deal with this issue, the randomization technique needs to be modified slightly. Instead of adding quantitative noise, random items are dropped or included with a certain probability. The perturbed transactions are then used for aggregate association rule mining. This technique has been shown to be extremely effective in [47]. The randomization approach has also been extended to other applications such as OLAP [3], and SVD based collaborative filtering [103].

2.2.1 Privacy Quantification

The quantity used to measure privacy should indicate how closely the original value of an attribute can be estimated. The work in [2] uses a measure that defines privacy as follows: If the original value can be estimated with c% confidence to lie in the interval [α1, α2], then the interval width (α2 − α1) defines the amount of privacy at c% confidence level. For example, if the perturbing additive is uniformly distributed in an interval of width 2α, then α is the amount of privacy at confidence level 50% and 2α is the amount of privacy at confidence level 100%. However, this simple method of determining privacy can be subtly incomplete in some situations. This can be best explained by the following example.

Example 2.1 Consider an attribute X with the density function fX(x) given by:

fX(x) = 0.5 for 0 ≤ x ≤ 1
fX(x) = 0.5 for 4 ≤ x ≤ 5
fX(x) = 0 otherwise

Assume that the perturbing additive Y is distributed uniformly between [−1, 1]. Then according to the measure proposed in [2], the amount of privacy is 2 at confidence level 100%.

However, after performing the perturbation and subsequent reconstruction, the density function fX(x) will be approximately revealed. Let us assume for a moment that a large amount of data is available, so that the distribution function is revealed to a high degree of accuracy. Since the (distribution of the) perturbing additive is publicly known, the two pieces of information can be combined to determine that if Z ∈ [−1, 2], then X ∈ [0, 1]; whereas if Z ∈ [3, 6], then X ∈ [4, 5]. Thus, in each case, the value of X can be localized to an interval of length 1. This means that the actual amount of privacy offered by the perturbing additive Y is at most 1 at confidence level 100%. We use the qualifier 'at most' since X can often be localized to an interval of length less than one. For example, if the value of Z happens to be −0.5, then the value of X can be localized to an even smaller interval of [0, 0.5].

This example illustrates that the method suggested in [2] does not take into account the distribution of the original data. In other words, the (aggregate) reconstruction of the attribute value also provides a certain level of knowledge which can be used to guess a data value to a higher level of accuracy. To accurately quantify privacy, we need a method which takes such side-information into account.

A key privacy measure [5] is based on the differential entropy of a random variable. The differential entropy h(A) of a random variable A is defined as follows:

h(A) = − ∫_{Ω_A} fA(a) log2 fA(a) da   (2.1)

where Ω_A is the domain of A. It is well known that h(A) is a measure of the uncertainty inherent in the value of A [111]. It can be easily seen that for a random variable U distributed uniformly between 0 and a, h(U) = log2(a). For a = 1, h(U) = 0. In [5], it was proposed that 2^h(A) is a measure of privacy inherent in the random variable A. This value is denoted by Π(A). A numerical illustration of this measure, using the density of Example 2.1, is sketched below.
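As a quick numerical check of this measure, the following sketch evaluates the differential entropy h(X) and the resulting privacy Π(X) = 2^h(X) for the two-piece uniform density of Example 2.1 by a simple Riemann sum. This is a worked illustration only; the closed-form values follow directly from the definition, and the grid resolution is an arbitrary choice.

```python
import numpy as np

# Density of Example 2.1: f_X(x) = 0.5 on [0, 1], 0.5 on [4, 5], 0 elsewhere.
def f_X(x):
    return np.where(((x >= 0) & (x <= 1)) | ((x >= 4) & (x <= 5)), 0.5, 0.0)

# h(X) = -integral of f_X(x) * log2 f_X(x) dx, approximated on a fine grid.
x = np.linspace(-1.0, 6.0, 70001)
dx = x[1] - x[0]
fx = f_X(x)
integrand = np.where(fx > 0, -fx * np.log2(fx), 0.0)
h_X = integrand.sum() * dx

print("h(X)  =", round(h_X, 3))        # close to 1.0
print("Pi(X) =", round(2 ** h_X, 3))   # close to 2.0: as much privacy as a uniform interval of length 2
```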
Thus, a random variable U distributed uniformly between 0 and a has privacy Π(U) = 2^log2(a) = a. For a general random variable A, Π(A) denotes the length of the interval over which a uniformly distributed random variable has the same uncertainty as A.

Given a random variable B, the conditional differential entropy of A is defined as follows:

h(A|B) = − ∫_{Ω_{A,B}} fA,B(a, b) log2 fA|B=b(a) da db   (2.2)

Thus, the average conditional privacy of A given B is Π(A|B) = 2^h(A|B). This motivates the following metric P(A|B) for the conditional privacy loss of A, given B:

P(A|B) = 1 − Π(A|B)/Π(A) = 1 − 2^h(A|B)/2^h(A) = 1 − 2^−I(A;B)

where I(A;B) = h(A) − h(A|B) = h(B) − h(B|A). I(A;B) is also known as the mutual information between the random variables A and B. Clearly, P(A|B) is the fraction of privacy of A which is lost by revealing B.

As an illustration, let us reconsider Example 2.1 given above. In this case, the differential entropy of X is given by:

h(X) = − ∫_{Ω_X} fX(x) log2 fX(x) dx = − ∫_0^1 0.5 log2 0.5 dx − ∫_4^5 0.5 log2 0.5 dx = 1

Thus the privacy of X is Π(X) = 2^1 = 2. In other words, X has as much privacy as a random variable distributed uniformly in an interval of length 2. The density function of the perturbed value Z is given by fZ(z) = ∫_{−∞}^{∞} fX(ν) fY(z − ν) dν. Using fZ(z), we can compute the differential entropy h(Z) of Z. It turns out that h(Z) = 9/4. Therefore, we have:

I(X;Z) = h(Z) − h(Z|X) = 9/4 − h(Y) = 9/4 − 1 = 5/4

Here, the second equality h(Z|X) = h(Y) follows from the fact that X and Y are independent and Z = X + Y. Thus, the fraction of privacy loss in this case is P(X|Z) = 1 − 2^(−5/4) ≈ 0.5796. Therefore, after revealing Z, X has privacy Π(X|Z) = Π(X) × (1 − P(X|Z)) = 2 × (1.0 − 0.5796) ≈ 0.8408. This value is less than 1, since X can be localized to an interval of length less than one for many values of Z.

The problem of privacy quantification has been studied quite extensively in the literature, and a variety of metrics have been proposed to quantify privacy. A number of quantification issues in the measurement of privacy breaches have been discussed in [46, 48]. In [19], the problem of privacy-preservation has been studied from the broader context of the tradeoff between privacy and information loss. We note that the quantification of privacy alone is not sufficient without quantifying the utility of the data created by the randomization process. A framework has been proposed to explore this tradeoff for a variety of different privacy transformation algorithms.

2.2.2 Adversarial Attacks on Randomization

In the earlier section on privacy quantification, we illustrated an example in which the reconstructed distribution of the data can be used in order to reduce the privacy of the underlying data records. In general, a systematic approach can be used to do this in multi-dimensional data sets with the use of spectral filtering or PCA based techniques [54, 66]. The broad idea in techniques such as PCA [54] is that the correlation structure in the original data can be estimated fairly accurately (in larger data sets) even after noise addition. Once the broad correlation structure in the data has been determined, one can then try to remove the noise in the data in such a way that it fits the aggregate correlation structure of the data.
It has been shown that such techniques can reduce the privacy of the perturbation process significantly, since the noise removal results in values which are fairly close to their original values [54, 66]. Some other discussions on limiting breaches of privacy in the randomization method may be found in [46].

A second kind of adversarial attack is with the use of public information. Consider a record X = (x1 ... xd), which is perturbed to Z = (z1 ... zd). Then, since the distribution of the perturbations is known, we can try to use a maximum likelihood fit of the potential perturbation of Z to a public record. Consider the public record W = (w1 ... wd). Then, the potential perturbation of Z with respect to W is given by (Z − W) = (z1 − w1 ... zd − wd). Each of these values (zi − wi) should fit the distribution fY(y). The corresponding log-likelihood fit is given by Σ_{i=1}^{d} log(fY(zi − wi)). The higher the log-likelihood fit, the greater the probability that the record W corresponds to X. If it is known that the public data set always includes X, then the maximum likelihood fit can provide a high degree of certainty in identifying the correct record, especially in cases where d is large. We will discuss this issue in greater detail in a later section.

2.2.3 Randomization Methods for Data Streams

The randomization approach is particularly well suited to privacy-preserving data mining of streams, since the noise added to a given record is independent of the rest of the data. However, streams provide a particularly vulnerable target for adversarial attacks with the use of PCA based techniques [54] because of the large volume of the data available for analysis. In [78], an interesting technique for randomization has been proposed which uses the auto-correlations in different time series while deciding the noise to be added to any particular value. It has been shown in [78] that such an approach is more robust, since the noise correlates with the stream behavior, and it is more difficult to create effective adversarial attacks with the use of correlation analysis techniques.

2.2.4 Multiplicative Perturbations

The most common method of randomization is that of additive perturbations. However, multiplicative perturbations can also be used to good effect for privacy-preserving data mining. Many of these techniques derive their roots in the work of [61], which shows how to use multi-dimensional projections in order to reduce the dimensionality of the data. This technique preserves the inter-record distances approximately, and therefore the transformed records can be used in conjunction with a variety of data mining applications. In particular, the approach is discussed in detail in [97, 98], in which it is shown how to use the method for privacy-preserving clustering. The technique can also be applied to the problem of classification as discussed in [28]. Multiplicative perturbations can also be used for distributed privacy-preserving data mining. Details can be found in [81]. A number of techniques for multiplicative perturbation in the context of masking census data may be found in [70]. A variation on this theme may be implemented with the use of distance preserving Fourier transforms, which work effectively for a variety of cases [91]. As in the case of additive perturbations, multiplicative perturbations are not entirely safe from adversarial attacks; a sketch of a simple projection-based perturbation, which the attacks discussed next would target, is provided below.
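One simple way to realize a projection-based multiplicative perturbation is sketched below: the records are multiplied by a random matrix that maps them into a lower-dimensional space, and pairwise distances are approximately preserved. This is only a generic random-projection illustration in the spirit of the techniques derived from [61], not the specific constructions of [97, 98, 28]; the data sizes and the Gaussian projection matrix are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)

# A hypothetical data set of N records with d attributes.
N, d, k = 500, 50, 20                      # k < d: dimensionality of the projected (released) data
X = rng.normal(size=(N, d))

# Random projection matrix; scaling by 1/sqrt(k) keeps distances roughly unchanged.
R = rng.normal(size=(d, k)) / np.sqrt(k)

X_perturbed = X @ R                        # only the projected records are released

# Check how well pairwise distances are preserved for a few record pairs.
pairs = rng.integers(0, N, size=(5, 2))
for i, j in pairs:
    orig = np.linalg.norm(X[i] - X[j])
    proj = np.linalg.norm(X_perturbed[i] - X_perturbed[j])
    print(f"pair ({i},{j}): original distance {orig:.2f}, projected distance {proj:.2f}")
```

Because distances are approximately retained, distance-based mining tasks such as clustering can still be carried out on the released records.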
In general, if the attacker has no prior knowledge of the data, then it is relatively difficult to attack the privacy of the transformation. However, with some prior knowledge, two kinds of attacks are possible [82]:

Known Input-Output Attack: In this case, the attacker knows some linearly independent collection of records, and their corresponding perturbed versions. In such cases, linear algebra techniques can be used to reverse-engineer the nature of the privacy-preserving transformation.

Known Sample Attack: In this case, the attacker has a collection of independent data samples from the same distribution from which the original data was drawn. In such cases, principal component analysis techniques can be used in order to reconstruct the behavior of the original data.

2.2.5 Data Swapping

We note that noise addition or multiplication is not the only technique which can be used to perturb the data. A related method is that of data swapping, in which the values across different records are swapped in order to perform the privacy-preservation [49]. One advantage of this technique is that the lower order marginal totals of the data are completely preserved and are not perturbed at all. Therefore certain kinds of aggregate computations can be exactly performed without violating the privacy of the data. We note that this technique does not follow the general principle in randomization which allows the value of a record to be perturbed independently of the other records. Therefore, this technique can be used in combination with other frameworks such as k-anonymity, as long as the swapping process is designed to preserve the definitions of privacy for that model.

2.3 Group Based Anonymization

The randomization method is a simple technique which can be easily implemented at data collection time, because the noise added to a given record is independent of the behavior of other data records. This is also a weakness, because outlier records can often be difficult to mask. Clearly, in cases in which the privacy-preservation does not need to be performed at data-collection time, it is desirable to have a technique in which the level of inaccuracy depends upon the behavior of the locality of that given record. Another key weakness of the randomization framework is that it does not consider the possibility that publicly available records can be used to identify the owners of that record. In [10], it has been shown that the use of publicly available records can lead to the privacy getting heavily compromised in high-dimensional cases. This is especially true of outlier records which can be easily distinguished from other records in their locality. Therefore, a broad approach to many privacy transformations is to construct groups of anonymous records which are transformed in a group-specific way.

2.3.1 The k-Anonymity Framework

In many applications, the data records are made available by simply removing key identifiers such as names and social-security numbers from personal records. However, other kinds of attributes (known as pseudo-identifiers) can be used in order to accurately identify the records. For example, attributes such as age, zip-code and sex are available in public records such as census rolls. When these attributes are also available in a given data set, they can be used to infer the identity of the corresponding individual, as illustrated by the small sketch below.
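The following toy sketch illustrates this linkage: a "de-identified" medical table still contains age, zip code and sex, and joining it with a public voter-style list on those attributes re-identifies the records. All names and values are invented purely for illustration.

```python
# A de-identified table: explicit identifiers removed, but quasi-identifiers retained.
deidentified = [
    {"age": 34, "zip": "47906", "sex": "F", "diagnosis": "diabetes"},
    {"age": 52, "zip": "60607", "sex": "M", "diagnosis": "hypertension"},
    {"age": 34, "zip": "47906", "sex": "M", "diagnosis": "asthma"},
]

# A public record (e.g., a voter list) containing names plus the same attributes.
public = [
    {"name": "Alice",   "age": 34, "zip": "47906", "sex": "F"},
    {"name": "Bob",     "age": 52, "zip": "60607", "sex": "M"},
    {"name": "Charlie", "age": 29, "zip": "10532", "sex": "M"},
]

QUASI_IDENTIFIERS = ("age", "zip", "sex")

def key(record):
    return tuple(record[attr] for attr in QUASI_IDENTIFIERS)

public_index = {}
for person in public:
    public_index.setdefault(key(person), []).append(person["name"])

# Each de-identified record that matches exactly one public identity is re-identified.
for row in deidentified:
    matches = public_index.get(key(row), [])
    if len(matches) == 1:
        print(f"{matches[0]} is linked to diagnosis: {row['diagnosis']}")
```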
A combination of these attributes can be very powerful, since they can be used to narrow down the possibilities to a small number of individuals. In k-anonymity techniques [110], we reduce the granularity of representa- tion of these pseudo-identifiers with the use of techniques such as general- ization and suppression. In the method of generalization, the attribute values are generalized to a range in order to reduce the granularity of representation. For example, the date of birth could be generalized to a range such as year of birth, so as to reduce the risk of identification. In the method of suppression, the value of the attribute is removed completely. It is clear that such methods reduce the risk of identification with the use of public records, while reducing the accuracy of applications on the transformed data. A General Survey of Privacy-Preserving Data Mining Models and Algorithms 21 In order to reduce the risk of identification, the k-anonymity approach re- quires that every tuple in the table be indistinguishability related to no fewer than k respondents. This can be formalized as follows: Definition 2.2 Each release of the data must be such that every combina- tion of values of quasi-identifiers can be indistinguishably matched to at least k respondents. The first algorithm for k-anonymity was proposed in [110]. The approach uses domain generalization hierarchies of the quasi-identifiers in order to build k-anonymous tables. The concept of k-minimal generalization has been pro- posed in [110] in order to limit the level of generalization for maintaining as much data precision as possible for a given level of anonymity. Subsequently, the topic of k-anonymity has been widely researched. A good overview and survey of the corresponding algorithms may be found in [31]. We note that the problem of optimal anonymization is inherently a difficult one. In [89], it has been shown that the problem of optimal k-anonymization is NP-hard. Nevertheless, the problem can be solved quite effectively by the use of a number of heuristic methods. A method proposed by Bayardo and Agrawal [18] is the k-Optimize algorithm which can often obtain effective solutions. The approach assumes an ordering among the quasi-identifier attributes. The values of the attributes are discretized into intervals (quantitative attributes) or grouped into different sets of values (categorical attributes). Each such group- ing is an item. For a given attribute, the corresponding items are also ordered. An index is created using these attribute-interval pairs (or items) and a set enumeration tree is constructed on these attribute-interval pairs. This set enu- meration tree is a systematic enumeration of all possible generalizations with the use of these groupings. The root of the node is the null node, and every successive level of the tree is constructed by appending one item which is lex- icographically larger than all the items at that node of the tree. We note that the number of possible nodes in the tree increases exponentially with the data dimensionality. Therefore, it is not possible to build the entire tree even for modest values of n.However,thek-Optimize algorithm can use a number of pruning strategies to good effect. In particular, a node of the tree can be pruned when it is determined that no descendent of it could be optimal. 
This can be done by computing a bound on the quality of all descendents of that node, and comparing it to the quality of the current best solution obtained during the traversal process. A branch and bound technique can be used to successively improve the quality of the solution during the traversal process. Eventually, it is possible to terminate the algorithm at a maximum computational time, and use the current solution at that point, which is often quite good, but may not be optimal. 22 Privacy-Preserving Data Mining: Models and Algorithms In [75], the Incognito method has been proposed for computing a k-minimal generalization with the use of bottom-up aggregation along domain generaliza- tion hierarchies. The Incognito method uses a bottom-up breadth-first search of the domain generalization hierarchy, in which it generates all the possible mini- mal k-anonymous tables for a given private table. First, it checks k-anonymity for each single attribute, and removes all those generalizations which do not satisfy k-anonymity. Then, it computes generalizations in pairs, again pruning those pairs which do not satisfy the k-anonymity constraints. In general, the Incognito algorithm computes (i +1)-dimensional generalization candidates from the i-dimensional generalizations, and removes all those those generaliza- tions which do not satisfy the k-anonymity constraint. This approach is contin- ued until, no further candidates can be constructed, or all possible dimensions have been exhausted. We note that the methods in [76, 75] use a more gen- eral model for k-anonymity than that in [110]. This is because the method in [110] assumes that the value generalization hierarchy is a tree, whereas that in [76, 75] assumes that it is a graph. Two interesting methods for top-down specialization and bottom-up gener- alization for k-anonymity have been proposed in [50, 125]. In [50], a top-down heuristic is designed, which starts with a general solution, and then special- izes some attributes of the current solution so as to increase the information, but reduce the anonymity. The reduction in anonymity is always controlled, so that k-anonymity is never violated. At the same time each step of the spe- cialization is controlled by a goodness metric which takes into account both the gain in information and the loss in anonymity. A complementary method to top down specialization is that of bottom up generalization,forwhichan interesting method is proposed in [125]. We note that generalization and suppression are not the only transformation techniques for implementing k-anonymity. For example in [38] it is discussed how to use micro-aggregation in which clusters of records are constructed. For each cluster, its representative value is the average value along each dimen- sion in the cluster. A similar method for achieving anonymity via clustering is proposed in [15]. The work in [15] also provides constant factor approxi- mation algorithms to design the clustering. In [8], a related method has been independently proposed for condensation based privacy-preserving data min- ing. This technique generates pseudo-data from clustered groups of k-records. The process of pseudo-data generation uses principal component analysis of the behavior of the records within a group. It has been shown in [8], that the approach can be effectively used for the problem of classification. 
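A much simplified sketch of this style of condensation-based pseudo-data generation is shown below: records are grouped into clusters of at least k members, the first and second order statistics of each group are retained, and synthetic records are sampled from a Gaussian with those statistics. The actual technique in [8] generates pseudo-data along the principal components of each group; the grouping heuristic and Gaussian sampling here are stand-ins chosen to keep the sketch short.

```python
import numpy as np

rng = np.random.default_rng(1)

def condense(X, k):
    """Group records into clusters of size >= k and return synthetic pseudo-data.

    Records are ordered along their leading principal direction and chopped into
    consecutive groups of k, which is a crude stand-in for a real clustering step.
    """
    n, d = X.shape
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    order = np.argsort(Xc @ vt[0])
    synthetic = []
    for start in range(0, n - n % k, k):
        group = X[order[start:start + k]]
        mean = group.mean(axis=0)
        cov = np.cov(group, rowvar=False) + 1e-6 * np.eye(d)   # regularize small groups
        # Release k synthetic records with the same first/second order statistics.
        synthetic.append(rng.multivariate_normal(mean, cov, size=k))
    return np.vstack(synthetic)

# Hypothetical numeric data set; aggregate statistics survive, individual rows do not.
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))
pseudo = condense(X, k=10)
print("original mean:  ", np.round(X.mean(axis=0), 2))
print("synthetic mean: ", np.round(pseudo.mean(axis=0), 2))
```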
We note that the use of pseudo-data provides an additional layer of protection, since it is difficult to perform adversarial attacks on synthetic data. At the same time, the aggregate behavior of the data is preserved, and this can be useful for a variety of data mining problems. A General Survey of Privacy-Preserving Data Mining Models and Algorithms 23 Since the problem of k-anonymization is essentially a search over a space of possible multi-dimensional solutions, standard heuristic search techniques such as genetic algorithms or simulated annealing can be effectively used. Such a technique has been proposed in [130] in which a simulated annealing algo- rithm is used in order to generate k-anonymous representations of the data. An- other technique proposed in [59] uses genetic algorithms in order to construct k-anonymous representations of the data. Both of these techniques require high computational times, and provide no guarantees on the quality of the solutions found. The only known techniques which provide guarantees on the quality of the solution are approximation algorithms [13, 14, 89], in which the solu- tion found is guaranteed to be within a certain factor of the cost of the opti- mal solution. An approximation algorithm for k-anonymity was proposed in [89], and it provides an O(k · logk) optimal solution. A number of techniques have also been proposed in [13, 14], which provide O(k)-approximations to the optimal cost k-anonymous solutions. In [100], a large improvement was proposed over these different methods. The technique in [100] proposes an O(log(k))-approximation algorithm. This is significantly better than compet- ing algorithms. Furthermore, the work in [100] also proposes a O(β · log(k)) approximation algorithm, where the parameter β can be gracefully adjusted based on running time constraints. Thus, this approach not only provides an approximation algorithm, but also gracefully explores the tradeoff between ac- curacy and running time. In many cases, associations between pseudo-identifiers and sensitive at- tributes can be protected by using multiple views, such that the pseudo- identifiers and sensitive attributes occur in different views of the table. Thus, only a small subset of the selected views may be made available. It may be possible to achieve k-anonymity because of the lossy nature of the join across the two views. In the event that the join is not lossy enough, it may result in a violation of k-anonymity. In [140], the problem of violation of k-anonymity using multiple views has been studied. It has been shown that the problem is NP-hard in general. It has been shown in [140] that a polynomial time algorithm is possible if functional dependencies exist between the different views. An interesting analysis of the safety of k-anonymization methods has been discussed in [73]. It tries to model the effectiveness of a k-anonymous rep- resentation, given that the attacker has some prior knowledge about the data such as a sample of the original data. Clearly, the more similar the sample data is to the true data, the greater the risk. The technique in [73] uses this fact to construct a model in which it calculates the expected number of items iden- tified. This kind of technique can be useful in situations where it is desirable 24 Privacy-Preserving Data Mining: Models and Algorithms to determine whether or not anonymization should be used as the technique of choice for a particular situation. 
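Before moving on to personalized privacy, the basic generalization mechanics discussed above can be illustrated with a small sketch: ages are coarsened into ranges, zip codes are truncated to a prefix, and the table is checked against Definition 2.2. This is a generic illustration of generalization and k-anonymity checking, not an implementation of any particular algorithm from the literature; the attribute names and generalization widths are invented for the example.

```python
from collections import Counter

def generalize(record, age_width=10, zip_prefix=3):
    """Coarsen the quasi-identifiers of a record (generalization step)."""
    lo = (record["age"] // age_width) * age_width
    return (
        f"{lo}-{lo + age_width - 1}",                               # age generalized to a range
        record["zip"][:zip_prefix] + "*" * (len(record["zip"]) - zip_prefix),
        record["sex"],                                              # kept as-is in this sketch
    )

def is_k_anonymous(records, k, **kwargs):
    """Every combination of generalized quasi-identifiers must occur at least k times."""
    counts = Counter(generalize(r, **kwargs) for r in records)
    return all(c >= k for c in counts.values())

table = [
    {"age": 34, "zip": "47906", "sex": "F"},
    {"age": 37, "zip": "47901", "sex": "F"},
    {"age": 36, "zip": "47904", "sex": "F"},
    {"age": 52, "zip": "60607", "sex": "M"},
    {"age": 55, "zip": "60601", "sex": "M"},
    {"age": 58, "zip": "60609", "sex": "M"},
]

print(is_k_anonymous(table, k=3))                       # True with these generalizations
print(is_k_anonymous(table, k=3, zip_prefix=5))         # False: zip codes left too precise
```

The second call shows the basic tradeoff: less generalization retains more precision but fails the anonymity requirement.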
2.3.2 Personalized Privacy-Preservation Not all individuals or entities are equally concerned about their privacy. For example, a corporation may have very different constraints on the privacy of its records as compared to an individual. This leads to the natural problem that we may wish to treat the records in a given data set very differently for anonymiza- tion purposes. From a technical point of view, this means that the value of k for anonymization is not fixed but may vary with the record. A condensation- based approach [9] has been proposed for privacy-preserving data mining in the presence of variable constraints on the privacy of the data records. This technique constructs groups of non-homogeneous size from the data, such that it is guaranteed that each record lies in a group whose size is at least equal to its anonymity level. Subsequently, pseudo-data is generated from each group so as to create a synthetic data set with the same aggregate distribution as the original data. Another interesting model of personalized anonymity is discussed in [132] in which a person can specify the level of privacy for his or her sensitive values. This technique assumes that an individual can specify a node of the domain generalization hierarchy in order to decide the level of anonymity that he can work with. This approach has the advantage that it allows for direct protection of the sensitive values of individuals than a vanilla k-anonymity method which is susceptible to different kinds of attacks. 2.3.3 Utility Based Privacy Preservation The process of privacy-preservation leads to loss of information for data mining purposes. This loss of information can also be considered a loss of utility for data mining purposes. Since some negative results [7] on the curse of dimensionality suggest that a lot of attributes may need to be suppressed in order to preserve anonymity, it is extremely important to do this carefully in order to preserve utility. We note that many anonymization methods [18, 50, 83, 126] use cost measures in order to measure the information loss from the anonymization process. examples of such utility measures include gener- alization height [18], size of anonymized group [83], discernability measures of attribute values [18], and privacy information loss ratio[126]. In addition, a number of metrics such as the classification metric [59] explicitly try to per- form the privacy-preservation in such a way so as to tailor the results with use for specific applications such as classification. The problem of utility-based privacy-preserving data mining was first stud- ied formally in [69]. The broad idea in [69] is to ameliorate the curse of A General Survey of Privacy-Preserving Data Mining Models and Algorithms 25 dimensionality by separately publishing marginal tables containing attributes which have utility, but are also problematic for privacy-preservation purposes. The generalizations performed on the marginal tables and the original tables in fact do not need to be the same. It has been shown that this broad approach can preserve considerable utility of the data set without violating privacy. A method for utility-based data mining using local recoding was proposed in [135]. The approach is based on the fact that different attributes have different utility from an application point of view. Most anonymization methods are global, in which a particular tuple value is mapped to the same generalized value globally. 
In local recoding, the data space is partitioned into a number of regions, and the mapping of the tuple to the generalizes value is local to that region. Clearly, this kind of approach has greater flexibility, since it can tailor the generalization process to a particular region of the data set. In [135], it has been shown that this method can perform quite effectively because of its local recoding strategy. Another indirect approach to utility based anonymization is to make the privacy-preservation algorithms more aware of the workload [77]. Typically, data recipients may request only a subset of the data in many cases, and the union of these different requested parts of the data set is referred to as the workload. Clearly, a workload in which some records are used more frequently than others tends to suggest a different anonymization than one which is based on the entire data set. In [77], an effective and efficient algorithm has been proposed for workload aware anonymization. Another direction for utility based privacy-preserving data mining is to anonymize the data in such a way that it remains useful for particular kinds of data mining or database applications. In such cases, the utility measure is often affected by the underlying application at hand. For example, in [50], a method has been proposed for k-anonymization using an information-loss metric as the utility measure. Such an approach is useful for the problem of classification. In [72], a method has been proposed for anonymization, so that the accuracy of the underlying queries is preserved. 2.3.4 Sequential Releases Privacy-preserving data mining poses unique problems for dynamic appli- cations such as data streams because in such cases, the data is released sequen- tially. In other cases, different views of the table may be released sequentially. Once a data block is released, it is no longer possible to go back and increase the level of generalization. On the other hand, new releases may sharpen an attacker’s view of the data and may make the overall data set more susceptible to attack. For example, when different views of the data are released sequen- tially, then one may use a join on the two releases [127] in order to sharpen the 26 Privacy-Preserving Data Mining: Models and Algorithms ability to distinguish particular records in the data. A technique discussed in [127] relies on lossy joins in order to cripple an attack based on global quasi- identifiers. The intuition behind this approach is that if the join is lossy enough, it will reduce the confidence of the attacker in relating the release from previ- ous views to the current release. Thus, the inability to link successive releases is key in preventing further discovery of the identity of records. While the work in [127] explores the issue of sequential releases from the point of view of adding additional attributes, the work in [134] discusses the same issue when records are added to or deleted from the original data. A new generalization principle called m-invariance is proposed, which effec- tively limits the risk of privacy-disclosure in re-publication. Another method for handling sequential updates to the data set is discussed in [101]. The broad idea in this approach is to progressively and consistently increase the gen- eralization granularity, so that the released data satisfies the k-anonymity re- quirement both with respect to the current table, as well as with respect to the previous releases. 
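Stepping back to the utility measures discussed in Section 2.3.3, the notion of information loss can be made concrete with a tiny sketch. The function below scores a generalized numeric attribute by the fraction of its domain covered by each generalized interval; this is a generic normalized-range loss used purely as an illustration, and is not identical to the specific metrics of [18, 83, 126].

```python
def interval_loss(intervals, domain):
    """Average fraction of the attribute domain spanned by each generalized interval.

    0.0 means no generalization (exact values were released), while 1.0 means the
    attribute was generalized to the full domain (equivalent to suppression).
    """
    lo, hi = domain
    width = hi - lo
    return sum((b - a) / width for a, b in intervals) / len(intervals)

# Hypothetical generalizations of an age attribute with domain [0, 100].
coarse = [(30, 40), (30, 40), (50, 60), (50, 60)]    # decade-wide ranges
fine   = [(34, 36), (36, 38), (52, 54), (56, 58)]    # much narrower ranges

print("loss (coarse):", interval_loss(coarse, (0, 100)))   # 0.1
print("loss (fine):  ", interval_loss(fine,   (0, 100)))   # 0.02
```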
2.3.5 The l-diversity Method The k-anonymity is an attractive technique because of the simplicity of the definition and the numerous algorithms available to perform the anonymiza- tion. Nevertheless the technique is susceptible to many kinds of attacks espe- cially when background knowledge is available to the attacker. Some kinds of such attacks are as follows: Homogeneity Attack: In this attack, all the values for a sensitive at- tribute within a group of k records are the same. Therefore, even though the data is k-anonymized, the value of the sensitive attribute for that group of k records can be predicted exactly. Background Knowledge Attack: In this attack, the adversary can use an association between one or more quasi-identifier attributes with the sensitive attribute in order to narrow down possible values of the sensi- tive field further. An example given in [83] is one in which background knowledge of low incidence of heart attacks among Japanese could be used to narrow down information for the sensitive field of what disease a patient might have. A detailed discussion of the effects of background knowledge on privacy may be found in [88]. Clearly, while k-anonymity is effective in preventing identification of a record, it may not always be effective in preventing inference of the sensitive val- ues of the attributes of that record. Therefore, the technique of l-diversity was proposed which not only maintains the minimum group size of k, but also A General Survey of Privacy-Preserving Data Mining Models and Algorithms 27 focusses on maintaining the diversity of the sensitive attributes. Therefore, the l-diversity model [83] for privacy is defined as follows: Definition 2.3 Let a q∗-block be a set of tuples such that its non-sensitive values generalize to q∗.Aq∗-block is l-diverse if it contains l “well repre- sented” values for the sensitive attribute S. A table is l-diverse, if every q∗- block in it is l-diverse. A number of different instantiations for the l-diversity definition are discussed in [83]. We note that when there are multiple sensitive attributes, then the l- diversity problem becomes especially challenging because of the curse of di- mensionality. Methods have been proposed in [83] for constructing l-diverse tables from the data set, though the technique remains susceptible to the curse of dimensionality [7]. Other methods for creating l-diverse tables are discussed in [133], in which a simple and efficient method for constructing the l-diverse representation is proposed. 2.3.6 The t-closeness Model The t-closeness model is a further enhancement on the concept of l-diversity. One characteristic of the l-diversity model is that it treats all values of a given attribute in a similar way irrespective of its distribution in the data. This is rarely the case for real data sets, since the attribute values may be very skewed. This may make it more difficult to create feasible l-diverse representations. Often, an adversary may use background knowledge of the global distribution in order to make inferences about sensitive values in the data. Furthermore, not all values of an attribute are equally sensitive. For example, an attribute corre- sponding to a disease may be more sensitive when the value is positive, rather than when it is negative. 
In [79], a t-closeness model was proposed which uses the property that the distance between the distribution of the sensitive attribute within an anonymized group should not be different from the global distribution by more than a threshold t. The Earth Mover distance metric is used in order to quantify the distance between the two distributions. Further- more, the t-closeness approach tends to be more effective than many other privacy-preserving data mining methods for the case of numeric attributes. 2.3.7 Models for Text, Binary and String Data Most of the work on privacy-preserving data mining is focussed on numer- ical or categorical data. However, specific data domains such as strings, text, or market basket data may share specific properties with some of these general data domains, but may be different enough to require their own set of tech- niques for privacy-preservation. Some examples are as follows: 28 Privacy-Preserving Data Mining: Models and Algorithms Text and Market Basket Data: While these can be considered a case of text and market basket data, they are typically too high dimensional to work effectively with standard k-anonymization techniques. However, these kinds of data sets have the special property that they are extremely sparse. The sparsity property implies that only a few of the attributes are non-zero, and most of the attributes take on zero values. In [11], tech- niques have been proposed to construct anonymization methods which take advantage of this sparsity. In particular sketch based methods have been used to construct anonymized representations of the data. Varia- tions are proposed to construct anonymizations which may be used at data collection time. String Data: String Data is considered challenging because of the vari- ations in the lengths of strings across different records. Typically meth- ods for k-anonymity are attribute specific, and therefore constructions of anonymizations for variable length records are quite difficult. In [12], a condensation based method has been proposed for anonymization of string data. This technique creates clusters from the different strings, and then generates synthetic data which has the same aggregate properties as the individual clusters. Since each cluster contains at least k-records, the anonymized data is guaranteed to at least satisfy the definitions of k-anonymity. 2.4 Distributed Privacy-Preserving Data Mining The key goal in most distributed methods for privacy-preserving data min- ing is to allow computation of useful aggregate statistics over the entire data set without compromising the privacy of the individual data sets within the dif- ferent participants. Thus, the participants may wish to collaborate in obtaining aggregate results, but may not fully trust each other in terms of the distribution of their own data sets. For this purpose, the data sets may either be horizontally partitioned or be vertically partitioned. In horizontally partitioned data sets, the individual records are spread out across multiple entities, each of which have the same set of attributes. In vertical partitioning, the individual entities may have different attributes (or views) of the same set of records. Both kinds of partitioning pose different challenges to the problem of distributed privacy- preserving data mining. The problem of distributed privacy-preserving data mining overlaps closely with a field in cryptography for determining secure multi-party computations. 
A broad overview of the intersection between the fields of cryptography and privacy-preserving data mining may be found in [102]. The broad approach to cryptographic methods tends to compute functions over inputs provided by multiple recipients without actually sharing the inputs with one another. For A General Survey of Privacy-Preserving Data Mining Models and Algorithms 29 example, in a 2-party setting, Alice and Bob may have two inputs x and y respectively, and may wish to both compute the function f(x, y) without re- vealing x or y to each other. This problem can also be generalized across k parties by designing the k argument function h(x1 ...xk). Many data mining algorithms may be viewed in the context of repetitive computations of many such primitive functions such as the scalar dot product, secure sum etc. In order to compute the function f(x, y) or h(x1 ...,xk),aprotocol will have to de- signed for exchanging information in such a way that the function is computed without compromising privacy. We note that the robustness of the protocol de- pends upon the level of trust one is willing to place on the two participants Alice and Bob. This is because the protocol may be subjected to various kinds of adversarial behavior: Semi-honest Adversaries: In this case, the participants Alice and Bob are curious and attempt to learn from the information received by them during the protocol, but do not deviate from the protocol themselves. In many situations, this may be considered a realistic model of adversarial behavior. Malicious Adversaries: In this case, Alice and Bob may vary from the protocol, and may send sophisticated inputs to one another to learn from the information received from each other. A key building-block for many kinds of secure function evaluations is the 1 out of 2 oblivious-transfer protocol. This protocol was proposed in [45, 105] and involves two parties: a sender,andareceiver. The sender’s input is a pair (x0,x1), and the receiver’s input is a bit value σ ∈{0, 1}. At the end of the process, the receiver learns xσ only, and the sender learns nothing. A number of simple solutions can be designed for this task. In one solution [45, 53], the receiver generates two random public keys, K0 and K1, but the receiver knows only the decryption key for Kσ. The receiver sends these keys to the sender, who encrypts x0 with K0, x1 with K1, and sends the encrypted data back to the receiver. At this point, the receiver can only decrypt xσ, since this is the only input for which they have the decryption key. We note that this is a semi- honest solution, since the intermediate steps require an assumption of trust. For example, it is assumed that when the receiver sends two keys to the sender, they indeed know the decryption key to only one of them. In order to deal with the case of malicious adversaries, one must ensure that the sender chooses the public keys according to the protocol. An efficient method for doing so is described in [94]. In [94], generalizations of the 1 out of 2 oblivious transfer protocol to the 1 out N case and k out of N case are described. Since the oblivious transfer protocol is used as a building block for secure multi-party computation, it may be repeated many times over a given function 30 Privacy-Preserving Data Mining: Models and Algorithms evaluation. Therefore, the computational effectiveness of the approach is im- portant. Efficient methods for both semi-honest and malicious adversaries are discussed in [94]. 
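The message flow of this 1 out of 2 protocol can be sketched with a toy textbook-RSA implementation, shown below. The tiny hard-coded primes are completely insecure, and the fact that the receiver simply discards one private key is an illustrative shortcut that models the semi-honest assumption described above; this is a sketch of the message flow, not a real cryptographic construction.

```python
# Toy 1-out-of-2 oblivious transfer: the sender holds (x0, x1); the receiver holds sigma
# and should learn only x_sigma, while the sender learns nothing about sigma.
# Textbook RSA with tiny primes -- purely to show the message flow, NOT secure.

def make_keypair(p, q, e=17):
    n = p * q
    d = pow(e, -1, (p - 1) * (q - 1))        # modular inverse (Python 3.8+)
    return (n, e), (n, d)                     # (public key, private key)

def encrypt(pub, m):
    n, e = pub
    return pow(m, e, n)

def decrypt(priv, c):
    n, d = priv
    return pow(c, d, n)

# --- Receiver ---------------------------------------------------------------
sigma = 1                                     # the index the receiver wants
pub_a, priv_a = make_keypair(1009, 1013)      # key pair whose private key is kept
pub_b, _unused = make_keypair(1031, 1033)     # private key discarded: receiver cannot decrypt under pub_b
# Arrange the keys so that the key the receiver can open sits at position sigma.
keys = [pub_b, pub_a] if sigma == 1 else [pub_a, pub_b]
# The receiver sends `keys` to the sender (sigma itself is never sent).

# --- Sender -----------------------------------------------------------------
x0, x1 = 42, 99                               # the sender's two inputs
c0 = encrypt(keys[0], x0)
c1 = encrypt(keys[1], x1)
# The sender returns (c0, c1); it cannot tell which key the receiver can open.

# --- Receiver ---------------------------------------------------------------
received = decrypt(priv_a, c1 if sigma == 1 else c0)
print("receiver learns:", received)           # 99 when sigma == 1
```

As noted above, a malicious receiver could keep both private keys, which is exactly the gap that the constructions of [94] close.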
More complex problems in this domain include the com- putation of probabilistic functions over a number of multi-party inputs [137]. Such powerful techniques can be used in order to abstract out the primitives from a number of computationally intensive data mining problems. Many of the above techniques have been described for the 2-party case, though generic solutions also exist for the multiparty case. Some important solutions for the multiparty case may be found in [25]. The oblivious transfer protocol can be used in order to compute several data mining primitives related to vector distances in multi-dimensional space. A classic problem which is often used as a primitive for many other problems is that of computing the scalar dot-product in a distributed environment [58]. A fairly general set of methods in this direction are described in [39]. Many of these techniques work by sending changed or encrypted versions of the inputs to one another in order to compute the function with the different alternative versions followed by an oblivious transfer protocol to retrieve the correct value of the final output. A systematic framework is described in [39] to transform normal data mining problems to secure multi-party computation problems. The problems discussed in [39] include those of clustering, classification, associ- ation rule mining, data summarization, and generalization. A second set of methods for distributed privacy-preserving data mining is discussed in [32] in which the secure multi-party computation of a number of important data min- ing primitives is discussed. These methods include the secure sum, the secure set union, the secure size of set intersection and the scalar product. These tech- niques can be used as data mining primitives for secure multi-party computa- tion over a variety of horizontally and vertically partitioned data sets. Next, we will discuss algorithms for secure multi-party computation over horizontally partitioned data sets. 2.4.1 Distributed Algorithms over Horizontally Partitioned Data Sets In horizontally partitioned data sets, different sites contain different sets of records with the same (or highly overlapping) set of attributes which are used for mining purposes. Many of these techniques use specialized versions of the general methods discussed in [32, 39] for various problems. The work in [80] discusses the construction of a popular decision tree induction method called ID3 with the use of approximations of the best splitting attributes. Subsequently, a variety of classifiers have been generalized to the problem of horizontally-partitioned privacy preserving mining including the Naive Bayes Classifier [65], and the SVM Classifier with nonlinear kernels [141]. A General Survey of Privacy-Preserving Data Mining Models and Algorithms 31 An extreme solution for the horizontally partitioned case is discussed in [139], in which privacy-preserving classification is performed in a fully distributed setting, where each customer has private access to only their own record. A host of other data mining applications have been generalized to the problem of horizontally partitioned data sets. These include the applications of asso- ciation rule mining [64], clustering [57, 62, 63] and collaborative filtering [104]. Methods for cooperative statistical analysis using secure multi-party computation methods are discussed in [40, 41]. A related problem is that of information retrieval and document indexing in a network of content providers. 
This problem arises in the context of multi- ple providers which may need to cooperate with one another in sharing their content, but may essentially be business competitors. In [17], it has been dis- cussed how an adversary may use the output of search engines and content providers in order to reconstruct the documents. Therefore, the level of trust required grows with the number of content providers. A solution to this prob- lem [17] constructs a centralized privacy-preserving index in conjunction with a distributed access control mechanism. The privacy-preserving index main- tains strong privacy guarantees even in the face of colluding adversaries, and even if the entire index is made public. 2.4.2 Distributed Algorithms over Vertically Partitioned Data For the vertically partitioned case, many primitive operations such as com- puting the scalar product or the secure set size intersection can be useful in computing the results of data mining algorithms. For example, the methods in [58] discuss how to use to scalar dot product computation for frequent itemset counting. The process of counting can also be achieved by using the secure size of set intersection as described in [32]. Another method for association rule mining discussed in [119] uses the secure scalar product over the vertical bit representation of itemset inclusion in transactions, in order to compute the frequency of the corresponding itemsets. This key step is applied repeatedly within the framework of a roll up procedure of itemset counting. It has been shown in [119] that this approach is quite effective in practice. The approach of vertically partitioned mining has been extended to a variety of data mining applications such as decision trees [122], SVM Classification [142], Naive Bayes Classifier [121], and k-means clustering [120]. A num- ber of theoretical results on the ability to learn different kinds of functions in vertically partitioned databases with the use of cryptographic approaches are discussed in [42]. 32 Privacy-Preserving Data Mining: Models and Algorithms 2.4.3 Distributed Algorithms for k-Anonymity In many cases, it is important to maintain k-anonymity across different dis- tributed parties. In [60], a k-anonymous protocol for data which is vertically partitioned across two parties is described. The broad idea is for the two parties to agree on the quasi-identifier to generalize to the same value before release. A similar approach is discussed in [128], in which the two parties agree on how the generalization is to be performed before release. In [144], an approach has been discussed for the case of horizontally par- titioned data. The work in [144] discusses an extreme case in which each site is a customer which owns exactly one tuple from the data. It is assumed that the data record has both sensitive attributes and quasi-identifier attributes. The solution uses encryption on the sensitive attributes. The sensitive values can be decrypted only if therefore are at least k records with the same values on the quasi-identifiers. Thus, k-anonymity is maintained. The issue of k-anonymity is also important in the context of hiding iden- tification in the context of distributed location based services [20, 52]. In this case, k-anonymity of the user-identity is maintained even when the location in- formation is released. Such location information is often released when a user may send a message at any point from a given location. 
A similar issue arises in the context of communication protocols in which the anonymity of senders (or receivers) may need to be protected. A message is said to be sender k-anonymous, if it is guaranteed that an attacker can at most narrow down the identity of the sender to k individuals. Similarly, a message is said to be receiver k-anonymous, if it is guaranteed that an attacker can at most narrow down the identity of the receiver to k individuals. A number of such techniques have been discussed in [56, 135, 138]. 2.5 Privacy-Preservation of Application Results In many cases, the output of applications can be used by an adversary in or- der to make significant inferences about the behavior of the underlying data. In this section, we will discuss a number of miscellaneous methods for privacy- preserving data mining which tend to preserve the privacy of the end results of applications such as association rule mining and query processing. This prob- lem is related to that of disclosure control [1] in statistical databases, though advances in data mining methods provide increasingly sophisticated methods for adversaries to make inferences about the behavior of the underlying data. In cases, where the commercial data needs to be shared, the association rules may represent sensitive information for target-marketing purposes, which needs to be protected from inference. In this section, we will discuss the issue of disclosure control for a num- ber of applications such as association rule mining, classification, and query A General Survey of Privacy-Preserving Data Mining Models and Algorithms 33 processing. The key goal here is to prevent adversaries from making infer- ences from the end results of data mining and management applications. A broad discussion of the security and privacy implications of data mining are presented in [33]. We will discuss each of the applications below: 2.5.1 Association Rule Hiding Recent years have seen tremendous advances in the ability to perform asso- ciation rule mining effectively. Such rules often encode important target mar- keting information about a business. Some of the earliest work on the chal- lenges of association rule mining for database security may be found in [16]. Two broad approaches are used for association rule hiding: Distortion: In distortion [99], the entry for a given transaction is mod- ified to a different value. Since, we are typically dealing with binary transactional data sets, the entry value is flipped. Blocking: In blocking [108], the entry is not modified, but is left in- complete. Thus, unknown entry values are used to prevent discovery of association rules. We note that both the distortion and blocking processes have a number of side effects on the non-sensitive rules in the data. Some of the non-sensitive rules may be lost along with sensitive rules, and new ghost rules may be created because of the distortion or blocking process. Such side effects are undesirable since they reduce the utility of the data for mining purposes. A formal proof of the NP-hardness of the distortion method for hiding as- sociation rule mining may be found in [16]. In [16], techniques are proposed for changing some of the 1-values to 0-values so that the support of the corre- sponding sensitive rules is appropriately lowered. The utility of the approach was defined by the number of non-sensitive rules whose support was also low- ered by using such an approach. 
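A minimal sketch of this style of distortion is shown below: to hide a sensitive itemset, 1-values are flipped to 0 in transactions containing it until its support falls below the mining threshold, and the collateral effect on a non-sensitive itemset is reported. The greedy choice of transactions and the tiny transaction database are illustrative simplifications, not the algorithm of [16].

```python
# Transactions over items A, B, C represented as sets; the sensitive itemset is {A, B}.
transactions = [
    {"A", "B", "C"}, {"A", "B"}, {"A", "B", "C"}, {"A", "C"},
    {"B", "C"}, {"A", "B"}, {"A", "B", "C"}, {"C"},
]

def support(itemset, db):
    return sum(1 for t in db if itemset <= t)

def hide_itemset(db, sensitive, min_support):
    """Flip 1-values to 0 (remove one item) in supporting transactions
    until the sensitive itemset drops below the mining threshold."""
    db = [set(t) for t in db]                     # work on a copy of the database
    for t in db:
        if support(sensitive, db) < min_support:
            break
        if sensitive <= t:
            t.discard(sorted(sensitive)[0])       # distortion: a 1-entry becomes 0
    return db

sensitive = {"A", "B"}
non_sensitive = {"A", "C"}
sanitized = hide_itemset(transactions, sensitive, min_support=3)

print("support of {A,B}:", support(sensitive, transactions), "->", support(sensitive, sanitized))
print("support of {A,C}:", support(non_sensitive, transactions), "->", support(non_sensitive, sanitized))
```

The drop in the support of the non-sensitive itemset {A, C} illustrates the kind of side effect on non-sensitive rules that the surrounding discussion refers to.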
This approach was extended in [34] in which both support and confidence of the appropriate rules could be lowered. In this case, 0-values in the transactional database could also change to 1-values. In many cases, this resulted in spurious association rules (or ghost rules) which was an undesirable side effect of the process. A complete description of the various methods for data distortion for association rule hiding may be found in [124]. Another interesting piece of work which balances privacy and disclosure concerns of sanitized rules may be found in [99]. The broad idea of blocking was proposed in [23]. The attractiveness of the blocking approach is that it maintains the truthfulness of the underlying data, since it replaces a value with an unknown (often represented by ‘?’) rather than a false value. Some interesting algorithms for using blocking for associa- tion rule hiding are presented in [109]. The work has been further extended in 34 Privacy-Preserving Data Mining: Models and Algorithms [108] with a discussion of the effectiveness of reconstructing the hidden rules. Another interesting set of techniques for association rule hiding with limited side effects is discussed in [131]. The objective of this method is to reduce the loss of non-sensitive rules, or the creation of ghost rules during the rule hiding process. In [6], it has been discussed how blocking techniques for hiding association rules can be used to prevent discovery of sensitive entries in the data set by an adversary. In this case, certain entries in the data are classified as sensitive, and only rules which disclose such entries are hidden. An efficient depth-first association mining algorithm is proposed for this task [6]. It has been shown that the methods can effectively reduce the disclosure of sensitive entries with the use of such a hiding process. 2.5.2 Downgrading Classifier Effectiveness An important privacy-sensitive application is that of classification, in which the results of a classification application may be sensitive information for the owner of a data set. Therefore the issue is to modify the data in such a way that the accuracy of the classification process is reduced, while retaining the utility of the data for other kinds of applications. A number of techniques have been discussed in [24, 92] in reducing the classifier effectiveness in context of classification rule and decision tree applications. The notion of parsimonious downgrading is proposed [24] in the context of blocking out inference chan- nels for classification purposes while mining the effect to the overall utility. A system called Rational Downgrader [92] was designed with the use of these principles. The methods for association rule hiding can also be generalized to rule based classifiers. This is because rule based classifiers often use association rule min- ing methods as subroutines, so that the rules with the class labels in their con- sequent are used for classification purposes. For a classifier downgrading ap- proach, such rules are sensitive rules, whereas all other rules (with non-class attributes in the consequent) are non-sensitive rules. An example of a method for rule based classifier downgradation is discussed in [95] in which it has been shown how to effectively hide classification rules for a data set. 2.5.3 Query Auditing and Inference Control Many sensitive databases are not available for public access, but may have a public interface through which aggregate querying is allowed. 
This leads to the natural danger that a smart adversary may pose a sequence of queries through which he or she may infer sensitive facts about the data. The nature of this inference may correspond to full disclosure, in which an adversary may determine the exact values of the data attributes. A second notion is that of A General Survey of Privacy-Preserving Data Mining Models and Algorithms 35 partial disclosure in which the adversary may be able to narrow down the values to a range, but may not be able to guess the exact value. Most work on query auditing generally concentrates on the full disclosure setting. Two broad approaches are designed in order to reduce the likelihood of sen- sitive data discovery: Query Auditing: In query auditing, we deny one or more queries from a sequence of queries. The queries to be denied are chosen such that the sensitivity of the underlying data is preserved. Some examples of query auditing methods include [37, 68, 93, 106]. Query Inference Control: In this case, we perturb the underlying data or the query result itself. The perturbation is engineered in such a way, so as to preserve the privacy of the underlying data. Examples of meth- ods which use perturbation of the underlying data include [3, 26, 90]. Examples of methods which perturb the query result include [22, 36, 42–44]. An overview of classical methods for query auding may be found in [1]. The query auditing problem has an online version, in which we do not know the se- quence of queries in advance, and an offline version, in which we do know this sequence in advance. Clearly, the offline version is open to better optimization from an auditing point of view. The problem of query auditing was first studied in [37, 106]. This approach works for the online version of the query auditing problem. In these works, the sum query is studied, and privacy is protected by using restrictions on sizes and pairwise overlaps of the allowable queries. Let us assume that the query size is restricted to be at most k, and the number of common elements in pairwise query sets is at most m. Then, if q be the number of elements that the attacker already knows from background knowledge, it was shown that [37, 106] that the maximum number of queries allowed is (2 · k − (q +1))/m. We note that if N be the total number of data elements, the above expression is always bounded above by 2·N. If for some constant c, we choose k = N/c and m =1, the approach can only support a constant number of queries, after which all queries would have to be denied by the auditor. Clearly, this is undesirable from an application point of view. Therefore, a considerable amount of research has been devoted to increasing the number of queries which can be answered by the auditor without compromising privacy. In [67], the problem of sum auditing on sub-cubes of the data cube are stud- ied, where a query expression is constructed using a string of 0, 1, and *. The elements to be summed up are determined by using matches to the query string pattern. In [71], the problem of auditing a database of boolean values is studied for the case of sum and max queries. In [21], and approach for query auditing 36 Privacy-Preserving Data Mining: Models and Algorithms is discussed which is actually a combination of the approach of denying some queries and modifying queries in order to achieve privacy. In [68], the authors show that denials to queries depending upon the answer to the current query can leak information. 
In [68], the authors show that denials to queries depending upon the answer to the current query can leak information. The authors introduce the notion of simulatable auditing for auditing sum and max queries. In [93], the authors devise methods for auditing max queries and bags of max and min queries under the partial and full disclosure settings. The authors also examine the notion of utility in the context of auditing, and obtain results for sum queries in the full disclosure setting.

A number of techniques have also been proposed for the offline version of the auditing problem. In [29], a number of variations of the offline auditing problem have been studied. In the offline auditing problem, we are given a sequence of queries which have been truthfully answered, and we need to determine if privacy has been breached. In [29], effective algorithms were proposed for the sum, max, and max and min versions of the problems. On the other hand, the sum and max version of the problem was shown to be NP-hard. In [4], an offline auditing framework was proposed for determining whether a database adheres to its disclosure properties. The key idea is to create an audit expression which specifies sensitive table entries.

A number of techniques have also been proposed for sanitizing or randomizing the data for query auditing purposes. These are fairly general models of privacy, since they preserve the privacy of the data even when the entire database is available. The standard methods for perturbation [2, 5] or k-anonymity [110] can always be used, and it is always guaranteed that an adversary may not derive anything more from the queries than they can from the base data. Thus, since a k-anonymity model guarantees a certain level of privacy even when the entire database is made available, it will continue to do so under any sequence of queries. In [26], a number of interesting methods are discussed for measuring the effectiveness of sanitization schemes in terms of balancing privacy and utility.

Instead of sanitizing the base data, it is possible to use summary constructs on the data, and respond to queries using only the information encoded in the summary constructs. Such an approach preserves privacy, as long as the summary constructs do not reveal sensitive information about the underlying records. A histogram based approach to data sanitization has been discussed in [26, 27]. In this technique the data is recursively partitioned into multi-dimensional cells. The final output is the exact description of the cuts along with the population of each cell. Clearly, this kind of description can be used for approximate query answering with the use of standard histogram query processing methods. In [55], a method has been proposed for privacy-preserving indexing of multi-dimensional data by using bucketizing of the underlying attribute values in conjunction with encryption of identification keys. We note that a choice of larger bucket sizes provides greater privacy but less accuracy. Similarly, optimizing the bucket sizes for accuracy can lead to reductions in privacy. This tradeoff has been studied in [55], and it has been shown that reasonable query precision can be maintained at the expense of partial disclosure.
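A minimal sketch of the bucketization idea in the spirit of [55] follows. The bucket boundaries and records are made up, and a truncated hash stands in for real encryption of the identification keys; the point is only to show why larger buckets trade query precision for privacy.

import bisect
import hashlib

boundaries = [20, 40, 60, 80]             # bucket edges for an "age" attribute

def bucket_of(value):
    return bisect.bisect_right(boundaries, value)

def publish(records):                     # records: list of (key, age) pairs
    return [(hashlib.sha256(str(key).encode()).hexdigest()[:8], bucket_of(age))
            for key, age in records]

index = publish([(101, 34), (102, 67), (103, 38), (104, 81)])

# A range query [30, 50) is mapped to the buckets it touches; the answer is a
# superset of the true matches, so precision degrades as buckets grow.
wanted = {bucket_of(30), bucket_of(49)}
print([k for k, b in index if b in wanted])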
In the class of methods which use summarization structures for inference control, an interesting method was proposed by Mishra and Sandler in [90], which uses pseudo-random sketches for privacy preservation. In this technique, sketches are constructed from the data, and the sketch representations are used to respond to user queries. In [90], it has been shown that the scheme preserves privacy effectively, while continuing to be useful from a utility point of view.

Finally, an important class of query inference control methods changes the results of queries in order to preserve privacy. A classical method for aggregate queries such as the sum or relative frequency is that of random sampling [35]. In this technique, a random sample of the data is used to compute such aggregate functions. The random sampling approach makes it impossible for the questioner to precisely control the formation of query sets. The advantage of using a random sample is that the results of large queries are quite robust (in terms of relative error), but the privacy of individual records is preserved because of high absolute error.

Another method for query inference control is to add noise to the results of queries. Clearly, the noise should be sufficient that an adversary cannot use small changes in the query arguments in order to infer facts about the base data. In [44], an interesting technique has been presented in which the result of a query is perturbed by an amount which depends upon the underlying sensitivity of the query function. This sensitivity of the query function is defined approximately by the change in the response to the query upon changing one argument to the function. An important theoretical result [22, 36, 42, 43] shows that a surprisingly small amount of noise needs to be added to the result of a query, provided that the number of queries is sublinear in the number of database rows. With the increasing sizes of databases today, this result provides fairly strong guarantees on privacy. Such queries together with their slightly noisy responses are referred to as the SuLQ primitive.

2.6 Limitations of Privacy: The Curse of Dimensionality

Many privacy-preserving data-mining methods are inherently limited by the curse of dimensionality in the presence of public information. For example, the technique in [7] analyzes the k-anonymity method in the presence of increasing dimensionality. The curse of dimensionality becomes especially important when adversaries may have considerable background information, as a result of which the boundary between pseudo-identifiers and sensitive attributes may become blurred. This is generally true, since adversaries may be familiar with the subject of interest and may have greater information about them than what is publicly available. This is also the motivation for techniques such as l-diversity [83], in which background knowledge can be used to make further privacy attacks. The work in [7] concludes that in order to maintain privacy, a large number of the attributes may need to be suppressed. Thus, the data loses its utility for the purpose of data mining algorithms. The broad intuition behind the result in [7] is that when attributes are generalized into wide ranges, the combination of a large number of generalized attributes is so sparsely populated that even 2-anonymity becomes increasingly unlikely. While the method of l-diversity has not been formally analyzed, some observations made in [83] seem to suggest that the method becomes increasingly infeasible to implement effectively with increasing dimensionality.
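The sparsity intuition can be made concrete with a small back-of-the-envelope calculation. The numbers below are illustrative and assume, for simplicity, uniformly distributed attributes; they are not taken from [7].

# Even after generalizing each attribute into a few wide ranges, the number of
# distinct attribute combinations quickly dwarfs the table size, so most
# records end up alone in their cell and 2-anonymity fails.
n_records = 1_000_000
ranges_per_attribute = 4                 # each attribute generalized to 4 bins

for d in (5, 10, 20):                    # number of quasi-identifier attributes
    cells = ranges_per_attribute ** d
    print(d, cells, n_records / cells)   # expected records per cell
# d = 5:  1024 cells,  ~977 records per cell (anonymity is easy)
# d = 20: ~1.1e12 cells, ~1e-6 records per cell (almost every record is unique)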
The method of randomization has also been analyzed in [10]. This paper makes a first analysis of the ability to re-identify data records with the use of maximum likelihood estimates. Consider a d-dimensional record X = (x1 ... xd), which is perturbed to Z = (z1 ... zd). For a given public record W = (w1 ... wd), we would like to find the probability that it could have been perturbed to Z using the perturbing distribution fY(y). If this were true, then the set of values given by (Z − W) = (z1 − w1 ... zd − wd) should all be drawn from the distribution fY(y). The corresponding log-likelihood fit is given by −∑_{i=1}^{d} log fY(zi − wi). The higher the log-likelihood fit, the greater the probability that the record W corresponds to X. In order to achieve greater anonymity, we would like the perturbations to be large enough, so that some of the spurious records in the data have a greater log-likelihood fit to Z than the true record X. It has been shown in [10] that this probability reduces rapidly with increasing dimensionality for different kinds of perturbing distributions. Thus, the randomization technique also seems to be susceptible to the curse of high dimensionality.

We note that the problem of high dimensionality seems to be a fundamental one for privacy preservation, and it is unlikely that more effective methods can be found in order to preserve privacy when background information about a large number of features is available to even a subset of selected individuals. Indirect examples of such violations occur with the use of trail identifications [84, 85], where information from multiple sources can be compiled to create a high dimensional feature representation which violates privacy.
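To make the likelihood-based re-identification measure of [10] concrete, the sketch below computes the log-likelihood fit of candidate public records to a perturbed record, assuming Gaussian perturbing noise. The candidate records, noise level, and the use of numpy/scipy are illustrative assumptions, not part of the original analysis.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
sigma = 0.5                                   # std. dev. of the perturbing noise
X = np.array([1.0, 2.0, 3.0, 4.0])            # true record
Z = X + rng.normal(0, sigma, size=X.shape)    # perturbed (published) record

def log_likelihood_fit(W, Z, sigma):
    # log-likelihood that W was perturbed into Z under fY = N(0, sigma^2)
    return norm.logpdf(Z - W, scale=sigma).sum()

candidates = {"true": X, "other": np.array([2.0, 1.0, 4.0, 2.5])}
for name, W in candidates.items():
    print(name, round(log_likelihood_fit(W, Z, sigma), 2))
# With few dimensions and large noise, spurious candidates can score close to
# the true record; as d grows the true record separates from the rest, which
# is the basis of the curse-of-dimensionality argument in [10].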
2.7 Applications of Privacy-Preserving Data Mining

The problem of privacy-preserving data mining has numerous applications in homeland security, medical database mining, and customer transaction analysis. Some of these applications, such as those involving bio-terrorism and medical database mining, may intersect in scope. In this section, we will discuss a number of different applications of privacy-preserving data mining methods.

2.7.1 Medical Databases: The Scrub and Datafly Systems

The Scrub system [118] was designed for de-identification of clinical notes and letters, which typically occur in the form of textual data. Clinical notes and letters are typically in the form of text which contains references to patients, family members, addresses, phone numbers or providers. Traditional techniques simply use a global search and replace procedure in order to provide privacy. However, clinical notes often contain cryptic references in the form of abbreviations which may only be understood either by other providers or members of the same institution. Therefore, traditional methods can identify no more than 30-60% of the identifying information in the data [118]. The Scrub system uses numerous detection algorithms which compete in parallel to determine when a block of text corresponds to a name, address or a phone number. The Scrub system uses local knowledge sources which compete with one another based on the certainty of their findings. It has been shown in [118] that such a system is able to remove more than 99% of the identifying information from the data.

The Datafly system [117] was one of the earliest practical applications of privacy-preserving transformations. This system was designed to prevent identification of the subjects of medical records which may be stored in multi-dimensional format. The multi-dimensional information may include directly identifying information such as the social security number, or indirectly identifying information such as age, sex or zip-code. The system was designed in response to the concern that the process of removing only directly identifying attributes such as social security numbers was not sufficient to guarantee privacy. While the work has a similar motive as the k-anonymity approach of preventing record identification, it does not formally use a k-anonymity model in order to prevent identification through linkage attacks. The approach works by setting a minimum bin size for each field. The anonymity level is defined in Datafly with respect to this bin size. The values in the records are thus generalized to the ambiguity level of a bin size as opposed to exact values. Directly identifying attributes such as the social security number, name, or zip-code are removed from the data. Furthermore, outlier values are suppressed from the data in order to prevent identification. Typically, the user of Datafly will set the anonymity level depending upon the profile of the data recipient in question. The overall anonymity level is defined between 0 and 1, and determines the minimum bin size for each field. An anonymity level of 0 results in Datafly providing the original data, whereas an anonymity level of 1 results in the maximum level of generalization of the underlying data. Thus, these two values provide the two extremes of trust and distrust. We note that these values are set depending upon the recipient of the data. When the records are released to the public, it is desirable to set a higher level of anonymity in order to ensure the maximum amount of protection. The generalizations in the Datafly system are typically done independently at the individual attribute level, since the bins are defined independently for different attributes. The Datafly system is one of the earliest systems for anonymization, and is quite simple in its approach to anonymization. A lot of work in the anonymity field has been done since the creation of the Datafly system, and there is considerable scope for enhancement of the Datafly system with the use of these models.
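The sketch below illustrates the minimum-bin-size idea in the style of Datafly for a single quasi-identifier: the attribute is generalized independently until every value bin contains at least a required number of records, and residual outlier bins are suppressed. The generalization hierarchy, the data, and the choice of bin size are made up for illustration and are not taken from [117].

from collections import Counter

def generalize_zip(z, level):           # drop trailing digits: 60607 -> 606**
    return z[: max(0, 5 - level)] + "*" * min(5, level)

def datafly_column(zips, bin_size):
    level = 0
    while level <= 5:
        counts = Counter(generalize_zip(z, level) for z in zips)
        if all(c >= bin_size for c in counts.values()):
            break
        level += 1
    generalized = [generalize_zip(z, level) for z in zips]
    counts = Counter(generalized)
    # suppress any remaining outlier bins
    return [g if counts[g] >= bin_size else "*****" for g in generalized]

print(datafly_column(["60607", "60601", "60614", "10532", "10533"], bin_size=2))
# ['606**', '606**', '606**', '105**', '105**']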
2.7.2 Bioterrorism Applications

In typical bioterrorism applications, we would like to analyze medical data for privacy-preserving data mining purposes. Often a biological agent such as anthrax produces symptoms which are similar to those of other common respiratory diseases such as the cough, cold and flu. In the absence of prior knowledge of such an attack, health care providers may diagnose a patient affected by an anthrax attack as having symptoms of one of the more common respiratory diseases. The key is to quickly distinguish a true anthrax attack from a normal outbreak of a common respiratory disease. In many cases, an unusual number of such cases in a given locality may indicate a bio-terrorism attack. Therefore, in order to identify such attacks it is necessary to track incidences of these common diseases as well, and the corresponding data would need to be reported to public health agencies. However, the common respiratory diseases are not reportable diseases by law. The solution proposed in [114] is that of "selective revelation", which initially allows only limited access to the data. However, in the event of suspicious activity, it allows a "drill-down" into the underlying data. This provides more identifiable information in accordance with public health law.

2.7.3 Homeland Security Applications

A number of applications for homeland security are inherently intrusive because of the very nature of surveillance. In [113], a broad overview is provided on how privacy-preserving techniques may be used in order to deploy these applications effectively without violating user privacy. Some examples of such applications are as follows:

Credential Validation Problem: In this problem, we are trying to match the subject of the credential to the person presenting the credential. For example, the theft of social security numbers presents a serious threat to homeland security. In the credential validation approach [113], an attempt is made to exploit the semantics associated with the social security number to determine whether the person presenting the SSN credential truly owns it.

Identity Theft: A related technology [115] is to use a more active approach to avoid identity theft. The identity angel system [115] crawls through cyberspace, and determines people who are at risk from identity theft. This information can be used to notify appropriate parties. We note that both of the above approaches to prevention of identity theft are relatively non-invasive and therefore do not violate privacy.

Web Camera Surveillance: One possible method for surveillance is with the use of publicly available webcams [113, 116], which can be used to detect unusual activity. We note that this is a much more invasive approach than the previously discussed techniques, because person-specific information is captured in the webcams. The approach can be made more privacy-sensitive by extracting only facial count information from the images and using these counts in order to detect unusual activity. It has been hypothesized in [116] that unusual activity can be detected in terms of facial counts alone, rather than using more specific information about particular individuals. In effect, this kind of approach uses a domain-specific downgrading of the information available in the webcams in order to make the approach privacy-sensitive.

Video-Surveillance: In the context of sharing video-surveillance data, a major threat is the use of facial recognition software, which can match the facial images in videos to the facial images in a driver license database. While a straightforward solution is to completely black out each face, the result is of limited value, since all facial information has been wiped out. A more balanced approach [96] is to use selective downgrading of the facial information, so that it scientifically limits the ability of facial recognition software to reliably identify faces, while maintaining facial details in images. The algorithm is referred to as k-Same, and the key idea is to identify faces which are somewhat similar, and then construct new faces which combine features from these similar faces. Thus, the identity of the underlying individual is anonymized to a certain extent, but the video continues to remain useful. Thus, this approach has the flavor of a k-anonymity approach, except that it creates new synthesized data for the application at hand.
The Watch List Problem: The motivation behind this problem [113] is that the government typically has a list of known terrorists or suspected entities which it wishes to track in the population. The aim is to view transactional data such as store purchases, hospital admissions, airplane manifests, hotel registrations or school attendance records in order to identify or track these entities. This is a difficult problem because the transactional data is private, and the privacy of subjects who do not appear on the watch list needs to be protected. Therefore, the transactional behavior of non-suspicious subjects may not be identified or revealed. Furthermore, the problem is even more difficult if we assume that the watch list cannot be revealed to the data holders. The second assumption is a result of the fact that members on the watch list may only be suspected entities and should have some level of protection from being identified to the general public as suspected terrorists. The watch list problem is currently an open problem [113].

2.7.4 Genomic Privacy

Recent years have seen tremendous advances in the science of DNA sequencing and forensic analysis with the use of DNA. As a result, the databases of collected DNA are growing very fast in both the medical and law enforcement communities. DNA data is considered extremely sensitive, since it contains almost uniquely identifying information about an individual.

As in the case of multi-dimensional data, simple removal of directly identifying data such as the social security number is not sufficient to prevent re-identification. In [86], it has been shown that a software called CleanGene can determine the identifiability of DNA entries independent of any other demographic or otherwise identifiable information. The software relies on publicly available medical data and knowledge of particular diseases in order to assign identifications to DNA entries. It was shown in [86] that 98-100% of the individuals are identifiable using this approach. The identification is done by taking the DNA sequence of an individual and then constructing a genetic profile corresponding to the sex, genetic diseases, the location where the DNA was collected, and so on. This genetic profile has been shown in [86] to be quite effective in narrowing the identity of the individual down to a much smaller group. One way to protect the anonymity of such sequences is with the use of generalization lattices [87], which are constructed in such a way that an entry in the modified database cannot be distinguished from at least (k − 1) other entries. Another approach discussed in [11] constructs synthetic data which preserves the aggregate characteristics of the original data, but preserves the privacy of the original records. Another method for compromising the privacy of genomic data is that of trail re-identification, in which the uniqueness of patient visit patterns [84, 85] is exploited in order to make identifications. The premise of this work is that patients often visit and leave behind genomic data at various distributed locations and hospitals. The hospitals usually separate out the clinical data from the genomic data and make the genomic data available for research purposes. While the data is seemingly anonymous, the visit location pattern of the patients is encoded in the site from which the data is released.
It has been shown in [84, 85] that this information may be combined with publicly available data in order to perform unique re-identifications. Some broad ideas for protecting the privacy in such scenarios are discussed in [85]. 2.8 Summary In this paper, we presented a survey of the broad areas of privacy-preserving data mining and the underlying algorithms. We discussed a variety of data modification techniques such as randomization and k-anonymity based tech- niques. We discussed methods for distributed privacy-preserving mining, and the methods for handling horizontally and vertically partitioned data. We dis- cussed the issue of downgrading the effectiveness of data mining and data management applications such as association rule mining, classification, and query processing. We discussed some fundamental limitations of the problem of privacy-preservation in the presence of increased amounts of public infor- mation and background knowledge. Finally, we discussed a number of diverse application domains for which privacy-preserving data mining methods are useful. References [1] Adam N., Wortmann J. C.: Security-Control Methods for Statistical Databases: A Comparison Study. ACM Computing Surveys, 21(4), 1989. [2] Agrawal R., Srikant R. Privacy-Preserving Data Mining. Proceedings of the ACM SIGMOD Conference, 2000. [3] Agrawal R., Srikant R., Thomas D. Privacy-Preserving OLAP. Proceed- ings of the ACM SIGMOD Conference, 2005. [4] Agrawal R., Bayardo R., Faloutsos C., Kiernan J., Rantzau R., Srikant R.: Auditing Compliance via a hippocratic database. VLDB Conference, 2004. [5] Agrawal D. Aggarwal C. C. On the Design and Quantification of Privacy-Preserving Data Mining Algorithms. ACM PODS Conference, 2002. [6] Aggarwal C., Pei J., Zhang B. A Framework for Privacy Preservation against Adversarial Data Mining. ACM KDD Conference, 2006. [7] Aggarwal C. C. On k-anonymity and the curse of dimensionality. VLDB Conference, 2005. [8] Aggarwal C. C., Yu P. S.: A Condensation approach to privacy preserv- ing data mining. EDBT Conference, 2004. [9] Aggarwal C. C., Yu P. S.: On Variable Constraints in Privacy-Preserving Data Mining. SIAM Conference, 2005. 44 Privacy-Preserving Data Mining: Models and Algorithms [10] Aggarwal C. C.: On Randomization, Public Information and the Curse of Dimensionality. ICDE Conference, 2007. [11] Aggarwal C. C., Yu P. S.: On Privacy-Preservation of Text and Sparse Binary Data with Sketches. SIAM Conference on Data Mining, 2007. [12] Aggarwal C. C., Yu P. S. On Anonymization of String Data. SIAM Con- ference on Data Mining, 2007. [13] Aggarwal G., Feder T., Kenthapadi K., Motwani R., Panigrahy R., Thomas D., Zhu A.: Anonymizing Tables. ICDT Conference, 2005. [14] Aggarwal G., Feder T., Kenthapadi K., Motwani R., Panigrahy R., Thomas D., Zhu A.: Approximation Algorithms for k-anonymity. Jour- nal of Privacy Technology, paper 20051120001, 2005. [15] Aggarwal G., Feder T., Kenthapadi K., Khuller S., Motwani R., Pan- igrahy R., Thomas D., Zhu A.: Achieving Anonymity via Clustering. ACM PODS Conference, 2006. [16] Atallah, M., Elmagarmid, A., Ibrahim, M., Bertino, E., Verykios, V.: Disclosure limitation of sensitive rules, Workshop on Knowledge and Data Engineering Exchange, 1999. [17] Bawa M., Bayardo R. J., Agrawal R.: Privacy-Preserving Indexing of Documents on the Network. VLDB Conference, 2003. [18] Bayardo R. J., Agrawal R.: Data Privacy through Optimal k- Anonymization. Proceedings of the ICDE Conference, pp. 217–228, 2005. 
[19] Bertino E., Fovino I., Provenza L.: A Framework for Evaluating Privacy-Preserving Data Mining Algorithms. Data Mining and Knowl- edge Discovery Journal, 11(2), 2005. [20] Bettini C., Wang X. S., Jajodia S.: Protecting Privacy against Location Based Personal Identification. Proc. of Secure Data Management Work- shop, Trondheim, Norway, 2005. [21] Biskup J., Bonatti P.: Controlled Query Evaluation for Known Policies by Combining Lying and Refusal. Annals of Mathematics and Artificial Intelligence, 40(1-2), 2004. [22] Blum A., Dwork C., McSherry F., Nissim K.: Practical Privacy: The SuLQ Framework. ACM PODS Conference, 2005. [23] Chang L., Moskowitz I.: An integrated framwork for database inference and privacy protection. Data and Applications Security. Kluwer, 2000. [24] Chang L., Moskowitz I.: Parsimonious downgrading and decision trees applied to the inference problem. New Security Paradigms Workshop, 1998. A General Survey of Privacy-Preserving Data Mining Models and Algorithms 45 [25] Chaum D., Crepeau C., Damgard I.: Multiparty unconditionally secure protocols. ACM STOC Conference, 1988. [26] Chawla S., Dwork C., McSherry F., Smith A., Wee H.: Towards Privacy in Public Databases, TCC, 2005. [27] Chawla S., Dwork C., McSherry F., Talwar K.: On the Utility of Privacy- Preserving Histograms, UAI, 2005. [28] Chen K., Liu L.: Privacy-preserving data classification with rotation per- turbation. ICDM Conference, 2005. [29] Chin F.: Security Problems on Inference Control for SUM, MAX, and MIN Queries. J. of the ACM, 33(3), 1986. [30] Chin F., Ozsoyoglu G.: Auditing for Secure Statistical Databases. Pro- ceedings of the ACM’81 Conference, 1981. [31] Ciriani V., De Capitiani di Vimercati S., Foresti S., Samarati P.: k-Anonymity. Security in Decentralized Data Management, ed. Jajodia S., Yu T., Springer, 2006. [32] Clifton C., Kantarcioglou M., Lin X., Zhu M.: Tools for privacy- preserving distributed data mining. ACM SIGKDD Explorations, 4(2), 2002. [33] Clifton C., Marks D.: Security and Privacy Implications of Data Min- ing., Workshop on Data Mining and Knowledge Discovery, 1996. [34] Dasseni E., Verykios V., Elmagarmid A., Bertino E.: Hiding Association Rules using Confidence and Support, 4th Information Hiding Workshop, 2001. [35] Denning D.: Secure Statistical Databases with Random Sample Queries. ACM TODS Journal, 5(3), 1980. [36] Dinur I., Nissim K.: Revealing Information while preserving privacy. ACM PODS Conference, 2003. [37] Dobkin D., Jones A., Lipton R.: Secure Databases: Protection against User Influence. ACM Transactions on Databases Systems, 4(1), 1979. [38] Domingo-Ferrer J,, Mateo-Sanz J.: Practical data-oriented micro- aggregation for statistical disclosure control. IEEE TKDE, 14(1), 2002. [39] Du W., Atallah M.: Secure Multi-party Computation: A Review and Open Problems.CERIAS Tech. Report 2001-51, Purdue University, 2001. [40] Du W., Han Y. S., Chen S.: Privacy-Preserving Multivariate Statistical Analysis: Linear Regression and Classification, Proc. SIAM Conf. Data Mining, 2004. [41] Du W., Atallah M.: Privacy-Preserving Cooperative Statistical Analysis, 17th Annual Computer Security Applications Conference, 2001. 46 Privacy-Preserving Data Mining: Models and Algorithms [42] Dwork C., Nissim K.: Privacy-Preserving Data Mining on Vertically Partitioned Databases, CRYPTO, 2004. [43] Dwork C., Kenthapadi K., McSherry F., Mironov I., Naor M.: Our Data, Ourselves: Privacy via Distributed Noise Generation. EUROCRYPT, 2006. 
[44] Dwork C., McSherry F., Nissim K., Smith A.: Calibrating Noise to Sen- sitivity in Private Data Analysis, TCC, 2006. [45] Even S., Goldreich O., Lempel A.: A Randomized Protocol for Signing Contracts. Communications of the ACM, vol 28, 1985. [46] Evfimievski A., Gehrke J., Srikant R. Limiting Privacy Breaches in Pri- vacy Preserving Data Mining. ACM PODS Conference, 2003. [47] Evfimievski A., Srikant R., Agrawal R., Gehrke J.: Privacy-Preserving Mining of Association Rules. ACM KDD Conference, 2002. [48] Evfimievski A.: Randomization in Privacy-Preserving Data Mining. ACM SIGKDD Explorations, 4, 2003. [49] Fienberg S., McIntyre J.: Data Swapping: Variations on a Theme by Dalenius and Reiss. Technical Report, National Institute of Statistical Sciences, 2003. [50] Fung B., Wang K., Yu P.: Top-Down Specialization for Information and Privacy Preservation. ICDE Conference, 2005. [51] Gambs S., Kegl B., Aimeur E.: Privacy-Preserving Boosting. Knowl- edge Discovery and Data Mining Journal, to appear. [52] Gedik B., Liu L.: A customizable k-anonymity model for protecting location privacy, ICDCS Conference, 2005. [53] Goldreich O.: Secure Multi-Party Computation, Unpublished Manu- script, 2002. [54] Huang Z., Du W., Chen B.: Deriving Private Information from Random- ized Data. pp. 37–48, ACM SIGMOD Conference, 2005. [55] Hore B., Mehrotra S., Tsudik B.: A Privacy-Preserving Index for Range Queries. VLDB Conference, 2004. [56] Hughes D, Shmatikov V.: Information Hiding, Anonymity, and Privacy: A modular Approach. Journal of Computer Security, 12(1), 3–36, 2004. [57] Inan A., Saygin Y., Savas E., Hintoglu A., Levi A.: Privacy-Preserving Clustering on Horizontally Partitioned Data. Data Engineering Work- shops, 2006. [58] Ioannidis I., Grama A., Atallah M.: A secure protocol for computing dot products in clustered and distributed environments, International Con- ference on Parallel Processing, 2002. A General Survey of Privacy-Preserving Data Mining Models and Algorithms 47 [59] Iyengar V. S.: Transforming Data to Satisfy Privacy Constraints. KDD Conference, 2002. [60] Jiang W., Clifton C.: Privacy-preserving distributed k-Anonymity. Pro- ceedings of the IFIP 11.3 Working Conference on Data and Applications Security, 2005. [61] Johnson W., Lindenstrauss J.: Extensions of Lipshitz Mapping into Hilbert Space, Contemporary Math. vol. 26, pp. 189-206, 1984. [62] Jagannathan G., Wright R.: Privacy-Preserving Distributed k-means clustering over arbitrarily partitioned data. ACM KDD Conference, 2005. [63] Jagannathan G., Pillaipakkamnatt K., Wright R.: A New Privacy- Preserving Distributed k-Clustering Algorithm. SIAM Conference on Data Mining, 2006. [64] Kantarcioglu M., Clifton C.: Privacy-Preserving Distributed Mining of Association Rules on Horizontally Partitioned Data. IEEE TKDE Jour- nal, 16(9), 2004. [65] Kantarcioglu M., Vaidya J.: Privacy-Preserving Naive Bayes Classi- fier for Horizontally Partitioned Data. IEEE Workshop on Privacy- Preserving Data Mining, 2003. [66] Kargupta H., Datta S., Wang Q., Sivakumar K.: On the Privacy Preserv- ing Properties of Random Data Perturbation Techniques. ICDM Confer- ence, pp. 99-106, 2003. [67] Karn J., Ullman J.: A model of statistical databases and their security. ACM Transactions on Database Systems, 2(1):1–10, 1977. [68] Kenthapadi K.,Mishra N., Nissim K.: Simulatable Auditing, ACM PODS Conference, 2005. [69] Kifer D., Gehrke J.: Injecting utility into anonymized datasets. SIGMOD Conference, pp. 217-228, 2006. 
[70] Kim J., Winkler W.: Multiplicative Noise for Masking Continuous Data, Technical Report Statistics 2003-01, Statistical Research Division, US Bureau of the Census, Washington D.C., Apr. 2003. [71] Kleinberg J., Papadimitriou C., Raghavan P.: Auditing Boolean At- tributes. Journal of Computer and System Sciences, 6, 2003. [72] Koudas N., Srivastava D., Yu T., Zhang Q.: Aggregate Query Answering on Anonymized Tables. ICDE Conference, 2007. [73] Lakshmanan L., Ng R., Ramesh G. To Do or Not To Do: The Dilemma of Disclosing Anonymized Data. ACM SIGMOD Conference, 2005. [74] Liew C. K., Choi U. J., Liew C. J. A data distortion by probability dis- tribution. ACM TODS, 10(3):395-411, 1985. 48 Privacy-Preserving Data Mining: Models and Algorithms [75] LeFevre K., DeWitt D., Ramakrishnan R.: Incognito: Full Domain K-Anonymity. ACM SIGMOD Conference, 2005. [76] LeFevre K., DeWitt D., Ramakrishnan R.: Mondrian Multidimensional K-Anonymity. ICDE Conference, 25, 2006. [77] LeFevre K., DeWitt D., Ramakrishnan R.: Workload Aware Anonymization. KDD Conference, 2006. [78] Li F., Sun J., Papadimitriou S. Mihaila G., Stanoi I.: Hiding in the Crowd: Privacy Preservation on Evolving Streams through Correlation Tracking. ICDE Conference, 2007. [79] Li N., Li T., Venkatasubramanian S: t-Closeness: Orivacy beyond k-anonymity and l-diversity. ICDE Conference, 2007. [80] Lindell Y., Pinkas B.: Privacy-Preserving Data Mining. CRYPTO, 2000. [81] Liu K., Kargupta H., Ryan J.: Random Projection Based Multiplicative Data Perturbation for Privacy Preserving Distributed Data Mining. IEEE Transactions on Knowledge and Data Engineering, 18(1), 2006. [82] Liu K., Giannella C. Kargupta H.: An Attacker’s View of Distance Pre- serving Maps for Privacy-Preserving Data Mining. PKDD Conference, 2006. [83] Machanavajjhala A., Gehrke J., Kifer D., and Venkitasubramaniam M.: l-Diversity: Privacy Beyond k-Anonymity. ICDE, 2006. [84] Malin B, Sweeney L. Re-identification of DNA through an automated linkage process. Journal of the American Medical Informatics Associa- tion, pp. 423–427, 2001. [85] Malin B. Why methods for genomic data privacy fail and what we can do to fix it, AAAS Annual Meeting, Seattle, WA, 2004. [86] Malin B., Sweeney L.: Determining the identifiability of DNA database entries. Journal of the American Medical Informatics Association, pp. 537–541, November 2000. [87] Malin, B. Protecting DNA Sequence Anonymity with Generalization Lattices. Methods of Information in Medicine, 44(5): 687-692, 2005. [88] Martin D., Kifer D., Machanavajjhala A., Gehrke J., Halpern J.: Worst- Case Background Knowledge. ICDE Conference, 2007. [89] Meyerson A., Williams R. On the complexity of optimal k-anonymity. ACM PODS Conference, 2004. [90] Mishra N., Sandler M.: Privacy vis Pseudorandom Sketches. ACM PODS Conference, 2006. [91] Mukherjee S., Chen Z., Gangopadhyay S.: A privacy-preserving tech- nique for Euclidean distance-based mining algorithms using Fourier based transforms, VLDB Journal, 2006. A General Survey of Privacy-Preserving Data Mining Models and Algorithms 49 [92] Moskowitz I., Chang L.: A decision theoretic system for information downgrading. Joint Conference on Information Sciences, 2000. [93] Nabar S., Marthi B., Kenthapadi K., Mishra N., Motwani R.: Towards Robustness in Query Auditing. VLDB Conference, 2006. [94] Naor M., Pinkas B.: Efficient Oblivious Transfer Protocols, SODA Con- ference, 2001. [95] Natwichai J., Li X., Orlowska M.: A Reconstruction-based Algorithm for Classification Rules Hiding. 
Australasian Database Conference, 2006. [96] Newton E., Sweeney L., Malin B.: Preserving Privacy by De-identifying Facial Images. IEEE Transactions on Knowledge and Data Engineer- ing, IEEE TKDE, February 2005. [97] Oliveira S. R. M., Zaane O.: Privacy Preserving Clustering by Data Transformation, Proc. 18th Brazilian Symp. Databases, pp. 304-318, Oct. 2003. [98] Oliveira S. R. M., Zaiane O.: Data Perturbation by Rotation for Privacy- Preserving Clustering, Technical Report TR04-17, Department of Com- puting Science, University of Alberta, Edmonton, AB, Canada, August 2004. [99] Oliveira S. R. M., Zaiane O., Saygin Y.: Secure Association-Rule Shar- ing. PAKDD Conference, 2004. [100] Park H., Shim K. Approximate Algorithms for K-anonymity. ACM SIG- MOD Conference, 2007. [101] Pei J., Xu J., Wang Z., Wang W., Wang K.: Maintaining k-Anonymity against Incremental Updates. Symposium on Scientific and Statistical Database Management, 2007. [102] Pinkas B.: Cryptographic Techniques for Privacy-Preserving Data Min- ing. ACM SIGKDD Explorations, 4(2), 2002. [103] Polat H., Du W.: SVD-based collaborative filtering with privacy. ACM SAC Symposium, 2005. [104] Polat H., Du W.: Privacy-Preserving Top-N Recommendations on Hor- izontally Partitioned Data. Web Intelligence, 2005. [105] Rabin M. O.: How to exchange secrets by oblivious transfer, Technical Report TR-81, Aiken Corporation Laboratory, 1981. [106] Reiss S.: Security in Databases: A combinatorial Study, Journal of ACM, 26(1), 1979. [107] Rizvi S., Haritsa J.: Maintaining Data Privacy in Association Rule Min- ing. VLDB Conference, 2002. 50 Privacy-Preserving Data Mining: Models and Algorithms [108] Saygin Y., Verykios V., Clifton C.: Using Unknowns to prevent discov- ery of Association Rules, ACM SIGMOD Record, 30(4), 2001. [109] Saygin Y., Verykios V., Elmagarmid A.: Privacy-Preserving Association Rule Mining, 12th International Workshop on Research Issues in Data Engineering, 2002. [110] Samarati P.: Protecting Respondents’ Identities in Microdata Release. IEEE Trans. Knowl. Data Eng. 13(6): 1010-1027 (2001). [111] Shannon C. E.: The Mathematical Theory of Communication, Univer- sity of Illinois Press, 1949. [112] Silverman B. W.: Density Estimation for Statistics and Data Analysis. Chapman and Hall, 1986. [113] Sweeney L.: Privacy Technologies for Homeland Security. Testimony before the Privacy and Integrity Advisory Committee of the Deprtment of Homeland Scurity, Boston, MA, June 15, 2005. [114] Sweeney L.: Privacy-Preserving Bio-terrorism Surveillance. AAAI Spring Symposium, AI Technologies for Homeland Security, 2005. [115] Sweeney L.: AI Technologies to Defeat Identity Theft Vulnerabilities. AAAI Spring Symposium, AI Technologies for Homeland Security, 2005. [116] Sweeney L., Gross R.: Mining Images in Publicly-Available Cameras for Homeland Security. AAAI Spring Symposium, AI Technologies for Homeland Security, 2005. [117] Sweeney L.: Guaranteeing Anonymity while Sharing Data, the Datafly System. Journal of the American Medical Informatics Association, 1997. [118] Sweeney L.: Replacing Personally Identifiable Information in Medical Records, the Scrub System. Journal of the American Medical Informat- ics Association, 1996. [119] Vaidya J., Clifton C.: Privacy-Preserving Association Rule Mining in Vertically Partitioned Databases. ACM KDD Conference, 2002. [120] Vaidya J., Clifton C.: Privacy-Preserving k-means clustering over verti- cally partitioned Data. ACM KDD Conference, 2003. 
[121] Vaidya J., Clifton C.: Privacy-Preserving Naive Bayes Classifier over vertically partitioned data. SIAM Conference, 2004. [122] Vaidya J., Clifton C.: Privacy-Preserving Decision Trees over vertically partitioned data. Lecture Notes in Computer Science, Vol 3654, 2005. [123] Verykios V. S., Bertino E., Fovino I. N., Provenza L. P., Saygin Y., Theodoridis Y.: State-of-the-art in privacy preserving data mining. ACM SIGMOD Record, v.33 n.1, 2004. A General Survey of Privacy-Preserving Data Mining Models and Algorithms 51 [124] Verykios V. S., Elmagarmid A., Bertino E., Saygin Y.,, Dasseni E.: As- sociation Rule Hiding. IEEE Transactions on Knowledge and Data En- gineering, 16(4), 2004. [125] Wang K., Yu P., Chakraborty S.: Bottom-Up Generalization: A Data Mining Solution to Privacy Protection. ICDM Conference, 2004. [126] Wang K., Fung B. C. M., Yu P. Template based Privacy -Preservation in classification problems. ICDM Conference, 2005. [127] Wang K., Fung B. C. M.: Anonymization for Sequential Releases. ACM KDD Conference, 2006. [128] Wang K., Fung B. C. M., Dong G.: Integarting Private Databases for Data Analysis. Lecture Notes in Computer Science, 3495, 2005. [129] Warner S. L. Randomized Response: A survey technique for eliminat- ing evasive answer bias. Journal of American Statistical Association, 60(309):63–69, March 1965. [130] Winkler W.: Using simulated annealing for k-anonymity. Technical Report 7, US Census Bureau. [131] Wu Y.-H., Chiang C.-M., Chen A. L. P.: Hiding Sensitive Association Rules with Limited Side Effects. IEEE Transactions on Knowledge and Data Engineering, 19(1), 2007. [132] Xiao X., Tao Y.. Personalized Privacy Preservation. ACM SIGMOD Conference, 2006. [133] Xiao X., Tao Y. Anatomy: Simple and Effective Privacy Preservation. VLDB Conference, pp. 139-150, 2006. [134] Xiao X., Tao Y.: m-Invariance: Towards Privacy-preserving Re- publication of Dynamic Data Sets. SIGMOD Conference, 2007. [135] Xu J., Wang W., Pei J., Wang X., Shi B., Fu A. W. C.: Utility Based Anonymization using Local Recoding. ACM KDD Conference, 2006. [136] Xu S., Yung M.: k-anonymous secret handshakes with reusable cre- dentials. ACM Conference on Computer and Communications Security, 2004. [137] Yao A. C.: How to Generate and Exchange Secrets. FOCS Conferemce, 1986. [138] Yao G., Feng D.: A new k-anonymous message transmission protocol. International Workshop on Information Security Applications, 2004. [139] Yang Z., Zhong S., Wright R.: Privacy-Preserving Classification of Cus- tomer Data without Loss of Accuracy. SDM Conference, 2006. [140] Yao C., Wang S., Jajodia S.: Checking for k-Anonymity Violation by views. ACM Conference on Computer and Communication Security, 2004. 52 Privacy-Preserving Data Mining: Models and Algorithms [141] Yu H., Jiang X., Vaidya J.: Privacy-Preserving SVM using nonlinear Kernels on Horizontally Partitioned Data. SAC Conference, 2006. [142] Yu H., Vaidya J., Jiang X.: Privacy-Preserving SVM Classification on Vertically Partitioned Data. PAKDD Conference, 2006. [143] Zhang P., Tong Y., Tang S., Yang D.: Privacy-Preserving Naive Bayes Classifier. Lecture Notes in Computer Science, Vol 3584, 2005. [144] Zhong S., Yang Z., Wright R.: Privacy-enhancing k-anonymization of customer data, In Proceedings of the ACM SIGMOD-SIGACT-SIGART Principles of Database Systems, Baltimore, MD. 2005. [145] Zhu Y., Liu L. Optimal Randomization for Privacy- Preserving Data Mining. ACM KDD Conference, 2004. 
Chapter 3

A Survey of Inference Control Methods for Privacy-Preserving Data Mining

Josep Domingo-Ferrer∗
Rovira i Virgili University of Tarragona†
UNESCO Chair in Data Privacy
Dept. of Computer Engineering and Mathematics
Av. Països Catalans 26, E-43007 Tarragona, Catalonia
josep.domingo@urv.cat

Abstract: Inference control in databases, also known as Statistical Disclosure Control (SDC), is about protecting data so they can be published without revealing confidential information that can be linked to specific individuals among those to which the data correspond. This is an important application in several areas, such as official statistics, health statistics, e-commerce (sharing of consumer data), etc. Since data protection ultimately means data modification, the challenge for SDC is to achieve protection with minimum loss of the accuracy sought by database users. In this chapter, we survey the current state of the art in SDC methods for protecting individual data (microdata). We discuss several information loss and disclosure risk measures and analyze several ways of combining them to assess the performance of the various methods. Last but not least, topics which need more research in the area are identified and possible directions hinted.

Keywords: Privacy, inference control, statistical disclosure control, statistical disclosure limitation, statistical databases, microdata.

∗This work received partial support from the Spanish Ministry of Science and Education through project SEG2004-04352-C04-01 "PROPRIETAS", the Government of Catalonia under grant 2005 SGR 00446 and Eurostat through the CENEX SDC project. The author is solely responsible for the views expressed in this chapter, which do not necessarily reflect the position of UNESCO nor commit that organization.
†Part of this chapter was written while the author was a Visiting Fellow at Princeton University.

3.1 Introduction

Inference control in statistical databases, also known as Statistical Disclosure Control (SDC) or Statistical Disclosure Limitation (SDL), seeks to protect statistical data in such a way that they can be publicly released and mined without giving away private information that can be linked to specific individuals or entities. There are several areas of application of SDC techniques, which include but are not limited to the following:

Official statistics. Most countries have legislation which compels national statistical agencies to guarantee statistical confidentiality when they release data collected from citizens or companies. This justifies the research on SDC undertaken by several countries, among them the European Union (e.g. the CASC project [8]) and the United States.

Health information. This is one of the most sensitive areas regarding privacy. For example, in the U.S., the Privacy Rule of the Health Insurance Portability and Accountability Act (HIPAA [43]) requires the strict regulation of protected health information for use in medical research. In most western countries, the situation is similar.

E-commerce. Electronic commerce results in the automated collection of large amounts of consumer data. This wealth of information is very useful to companies, which are often interested in sharing it with their subsidiaries or partners.
Such consumer information transfer should not result in public profiling of individuals and is subject to strict regulation; see [28] for regulations in the European Union and [77] for regulations in the U.S.

The protection provided by SDC techniques normally entails some degree of data modification, which is an intermediate option between no modification (maximum utility, but no disclosure protection) and data encryption (maximum protection but no utility for the user without clearance). The challenge for SDC is to modify data in such a way that sufficient protection is provided while keeping the information loss, i.e. the loss of the accuracy sought by database users, to a minimum. In the years that have elapsed since the excellent survey by [3], the state of the art in SDC has evolved so that now at least three subdisciplines are clearly differentiated:

Tabular data protection. This is the oldest and best established part of SDC, because tabular data have been the traditional output of national statistical offices. The goal here is to publish static aggregate information, i.e. tables, in such a way that no confidential information on specific individuals among those to which the table refers can be inferred. See [79] for a conceptual survey and [36] for a software survey.

Dynamic databases. The scenario here is a database to which the user can submit statistical queries (sums, averages, etc.). The aggregate information obtained by a user as a result of successive queries should not allow him to infer information on specific individuals. Since the 80s, this has been known to be a difficult problem, subject to the tracker attack [69]. One possible strategy is to perturb the answers to queries; solutions based on perturbation can be found in [26], [54] and [76]. If perturbation is not acceptable and exact answers are needed, it may become necessary to refuse answers to certain queries; solutions based on query restriction can be found in [9] and [38]. Finally, a third strategy is to provide correct (unperturbed) interval answers, as done in [37] and [35].

Microdata protection. This subdiscipline is about protecting static individual data, also called microdata. It is only recently that data collectors (statistical agencies and the like) have been persuaded to publish microdata. Therefore, microdata protection is the youngest subdiscipline and has been experiencing continuous evolution in recent years.

Good general works on SDC are [79, 45]. This survey will cover the current state of the art in SDC methods for microdata, the most common data used for data mining. First, the main existing methods will be described. Then, we will discuss several information loss and disclosure risk measures and will analyze several approaches to combining them when assessing the performance of the various methods. The comparison metrics being presented should be used as a benchmark for future developments in this area. Open research issues and directions will be suggested at the end of this chapter.

Plan of This Chapter. Section 3.2 introduces a classification of microdata protection methods. Section 3.3 reviews perturbative masking methods. Section 3.4 reviews non-perturbative masking methods. Section 3.5 reviews methods for synthetic microdata generation. Section 3.6 discusses approaches to trade off information loss for disclosure risk and analyzes their strengths and limitations.
Conclusions and directions for future research are summarized in Section 3.7.

3.2 A Classification of Microdata Protection Methods

A microdata set V can be viewed as a file with n records, where each record contains m attributes on an individual respondent. The attributes can be classified into four categories, which are not necessarily disjoint:

Identifiers. These are attributes that unambiguously identify the respondent. Examples are the passport number, social security number, name-surname, etc.

Quasi-identifiers or key attributes. These are attributes which identify the respondent with some degree of ambiguity. (Nonetheless, a combination of quasi-identifiers may provide unambiguous identification.) Examples are address, gender, age, telephone number, etc.

Confidential outcome attributes. These are attributes which contain sensitive information on the respondent. Examples are salary, religion, political affiliation, health condition, etc.

Non-confidential outcome attributes. Those attributes which do not fall in any of the categories above.

Since the purpose of SDC is to prevent confidential information from being linked to specific respondents, we will assume in what follows that original microdata sets to be protected have been pre-processed to remove from them all identifiers. The purpose of microdata SDC mentioned in the previous section can be stated more formally by saying that, given an original microdata set V, the goal is to release a protected microdata set V′ in such a way that:

1 Disclosure risk (i.e. the risk that a user or an intruder can use V′ to determine confidential attributes on a specific individual among those in V) is low.

2 User analyses (regressions, means, etc.) on V′ and on V yield the same or at least similar results.

Microdata protection methods can generate the protected microdata set V′ either by masking the original data, i.e. generating a modified version V′ of the original microdata set V, or by generating synthetic data V′ that preserve some statistical properties of the original data V. Masking methods can in turn be divided into two categories depending on their effect on the original data [79]:

Perturbative. The microdata set is distorted before publication. In this way, unique combinations of scores in the original dataset may disappear and new unique combinations may appear in the perturbed dataset; such confusion is beneficial for preserving statistical confidentiality. The perturbation method used should be such that statistics computed on the perturbed dataset do not differ significantly from the statistics that would be obtained on the original dataset.

Non-perturbative. Non-perturbative methods do not alter data; rather, they produce partial suppressions or reductions of detail in the original dataset. Global recoding, local suppression and sampling are examples of non-perturbative masking.
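The following toy Python sketch illustrates the two masking families just described on a made-up microdata set: identifiers are dropped, quasi-identifiers are globally recoded (non-perturbative masking), and a confidential numerical attribute receives additive noise (perturbative masking). The records and parameter choices are purely illustrative.

import random

original = [
    {"name": "A. Smith", "age": 34, "zip": "43001", "income": 52000},
    {"name": "B. Jones", "age": 37, "zip": "43002", "income": 61000},
    {"name": "C. Brown", "age": 52, "zip": "43001", "income": 48000},
]

def mask(record):
    protected = dict(record)
    del protected["name"]                          # remove identifiers
    decade = (record["age"] // 10) * 10
    protected["age"] = f"{decade}-{decade + 9}"    # global recoding
    protected["zip"] = record["zip"][:3] + "**"    # global recoding
    protected["income"] += random.gauss(0, 2000)   # additive noise
    return protected

protected_set = [mask(r) for r in original]        # the released V'
print(protected_set)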
At a first glance, synthetic data seem to have the philosophical advantage of circumventing the re-identification problem: since published records are invented and do not derive from any original record, some authors claim that no individual having supplied original data can complain of having been re-identified. At a closer look, some authors (e.g., [80] and [63]) claim that even synthetic data might contain some records that allow for re-identification of confidential information. In short, synthetic data overfitted to original data might lead to disclosure just as original data would. On the other hand, a clear problem of synthetic data is data utility: only the statistical properties explicitly selected by the data protector are preserved, which leads to the question whether the data protector should not directly publish the statistics he wants preserved rather than a synthetic microdata set. We will return to these issues in Section 3.5.

So far in this section, we have classified microdata protection methods by their operating principle. If we consider the type of data on which they can be used, a different dichotomic classification applies:

Continuous. An attribute is considered continuous if it is numerical and arithmetic operations can be performed with it. Examples are income and age. Note that a numerical attribute does not necessarily have an infinite range, as is the case for age. When designing methods to protect continuous data, one has the advantage that arithmetic operations are possible, and the drawback that every combination of numerical values in the original dataset is likely to be unique, which leads to disclosure if no action is taken.

Categorical. An attribute is considered categorical when it takes values over a finite set and standard arithmetic operations do not make sense. Ordinal and nominal scales can be distinguished among categorical attributes. In ordinal scales the order between values is relevant, whereas in nominal scales it is not. In the former case, max and min operations are meaningful, while in the latter case only pairwise comparison is possible. The instruction level is an example of an ordinal attribute, whereas eye color is an example of a nominal attribute. In fact, all quasi-identifiers in a microdata set are normally categorical nominal. When designing methods to protect categorical data, the inability to perform arithmetic operations is certainly inconvenient, but the finiteness of the value range is one property that can be successfully exploited.

3.3 Perturbative Masking Methods

Perturbative methods allow for the release of the entire microdata set, although perturbed values rather than exact values are released. Not all perturbative methods are designed for continuous data; this distinction is addressed further below for each method. Most perturbative methods reviewed below (including additive noise, rank swapping, microaggregation and post-randomization) are special cases of matrix masking. If the original microdata set is X, then the masked microdata set Z is computed as

Z = AXB + C

where A is a record-transforming mask, B is an attribute-transforming mask and C is a displacing mask (noise) [27]. Table 3.1 lists the perturbative methods described below. For each method, the table indicates whether it is suitable for continuous and/or categorical data.

Table 3.1. Perturbative methods vs data types. "X" denotes applicable and "(X)" denotes applicable with some adaptation

Method             Continuous data    Categorical data
Additive noise           X
Microaggregation         X                  (X)
Rank swapping            X                   X
Rounding                 X
Resampling               X
PRAM                                         X
MASSC                                        X

3.3.1 Additive Noise

The noise addition algorithms in the literature are the following:

Masking by uncorrelated noise addition. The vector of observations xj for the j-th attribute of the original dataset Xj is replaced by a vector zj = xj + εj, where εj is a vector of normally distributed errors drawn from a random variable εj ∼ N(0, σ²_εj), such that Cov(εt, εl) = 0 for all t ≠ l. This does not preserve variances nor correlations.

Masking by correlated noise addition. Correlated noise addition also preserves means and additionally allows preservation of correlation coefficients.
The difference with the previous method is that the covariance matrix of the errors is now proportional to the covariance matrix of the original data, i.e. ε ∼ N(0, Σε), where Σε = αΣ.

Masking by noise addition and linear transformation. In [49], a method is proposed that ensures by additional transformations that the sample covariance matrix of the masked attributes is an unbiased estimator for the covariance matrix of the original attributes.

Masking by noise addition and nonlinear transformation. An algorithm combining simple additive noise and nonlinear transformation is proposed in [72]. The advantages of this proposal are that it can be applied to discrete attributes and that univariate distributions are preserved. Unfortunately, as justified in [6], the application of this method is very time-consuming and requires expert knowledge on the data set and the algorithm.

For more details on specific algorithms, the reader can check [5]. In practice, only simple noise addition (the first two variants) or noise addition with linear transformation is used. When using linear transformations, a decision has to be made whether to reveal them to the data user to allow for bias adjustment in the case of subpopulations.

With the exception of the not very practical method of [72], additive noise is not suitable to protect categorical data. On the other hand, it is well suited for continuous data for the following reasons:

It makes no assumptions on the range of possible values for Vi (which may be infinite).

The noise being added is typically continuous and with mean zero, which suits continuous original data well.

No exact matching is possible with external files. Depending on the amount of noise added, approximate (interval) matching might be possible.
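A minimal sketch of correlated noise addition as described above follows: the noise covariance is taken proportional to the sample covariance of the original attributes, so correlation coefficients are approximately preserved. The data, the value of α, and the use of numpy are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal([50, 30], [[100, 60], [60, 80]], size=1000)

alpha = 0.1
sigma = np.cov(X, rowvar=False)                  # sample covariance of X
noise = rng.multivariate_normal(np.zeros(2), alpha * sigma, size=len(X))
Z = X + noise                                    # masked microdata set

print(np.corrcoef(X, rowvar=False)[0, 1])        # original correlation
print(np.corrcoef(Z, rowvar=False)[0, 1])        # roughly preserved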
3.3.3 Data Swapping and Rank Swapping

Data swapping was originally presented as an SDC method for databases containing only categorical attributes [11]. The basic idea behind the method is to transform a database by exchanging values of confidential attributes among individual records. Records are exchanged in such a way that low-order frequency counts or marginals are maintained.

Even though the original procedure was not much used in practice (see [32]), its basic idea had a clear influence on subsequent methods. In [59] and [58] data swapping was introduced to protect continuous and categorical microdata, respectively. Another variant of data swapping for microdata is rank swapping, which will be described next in some detail.

Although originally described only for ordinal attributes [40], rank swapping can also be used for any numerical attribute [53]. First, values of an attribute Xi are ranked in ascending order; then each ranked value of Xi is swapped with another ranked value randomly chosen within a restricted range (e.g. the rank of two swapped values cannot differ by more than p% of the total number of records, where p is an input parameter). This algorithm is independently used on each original attribute in the original data set.

It is reasonable to expect that multivariate statistics computed from data swapped with this algorithm will be less distorted than those computed after an unconstrained swap. In earlier empirical work by these authors on continuous microdata protection [21], rank swapping has been identified as a particularly well-performing method in terms of the tradeoff between disclosure risk and information loss (see Example 3.4 below). Consequently, it is one of the techniques that have been implemented in the µ-Argus package [44].

Table 3.2. Example of rank swapping. Left, original file; right, rankswapped file

 1  K   3.7   4.4      1  H   3.0   4.8
 2  L   3.8   3.4      2  L   4.5   3.2
 3  N   3.0   4.8      3  M   3.7   4.4
 4  M   4.5   5.0      4  N   5.0   6.0
 5  L   5.0   6.0      5  L   4.5   5.0
 6  H   6.0   7.5      6  F   6.7   9.5
 7  H   4.5  10.0      7  K   3.8  11.0
 8  F   6.7  11.0      8  H   6.0  10.0
 9  D   8.0   9.5      9  C  10.0   7.5
10  C  10.0   3.2     10  D   8.0   3.4

Example 3.2 In Table 3.2, we can see an original microdata set on the left and its rankswapped version on the right. There are four attributes and ten records in the original dataset; the second attribute is alphanumeric, and the standard alphabetic order has been used to rank it. A value of p = 10% has been used for all attributes.
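A minimal sketch of the rank-swapping procedure just described, assuming numerical attributes (for an alphanumeric attribute such as the second one in Table 3.2, ranking would use alphabetic order instead). The greedy left-to-right pairing over the ranked values is one simple choice among several possible ones, and the default p = 10 matches Example 3.2 only for illustration.

```python
import numpy as np

def rank_swap_attribute(values, p, rng):
    """Swap each ranked value with a randomly chosen partner whose rank
    differs by at most p% of the number of records."""
    n = len(values)
    order = np.argsort(values)                 # rank -> original position
    swapped = np.array(values, dtype=float)[order]
    max_dist = max(1, int(round(n * p / 100.0)))
    free = list(range(n))                      # ranks not yet swapped
    while free:
        i = free.pop(0)
        candidates = [j for j in free if j - i <= max_dist]
        if not candidates:
            continue                           # leave this value unswapped
        j = candidates[rng.integers(len(candidates))]
        free.remove(j)
        swapped[i], swapped[j] = swapped[j], swapped[i]
    out = np.empty(n)
    out[order] = swapped                       # undo the ranking
    return out

def rank_swap(X, p=10, seed=None):
    """Apply rank swapping independently to every column of a numeric matrix."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    return np.column_stack([rank_swap_attribute(X[:, j], p, rng)
                            for j in range(X.shape[1])])
```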
3.3.4 Rounding

Rounding methods replace original values of attributes with rounded values. For a given attribute Xi, rounded values are chosen among a set of rounding points defining a rounding set (often the multiples of a given base value). In a multivariate original dataset, rounding is usually performed one attribute at a time (univariate rounding); however, multivariate rounding is also possible [79, 10]. The operating principle of rounding makes it suitable for continuous data.

3.3.5 Resampling

Originally proposed for protecting tabular data [42, 17], resampling can also be used for microdata. Take t independent samples S1, ..., St of the values of an original attribute Xi. Sort all samples using the same ranking criterion. Build the masked attribute Zi as x̄1, ..., x̄n, where n is the number of records and x̄j is the average of the j-th ranked values in S1, ..., St.

3.3.6 PRAM

The Post-RAndomization Method (PRAM, [39]) is a probabilistic, perturbative method for disclosure protection of categorical attributes in microdata files. In the masked file, the scores on some categorical attributes for certain records in the original file are changed to a different score according to a prescribed probability mechanism, namely a Markov matrix. The Markov approach makes PRAM very general, because it encompasses noise addition, data suppression and data recoding.

PRAM information loss and disclosure risk largely depend on the choice of the Markov matrix and are still open research topics [14]. The PRAM matrix contains a row for each possible value of each attribute to be protected. This rules out using the method for continuous data.
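To illustrate the PRAM mechanism, the sketch below applies a Markov (transition) matrix to one categorical attribute; the three-category matrix used in the example is an arbitrary illustrative choice and has none of the optimality or invariance properties studied in [14].

```python
import numpy as np

def pram(column, categories, transition, seed=None):
    """Post-randomize one categorical attribute: each original category c is
    replaced by category c' with probability transition[c, c'] (a Markov matrix
    whose rows sum to one)."""
    rng = np.random.default_rng(seed)
    transition = np.asarray(transition, dtype=float)
    assert np.allclose(transition.sum(axis=1), 1.0)
    index = {c: i for i, c in enumerate(categories)}
    out = []
    for value in column:
        row = transition[index[value]]
        out.append(categories[rng.choice(len(categories), p=row)])
    return out

# Illustrative 3-category example: keep the true category with probability 0.8
# and move to each of the other two categories with probability 0.1.
cats = ["blue", "brown", "green"]
P = 0.8 * np.eye(3) + 0.1 * (np.ones((3, 3)) - np.eye(3))
masked = pram(["blue", "blue", "green", "brown"], cats, P, seed=42)
print(masked)
```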
3.3.7 MASSC

MASSC [71] is a masking method whose acronym summarizes its four steps: Micro Agglomeration, Substitution, Subsampling and Calibration. We briefly recall the purpose of those four steps:

1 Micro agglomeration is applied to partition the original dataset into risk strata (groups of records which are at a similar risk of disclosure). These strata are formed using the key attributes, i.e. the quasi-identifiers in the records. The idea is that those records with rarer combinations of key attributes are at a higher risk.

2 Optimal probabilistic substitution is then used to perturb the original data.

3 Optimal probabilistic subsampling is used to suppress some attributes or even entire records.

4 Optimal sampling weight calibration is used to preserve estimates for outcome attributes in the treated database whose accuracy is critical for the intended data use.

MASSC is interesting in that, to the best of our knowledge, it is the first attempt at designing a perturbative masking method in such a way that disclosure risk can be analytically quantified. Its main shortcoming is that its disclosure model simplifies reality by considering only disclosure resulting from linkage of key attributes with external sources. Since key attributes are typically categorical, the risk of disclosure can be analyzed by looking at the probability that a sample unique is a population unique; however, doing so ignores the fact that continuous outcome attributes can also be used for respondent re-identification via record linkage. As an example, if respondents are companies and turnover is one outcome attribute, everyone in a certain industrial sector knows which is the company with the largest turnover. Thus, in practice, MASSC is a method only suited for datasets where continuous attributes are not present.

3.4 Non-perturbative Masking Methods

Non-perturbative methods do not rely on distortion of the original data but on partial suppressions or reductions of detail. Some of the methods are usable on both categorical and continuous data, but others are not suitable for continuous data. Table 3.3 lists the non-perturbative methods described below. For each method, the table indicates whether it is suitable for continuous and/or categorical data.

Table 3.3. Non-perturbative methods vs data types

Method                   Continuous data    Categorical data
Sampling                                          X
Global recoding                X                  X
Top and bottom coding          X                  X
Local suppression                                 X

3.4.1 Sampling

Instead of publishing the original microdata file, what is published is a sample S of the original set of records [79]. Sampling methods are suitable for categorical microdata, but for continuous microdata they should probably be combined with other masking methods. The reason is that sampling alone leaves a continuous attribute Vi unperturbed for all records in S. Thus, if attribute Vi is present in an external administrative public file, unique matches with the published sample are very likely: indeed, given a continuous attribute Vi and two respondents o1 and o2, it is highly unlikely that Vi will take the same value for both o1 and o2 unless o1 = o2 (this is true even if Vi has been truncated to represent it digitally).

If, for a continuous identifying attribute, the score of a respondent is only approximately known by an attacker (as assumed in [78]), it might still make sense to use sampling methods to protect that attribute. However, assumptions on restricted attacker resources are perilous and may prove definitely too optimistic if good quality external administrative files are at hand.

3.4.2 Global Recoding

This method is also sometimes known as generalization [67, 66]. For a categorical attribute Vi, several categories are combined to form new (less specific) categories, thus resulting in a new V'i with |D(V'i)| < |D(Vi)|, where |·| is the cardinality operator. For a continuous attribute, global recoding means replacing Vi by another attribute V'i which is a discretized version of Vi. In other words, a potentially infinite range D(Vi) is mapped onto a finite range D(V'i). This is the technique used in the µ-Argus SDC package [44].

This technique is more appropriate for categorical microdata, where it helps disguise records with strange combinations of categorical attributes. Global recoding is used heavily by statistical offices.

Example 3.3 If there is a record with "Marital status = Widow/er" and "Age = 17", global recoding could be applied to "Marital status" to create a broader category "Widow/er or divorced", so that the probability of the above record being unique would diminish.

Global recoding can also be used on a continuous attribute, but the inherent discretization very often leads to an unaffordable loss of information. Also, arithmetical operations that were straightforward on the original Vi are no longer easy or intuitive on the discretized V'i.
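The following sketch illustrates global recoding on two hypothetical attributes: a ZIP-code-like attribute generalized by truncating trailing digits, and a marital-status attribute merged into the broader category of Example 3.3. The hierarchy, attribute names and record values are invented for illustration and are not taken from any of the packages cited above.

```python
def recode_zip(zipcode, level):
    """Generalize a ZIP code by replacing its last `level` digits with '*'
    (level 0 returns the code unchanged)."""
    zipcode = str(zipcode)
    return zipcode if level == 0 else zipcode[:-level] + "*" * level

def recode_marital_status(status):
    """Global recoding in the spirit of Example 3.3: widowed and divorced
    respondents are merged into one broader category."""
    return "Widow/er or divorced" if status in {"Widow/er", "Divorced"} else status

records = [
    {"ZIP": "84117", "Marital status": "Widow/er", "Age": 17},
    {"ZIP": "84118", "Marital status": "Married", "Age": 45},
]
recoded = [
    {**r,
     "ZIP": recode_zip(r["ZIP"], level=2),
     "Marital status": recode_marital_status(r["Marital status"])}
    for r in records
]
print(recoded)  # ZIP codes become 841**, the rare combination is blurred
```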
3.4.3 Top and Bottom Coding

Top and bottom coding is a special case of global recoding which can be used on attributes that can be ranked, that is, continuous or categorical ordinal. The idea is that top values (those above a certain threshold) are lumped together to form a new category. The same is done for bottom values (those below a certain threshold). See [44].

3.4.4 Local Suppression

Certain values of individual attributes are suppressed with the aim of increasing the set of records agreeing on a combination of key values. Ways to combine local suppression and global recoding are discussed in [16] and implemented in the µ-Argus SDC package [44].

If a continuous attribute Vi is part of a set of key attributes, then each combination of key values is probably unique. Since it does not make sense to systematically suppress the values of Vi, we conclude that local suppression is rather oriented to categorical attributes.
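A small illustration of the last two non-perturbative methods: top and bottom coding of a rankable attribute, and a deliberately crude form of local suppression that blanks the whole key-attribute combination of any record whose combination is rare (real implementations, such as the one in µ-Argus, choose which individual values to suppress much more carefully). Thresholds and attribute names are illustrative.

```python
from collections import Counter

def top_bottom_code(values, lower, upper):
    """Top- and bottom-code a rankable attribute: values above `upper` are
    lumped into one top category and values below `lower` into one bottom one."""
    coded = []
    for v in values:
        if v > upper:
            coded.append(f">{upper}")
        elif v < lower:
            coded.append(f"<{lower}")
        else:
            coded.append(v)
    return coded

def locally_suppress(records, key_attributes, min_count=3):
    """Blank (set to None) the key attributes of any record whose combination
    of key values is shared by fewer than `min_count` records."""
    combos = Counter(tuple(r[a] for a in key_attributes) for r in records)
    out = []
    for r in records:
        combo = tuple(r[a] for a in key_attributes)
        if combos[combo] < min_count:
            r = {**r, **{a: None for a in key_attributes}}
        out.append(r)
    return out

print(top_bottom_code([12, 55, 230, 7], lower=10, upper=100))  # [12, 55, '>100', '<10']
```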
3.5 Synthetic Microdata Generation

Publication of synthetic (i.e. simulated) data was proposed long ago as a way to guard against statistical disclosure. The idea is to randomly generate data with the constraint that certain statistics or internal relationships of the original dataset should be preserved. We next review some approaches in the literature to synthetic data generation and then proceed to discuss the global pros and cons of using synthetic data.

3.5.1 Synthetic Data by Multiple Imputation

More than twenty years ago, it was suggested in [65] to create an entirely synthetic dataset based on the original survey data and multiple imputation. Rubin's proposal was more completely developed in [57]. A simulation study of it was given in [60]. In [64] inference on synthetic data is discussed and in [63] an application is given.

We next sketch the operation of the original proposal by Rubin. Consider an original microdata set X of size n records drawn from a much larger population of N individuals, where there are background attributes A, non-confidential attributes B and confidential attributes C. Background attributes are observed and available for all N individuals in the population, whereas B and C are only available for the n records in the sample X. The first step is to construct from X a multiply-imputed population of N individuals. This population consists of the n records in X and M (the number of multiple imputations, typically between 3 and 10) matrices of (B, C) data for the N − n non-sampled individuals. The variability in the imputed values ensures, theoretically, that valid inferences can be obtained on the multiply-imputed population. A model for predicting (B, C) from A is used to multiply-impute (B, C) in the population. The choice of the model is a nontrivial matter. Once the multiply-imputed population is available, a sample Z of n records can be drawn from it whose structure looks like that of a sample of n records drawn from the original population. This can be done M times to create M replicates of (B, C) values. The result is M multiply-imputed synthetic datasets. To make sure no original data are in the synthetic datasets, it is wise to draw the samples from the multiply-imputed population excluding the n original records from it.

3.5.2 Synthetic Data by Bootstrap

Long ago, [30] proposed generating synthetic microdata by using bootstrap methods. Later, in [31] this approach was used for categorical data. The bootstrap approach bears some similarity to the data distortion by probability distribution and the multiple-imputation methods described above. Given an original microdata set X with p attributes, the data protector computes its empirical p-variate cumulative distribution function (c.d.f.) F. Now, rather than distorting the original data to obtain masked data (as done by the masking methods in Sections 3.3 and 3.4), the data protector alters (or "smoothes") the c.d.f. F to derive a similar c.d.f. F'. Finally, F' is sampled to obtain a synthetic microdata set Z.

3.5.3 Synthetic Data by Latin Hypercube Sampling

Latin Hypercube Sampling (LHS) appears in the literature as another method for generating multivariate synthetic datasets. In [46], the updated LHS technique of [33] was improved, but the proposed scheme is still time-intensive even for a moderate number of records. In [12], LHS is used along with a rank correlation refinement to reproduce both the univariate (i.e. mean and covariance) and multivariate structure (in the sense of rank correlation) of the original dataset. In a nutshell, LHS-based methods rely on iterative refinement, are time-intensive and their running time depends not only on the number of values to be reproduced, but on the starting values as well.

3.5.4 Partially Synthetic Data by Cholesky Decomposition

Generating plausible synthetic values for all attributes in a database may be difficult in practice. Thus, several authors have considered mixing actual and synthetic data. In [7], a non-iterative method for generating continuous synthetic microdata is proposed. It consists of three methods, sketched next.

Informally, suppose two sets of attributes X and Y, where the former are the confidential outcome attributes and the latter are quasi-identifier attributes. Then X are taken as independent and Y as dependent attributes. Conditional on the specific confidential attributes xi, the quasi-identifier attributes Yi are assumed to follow a multivariate normal distribution with covariance matrix Σ = {σjk} and a mean vector xiB, where B is a matrix of regression coefficients.

Method A computes a multiple regression of Y on X and the fitted attributes Y'A. Finally, attributes X and Y'A are released in place of X and Y. If a user fits a multiple regression model to (y'A, x), she will get estimates B̂A and Σ̂A which, in general, are different from the estimates B̂ and Σ̂ obtained when fitting the model to the original data (y, x). IPSO Method B modifies y'A into y'B in such a way that the estimate B̂B obtained by multiple linear regression from (y'B, x) satisfies B̂B = B̂. A more ambitious goal is to come up with a data matrix y'C such that, when a multivariate multiple regression model is fitted to (y'C, x), both sufficient statistics B̂ and Σ̂ obtained on the original data (y, x) are preserved. This is achieved by IPSO Method C.
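The sketch below conveys the flavour of IPSO Method A under one plausible reading of the description above: the quasi-identifier attributes Y are regressed on the confidential attributes X by least squares, and synthetic quasi-identifiers are released as fitted values plus simulated normal residuals with the estimated residual covariance. It is an illustration of regression-based partially synthetic data, not the exact procedure of [7]; all names and the toy coefficients are invented.

```python
import numpy as np

def ipso_like_method_a(X, Y, seed=None):
    """Regression-based partially synthetic quasi-identifiers, in the spirit of
    IPSO Method A: fit Y = X*B + E by least squares, then release X together
    with synthetic Y built from fitted values plus simulated normal residuals."""
    rng = np.random.default_rng(seed)
    X1 = np.column_stack([np.ones(len(X)), X])        # add an intercept column
    B_hat, *_ = np.linalg.lstsq(X1, Y, rcond=None)    # estimated coefficients
    fitted = X1 @ B_hat
    residuals = Y - fitted
    sigma_hat = np.cov(residuals, rowvar=False)       # estimated residual covariance
    noise = rng.multivariate_normal(np.zeros(Y.shape[1]), sigma_hat, size=len(Y))
    return fitted + noise                             # synthetic Y, released with X

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 2))                     # confidential outcome attributes
    B = np.array([[1.0, -0.5, 0.2], [0.3, 0.8, -1.0]])
    Y = X @ B + rng.normal(scale=0.5, size=(500, 3))  # quasi-identifier attributes
    Y_synth = ipso_like_method_a(X, Y, seed=1)
    B_check, *_ = np.linalg.lstsq(np.column_stack([np.ones(500), X]), Y_synth, rcond=None)
    print(np.round(B_check, 2))  # estimates from the synthetic data; in general
                                 # they differ from those obtained on (Y, X)
```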
3.5.5 Other Partially Synthetic and Hybrid Microdata Approaches

The multiple imputation approach described in [65] for creating entirely synthetic microdata can be extended to partially synthetic microdata. As a result, multiply-imputed, partially synthetic datasets are obtained that contain a mix of actual and imputed (synthetic) values. The idea is to multiply-impute confidential values and release non-confidential values without perturbation. This approach was first applied to protect the Survey of Consumer Finances [47, 48]. In Abowd and Woodcock [1, 2], this technique was adopted to protect longitudinal linked data, that is, microdata that contain observations from two or more related time periods (successive years, etc.). Methods for valid inference on this kind of partially synthetic data were developed in [61], and a non-parametric method was presented in [62] to generate multiply-imputed, partially synthetic data.

Closely related to multiply-imputed, partially synthetic microdata is model-based disclosure protection [34, 56]. In this approach, a set of confidential continuous outcome attributes is regressed on a disjoint set of non-confidential attributes; then the fitted values are released for the confidential attributes instead of the original values.

A different approach called hybrid masking was proposed in [13]. The idea is to compute masked data as a combination of original and synthetic data. Such a combination allows better control than purely synthetic data over the individual characteristics of masked records. For hybrid masking to be feasible, a rule must be used to pair one original data record with one synthetic data record. An option suggested in [13] is to go through all original data records and pair each original record with the nearest synthetic record according to some distance. Once records have been paired, [13] suggest two possible ways of combining one original record X with one synthetic record Xs: additive combination and multiplicative combination. Additive combination yields

Z = αX + (1 − α)Xs

and multiplicative combination yields

Z = X^α · Xs^(1−α)

where α is an input parameter in [0, 1] and Z is the hybrid record. [13] present empirical results comparing the hybrid approach with rank swapping and microaggregation masking (the synthetic component of hybrid data is generated using Latin Hypercube Sampling [12]).

Another approach to combining original and synthetic microdata is proposed in [70]. The idea here is to first mask an original dataset using a masking method (see Sections 3.3 and 3.4 above). Then a hill-climbing optimization heuristic is run which seeks to modify the masked data to preserve the first and second-order moments of the original dataset as much as possible without increasing the disclosure risk with respect to the initial masked data. The optimization heuristic can be modified to preserve higher-order moments, but this significantly increases computation. Also, the optimization heuristic can take as initial dataset a random dataset instead of a masked dataset; in this case, the output dataset is purely synthetic.
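A short sketch of the hybrid-masking combination rules of [13]: each original record is paired with its nearest synthetic record (here by squared Euclidean distance) and the two are combined additively or multiplicatively with parameter α; the multiplicative rule assumes positive attribute values. How the synthetic records are produced is left to any of the generators discussed above; the function below is an illustration, not the implementation evaluated in [13].

```python
import numpy as np

def hybrid_mask(X_orig, X_synth, alpha=0.5, multiplicative=False):
    """Pair each original record with its nearest synthetic record and combine
    them: additively, Z = alpha*X + (1-alpha)*Xs, or multiplicatively,
    Z = X**alpha * Xs**(1-alpha) (the latter assumes positive values)."""
    X_orig = np.asarray(X_orig, dtype=float)
    X_synth = np.asarray(X_synth, dtype=float)
    Z = np.empty_like(X_orig)
    for i, x in enumerate(X_orig):
        d = ((X_synth - x) ** 2).sum(axis=1)     # squared Euclidean distances
        xs = X_synth[np.argmin(d)]               # nearest synthetic record
        if multiplicative:
            Z[i] = x ** alpha * xs ** (1.0 - alpha)
        else:
            Z[i] = alpha * x + (1.0 - alpha) * xs
    return Z
```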
3.5.6 Pros and Cons of Synthetic Microdata

As pointed out in Section 3.2, synthetic data are appealing in that, at first glance, they seem to circumvent the re-identification problem: since published records are invented and do not derive from any original record, it might be concluded that no individual can complain of having been re-identified. At a closer look this advantage is less clear. If, by chance, a published synthetic record matches a particular citizen's non-confidential attributes (age, marital status, place of residence, etc.) and confidential attributes (salary, mortgage, etc.), re-identification using the non-confidential attributes is easy and that citizen may feel that his confidential attributes have been unduly revealed. In that case, the citizen is unlikely to be happy with or even understand the explanation that the record was synthetically generated.

On the other hand, limited data utility is another problem of synthetic data. Only the statistical properties explicitly captured by the model used by the data protector are preserved. A logical question at this point is why not directly publish the statistics one wants to preserve rather than release a synthetic microdata set.

One possible justification for synthetic microdata would be if valid analyses could be obtained on a number of subdomains, i.e. if similar results were obtained in a number of subsets of the original dataset and the corresponding subsets of the synthetic dataset. Partially synthetic or hybrid microdata are more likely to succeed in staying useful for subdomain analysis. However, when using partially synthetic or hybrid microdata, we lose the attractive feature of purely synthetic data that the number of records in the protected (synthetic) dataset is independent of the number of records in the original dataset.

3.6 Trading off Information Loss and Disclosure Risk

Sections 3.2 through 3.5 have presented a plethora of methods to protect microdata. To complicate things further, most of such methods are parametric (e.g., in microaggregation, one parameter is the minimum number of records in a cluster), so the user must go through two choices rather than one: a primary choice to select a method and a secondary choice to select parameters for the method to be used. To help reduce the embarras du choix, some guidelines are needed.

3.6.1 Score Construction

The mission of SDC is to modify data in such a way that sufficient protection is provided at minimum information loss; this suggests that a good SDC method is one achieving a good tradeoff between disclosure risk and information loss. Following this idea, [21] proposed a score for method performance rating based on the average of information loss and disclosure risk measures. For each method M and parameterization P, the following score is computed:

Score(V, V') = (IL(V, V') + DR(V, V')) / 2

where IL is an information loss measure, DR is a disclosure risk measure and V' is the protected dataset obtained after applying method M with parameterization P to an original dataset V.

In [21] and [19], IL and DR were computed using a weighted combination of several information loss and disclosure risk measures. With the resulting score, a ranking of masking methods (and their parameterizations) was obtained. In [81] the line of the above two papers was followed to rank a different set of methods using a slightly different score. To illustrate how a score can be constructed, we next describe the particular score used in [21].
Example 3.4 Let X and X' be matrices representing the original and the protected dataset, respectively, where all attributes are numerical. Let V and R be the covariance matrix and the correlation matrix of X, respectively; let X̄ be the vector of attribute averages for X and let S be the diagonal of V. Define V', R', X̄' and S' analogously from X'. The information loss IL is computed by averaging the mean variations of X − X', X̄ − X̄', V − V', S − S', and the mean absolute error of R − R', and multiplying the resulting average by 100. Thus, we obtain the following expression for information loss:

IL = \frac{100}{5}\left(
\frac{\sum_{j=1}^{p}\sum_{i=1}^{n}\frac{|x_{ij}-x'_{ij}|}{|x_{ij}|}}{np}
+ \frac{\sum_{j=1}^{p}\frac{|\bar{x}_{j}-\bar{x}'_{j}|}{|\bar{x}_{j}|}}{p}
+ \frac{\sum_{j=1}^{p}\sum_{1\le i\le j}\frac{|v_{ij}-v'_{ij}|}{|v_{ij}|}}{p(p+1)/2}
+ \frac{\sum_{j=1}^{p}\frac{|v_{jj}-v'_{jj}|}{|v_{jj}|}}{p}
+ \frac{\sum_{j=1}^{p}\sum_{1\le i\le j}|r_{ij}-r'_{ij}|}{p(p-1)/2}
\right)

The expression of the overall score is obtained by combining information loss and disclosure risk as follows:

Score = \frac{IL + \frac{(0.5\,DLD + 0.5\,PLD) + ID}{2}}{2}

Here, DLD (Distance Linkage Disclosure risk) is the percentage of correctly linked records using distance-based record linkage [19], PLD (Probabilistic Linkage Disclosure risk) is the percentage of correctly linked records using probabilistic linkage [29], ID (Interval Disclosure) is the percentage of original records falling in the intervals around their corresponding masked values, and IL is the information loss measure defined above.

Based on the above score, [21] found that, for the benchmark datasets and the intruder's external information they used, two good performers among the set of methods and parameterizations they tried were: i) rank swapping with parameter p around 15 (see description above); ii) multivariate microaggregation on unprojected data taking groups of three attributes at a time (Algorithm 3.1 with partitioning of the set of attributes).

Using a score makes it possible to regard the selection of a masking method and its parameters as an optimization problem. This idea was first used in the above-mentioned contribution [70]. In that paper, a masking method was applied to the original data file and then a post-masking optimization procedure was applied to decrease the score obtained.

On the negative side, no specific score weighting can do justice to all methods. Thus, when ranking methods, the values of all measures of information loss and disclosure risk should be supplied along with the overall score.
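The information loss component IL of the score in Example 3.4 is easy to compute directly from the two data matrices; the sketch below does so and combines it with externally supplied DLD, PLD and ID percentages (computing those requires record-linkage machinery and is not shown here). It assumes strictly non-zero original values, since the relative variations divide by them.

```python
import numpy as np

def information_loss(X, Xp):
    """IL component of the score in Example 3.4: mean relative variation of the
    data, attribute means, covariances and variances, plus the mean absolute
    error of the correlations, averaged and scaled by 100."""
    X, Xp = np.asarray(X, dtype=float), np.asarray(Xp, dtype=float)
    n, p = X.shape
    rel = lambda a, b: np.abs(a - b) / np.abs(a)
    V, Vp = np.cov(X, rowvar=False), np.cov(Xp, rowvar=False)
    R, Rp = np.corrcoef(X, rowvar=False), np.corrcoef(Xp, rowvar=False)
    iu = np.triu_indices(p)                       # upper triangle incl. diagonal
    iu_strict = np.triu_indices(p, k=1)           # strictly above the diagonal
    terms = [
        rel(X, Xp).mean(),                        # individual values
        rel(X.mean(axis=0), Xp.mean(axis=0)).mean(),
        rel(V[iu], Vp[iu]).mean(),                # covariances
        rel(np.diag(V), np.diag(Vp)).mean(),      # variances
        np.abs(R[iu_strict] - Rp[iu_strict]).mean(),
    ]
    return 100.0 * np.mean(terms)

def score(X, Xp, dld, pld, id_disclosure):
    """Overall score = (IL + DR)/2 with DR = ((0.5*DLD + 0.5*PLD) + ID)/2."""
    dr = ((0.5 * dld + 0.5 * pld) + id_disclosure) / 2.0
    return (information_loss(X, Xp) + dr) / 2.0
```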
3.6.2 R-U Maps

A tool which may be enlightening when trying to construct a score or, more generally, to optimize the tradeoff between information loss and disclosure risk is a graphical representation of pairs of measures (disclosure risk, information loss) or their equivalents (disclosure risk, data utility). Such maps are called R-U confidentiality maps [24, 25]. Here, R stands for disclosure risk and U for data utility. According to [25], "in its most basic form, an R-U confidentiality map is the set of paired values (R, U), of disclosure risk and data utility that correspond to various strategies for data release" (e.g., variations on a parameter). Such (R, U) pairs are typically plotted in a two-dimensional graph, so that the user can easily grasp the influence of a particular method and/or parameter choice.

3.6.3 k-anonymity

A different approach to facing the conflict between information loss and disclosure risk is suggested by Samarati and Sweeney [67, 66, 73, 74]. A protected dataset is said to satisfy k-anonymity for k > 1 if, for each combination of quasi-identifier values (e.g. address, age, gender, etc.), at least k records exist in the dataset sharing that combination. Now if, for a given k, k-anonymity is assumed to be enough protection, one can concentrate on minimizing information loss with the only constraint that k-anonymity should be satisfied. This is a clean way of solving the tension between data protection and data utility. Since k-anonymity is usually achieved via generalization (equivalent to global recoding, as said above) and local suppression, minimizing information loss usually translates to reducing the number and/or the magnitude of suppressions.

k-anonymity bears some resemblance to the underlying principle of microaggregation and is a useful concept because quasi-identifiers are usually categorical or can be categorized, i.e. they take values in a finite (and ideally reduced) range. However, re-identification is not necessarily based on categorical quasi-identifiers: sometimes, numerical outcome attributes (which are continuous and often cannot be categorized) give enough clues for re-identification (see the discussion on the MASSC method above). Microaggregation was suggested in [23] as a possible way to achieve k-anonymity for numerical, ordinal and nominal attributes. A similar idea called data condensation had also been independently proposed in [4] to achieve k-anonymity for the specific case of numerical attributes.

Another connection between k-anonymity and microaggregation is the NP-hardness of solving them optimally. Satisfying k-anonymity with minimal data modification has been shown to be NP-hard in [52], which parallels the NP-hardness of optimal multivariate microaggregation proven in [55].

3.7 Conclusions and Research Directions

Inference control methods for privacy-preserving data mining are a hot research topic progressing very fast. There are still many open issues, some of which can hopefully be solved with further research and some of which are likely to stay open due to the inherent nature of SDC. We first list some of the issues that we feel can and should be settled in the near future:

Identifying a comprehensive listing of data uses (e.g. regression models, association rules, etc.) that would allow the definition of data use-specific information loss measures broadly accepted by the community; those new measures could complement and/or replace the generic measures currently used. Work in this line was started in Europe in 2006 under the CENEX SDC project sponsored by Eurostat.

Devising disclosure risk assessment procedures which are as universally applicable as record linkage while being less greedy in computational terms.

Identifying the external data sources that intruders can typically access in order to attempt re-identification for each domain of application. This would help data protectors figure out in more realistic terms which are the disclosure scenarios they should protect data against.

Creating one or several benchmarks to assess the performance of SDC methods.
Benchmark creation is currently hampered by the confidentiality of the original datasets to be protected. Data protectors should agree on a collection of non-confidential, original-looking datasets (financial datasets, population datasets, etc.) which can be used by anybody to compare the performance of SDC methods. The benchmark should also incorporate state-of-the-art disclosure risk assessment methods, which requires continuous update and maintenance.

There are other issues which, in our view, are less likely to be resolved in the near future, due to the very nature of SDC methods. As pointed out in [22], if an intruder knows the SDC algorithm used to create a protected data set, he can mount algorithm-specific re-identification attacks which can disclose more confidential information than conventional data mining attacks. Keeping the SDC algorithm used secret would seem a solution, but in many cases the protected dataset itself gives some clues on the SDC algorithm used to produce it. Such is the case for a rounded, microaggregated or partially suppressed microdata set. Thus, it is unclear to what extent the SDC algorithm used can be kept secret. Other data security areas where slightly distorted data are sent to a recipient who is legitimate but untrusted also share the same concerns about the secrecy of the protection algorithms in use. This is the case of watermarking. Teaming up with those areas sharing similar problems is probably one clever line of action for SDC.

References

[1] J. M. Abowd and S. D. Woodcock. Disclosure limitation in longitudinal linked tables. In P. Doyle, J. I. Lane, J. J. Theeuwes, and L. V. Zayatz, editors, Confidentiality, Disclosure and Data Access: Theory and Practical Applications for Statistical Agencies, pages 215–278, Amsterdam, 2001. North-Holland.

[2] J. M. Abowd and S. D. Woodcock. Multiply-imputing confidential characteristics and file links in longitudinal linked data. In J. Domingo-Ferrer and V. Torra, editors, Privacy in Statistical Databases, volume 3050 of Lecture Notes in Computer Science, pages 290–297, Berlin Heidelberg, 2004. Springer.

[3] N. R. Adam and J. C. Wortmann. Security-control for statistical databases: a comparative study. ACM Computing Surveys, 21(4):515–556, 1989.

[4] C. C. Aggarwal and P. S. Yu. A condensation approach to privacy preserving data mining. In E. Bertino, S. Christodoulakis, D. Plexousakis, V. Christophides, M. Koubarakis, K. Böhm, E. Ferrari, editors, Advances in Database Technology - EDBT 2004, volume 2992 of Lecture Notes in Computer Science, pages 183–199, Berlin Heidelberg, 2004. Springer.

[5] R. Brand. Microdata protection through noise addition. In J. Domingo-Ferrer, editor, Inference Control in Statistical Databases, volume 2316 of Lecture Notes in Computer Science, pages 97–116, Berlin Heidelberg, 2002. Springer.

[6] R. Brand. Tests of the applicability of Sullivan's algorithm to synthetic data and real business data in official statistics, 2002. European Project IST-2000-25069 CASC, Deliverable 1.1-D1, http://neon.vb.cbs.nl/casc.

[7] J. Burridge. Information preserving statistical obfuscation. Statistics and Computing, 13:321–327, 2003.

[8] CASC. Computational aspects of statistical confidentiality, 2004. European project IST-2000-25069 CASC, 5th FP, 2001-2004, http://neon.vb.cbs.nl/casc.

[9] F. Y. Chin and G. Ozsoyoglu.
Auditing and inference control in statistical databases. IEEE Transactions on Software Engineering, SE-8:574–582, 1982.

[10] L. H. Cox and J. J. Kim. Effects of rounding on the quality and confidentiality of statistical data. In J. Domingo-Ferrer and L. Franconi, editors, Privacy in Statistical Databases-PSD 2006, volume 4302 of Lecture Notes in Computer Science, pages 48–56, Berlin Heidelberg, 2006.

[11] T. Dalenius and S. P. Reiss. Data-swapping: a technique for disclosure control (extended abstract). In Proc. of the ASA Section on Survey Research Methods, pages 191–194, Washington DC, 1978. American Statistical Association.

[12] R. Dandekar, M. Cohen, and N. Kirkendall. Sensitive micro data protection using latin hypercube sampling technique. In J. Domingo-Ferrer, editor, Inference Control in Statistical Databases, volume 2316 of Lecture Notes in Computer Science, pages 245–253, Berlin Heidelberg, 2002. Springer.

[13] R. Dandekar, J. Domingo-Ferrer, and F. Sebé. LHS-based hybrid microdata vs rank swapping and microaggregation for numeric microdata protection. In J. Domingo-Ferrer, editor, Inference Control in Statistical Databases, volume 2316 of Lecture Notes in Computer Science, pages 153–162, Berlin Heidelberg, 2002. Springer.

[14] P.-P. de Wolf. Risk, utility and PRAM. In J. Domingo-Ferrer and L. Franconi, editors, Privacy in Statistical Databases-PSD 2006, volume 4302 of Lecture Notes in Computer Science, pages 189–204, Berlin Heidelberg, 2006.

[15] D. Defays and P. Nanopoulos. Panels of enterprises and confidentiality: the small aggregates method. In Proc. of 92 Symposium on Design and Analysis of Longitudinal Surveys, pages 195–204, Ottawa, 1993. Statistics Canada.

[16] A. G. DeWaal and L. C. R. J. Willenborg. Global recodings and local suppressions in microdata sets. In Proceedings of Statistics Canada Symposium'95, pages 121–132, Ottawa, 1995. Statistics Canada.

[17] J. Domingo-Ferrer and J. M. Mateo-Sanz. On resampling for statistical confidentiality in contingency tables. Computers & Mathematics with Applications, 38:13–32, 1999.

[18] J. Domingo-Ferrer and J. M. Mateo-Sanz. Practical data-oriented microaggregation for statistical disclosure control. IEEE Transactions on Knowledge and Data Engineering, 14(1):189–201, 2002.

[19] J. Domingo-Ferrer, J. M. Mateo-Sanz, and V. Torra. Comparing SDC methods for microdata on the basis of information loss and disclosure risk. In Pre-proceedings of ETK-NTTS'2001 (vol. 2), pages 807–826, Luxemburg, 2001. Eurostat.

[20] J. Domingo-Ferrer, F. Sebé, and A. Solanas. A polynomial-time approximation to optimal multivariate microaggregation. Computers & Mathematics with Applications, 2007. (To appear).

[21] J. Domingo-Ferrer and V. Torra. A quantitative comparison of disclosure control methods for microdata. In P. Doyle, J. I. Lane, J. J. M. Theeuwes, and L. Zayatz, editors, Confidentiality, Disclosure and Data Access: Theory and Practical Applications for Statistical Agencies, pages 111–134, Amsterdam, 2001. North-Holland. http://vneumann.etse.urv.es/publications/bcpi.

[22] J. Domingo-Ferrer and V. Torra. Algorithmic data mining against privacy protection methods for statistical databases. Manuscript, 2004.

[23] J. Domingo-Ferrer and V. Torra. Ordinal, continuous and heterogeneous k-anonymity through microaggregation. Data Mining and Knowledge Discovery, 11(2):195–212, 2005.

[24] G. T. Duncan, S. E. Fienberg, R. Krishnan, R.
Padman, and S. F. Roehrig. Disclosure limitation methods and information loss for tabular data. In P. Doyle, J. I. Lane, J. J. Theeuwes, and L. V. Zayatz, editors, Confidentiality, Disclosure and Data Access: Theory and Practical Applications for Statistical Agencies, pages 135–166, Amsterdam, 2001. North-Holland.

[25] G. T. Duncan, S. A. Keller-McNulty, and S. L. Stokes. Disclosure risk vs. data utility: The R-U confidentiality map, 2001.

[26] G. T. Duncan and S. Mukherjee. Optimal disclosure limitation strategy in statistical databases: deterring tracker attacks through additive noise. Journal of the American Statistical Association, 95:720–729, 2000.

[27] G. T. Duncan and R. W. Pearson. Enhancing access to microdata while protecting confidentiality: prospects for the future. Statistical Science, 6:219–239, 1991.

[28] E.U.Privacy. European privacy regulations, 2004. http://europa.eu.int/comm/internal market/privacy/law en.htm.

[29] I. P. Fellegi and A. B. Sunter. A theory for record linkage. Journal of the American Statistical Association, 64(328):1183–1210, 1969.

[30] S. E. Fienberg. A radical proposal for the provision of micro-data samples and the preservation of confidentiality. Technical Report 611, Carnegie Mellon University Department of Statistics, 1994.

[31] S. E. Fienberg, U. E. Makov, and R. J. Steele. Disclosure limitation using perturbation and related methods for categorical data. Journal of Official Statistics, 14(4):485–502, 1998.

[32] S. E. Fienberg and J. McIntyre. Data swapping: variations on a theme by Dalenius and Reiss. In J. Domingo-Ferrer and V. Torra, editors, Privacy in Statistical Databases, volume 3050 of Lecture Notes in Computer Science, pages 14–29, Berlin Heidelberg, 2004. Springer.

[33] A. Florian. An efficient sampling scheme: updated latin hypercube sampling. Probabilistic Engineering Mechanics, 7(2):123–130, 1992.

[34] L. Franconi and J. Stander. A model based method for disclosure limitation of business microdata. Journal of the Royal Statistical Society D - Statistician, 51:1–11, 2002.

[35] R. Garfinkel, R. Gopal, and D. Rice. New approaches to disclosure limitation while answering queries to a database: protecting numerical confidential data against insider threat based on data and algorithms, 2004. Manuscript. Available at http://www-eio.upc.es/seminar/04/garfinkel.pdf.

[36] S. Giessing. Survey on methods for tabular data protection in Argus. In J. Domingo-Ferrer and V. Torra, editors, Privacy in Statistical Databases, volume 3050 of Lecture Notes in Computer Science, pages 1–13, Berlin Heidelberg, 2004. Springer.

[37] R. Gopal, R. Garfinkel, and P. Goes. Confidentiality via camouflage: the CVC approach to disclosure limitation when answering queries to databases. Operations Research, 50:501–516, 2002.

[38] R. Gopal, P. Goes, and R. Garfinkel. Interval protection of confidential information in a database. INFORMS Journal on Computing, 10:309–322, 1998.

[39] J. M. Gouweleeuw, P. Kooiman, L. C. R. J. Willenborg, and P.-P. DeWolf. Post randomisation for statistical disclosure control: Theory and implementation, 1997. Research paper no. 9731 (Voorburg: Statistics Netherlands).

[40] B. Greenberg. Rank swapping for ordinal data, 1987. Washington, DC: U. S. Bureau of the Census (unpublished manuscript).

[41] S. L. Hansen and S. Mukherjee. A polynomial algorithm for optimal univariate microaggregation. IEEE Transactions on Knowledge and Data Engineering, 15(4):1043–1044, 2003.
[42] G. R. Heer. A bootstrap procedure to preserve statistical confidentiality in contingency tables. In D. Lievesley, editor, Proc. of the International Seminar on Statistical Confidentiality, pages 261–271, Luxemburg, 1993. Office for Official Publications of the European Communities.

[43] HIPAA. Health Insurance Portability and Accountability Act, 2004. http://www.hhs.gov/ocr/hipaa/.

[44] A. Hundepool, A. Van de Wetering, R. Ramaswamy, L. Franconi, A. Capobianchi, P.-P. DeWolf, J. Domingo-Ferrer, V. Torra, R. Brand, and S. Giessing. µ-ARGUS version 4.0 Software and User's Manual. Statistics Netherlands, Voorburg NL, May 2005. http://neon.vb.cbs.nl/casc.

[45] A. Hundepool, J. Domingo-Ferrer, L. Franconi, S. Giessing, R. Lenz, J. Longhurst, E. Schulte-Nordholt, G. Seri, and P.-P. DeWolf. Handbook on Statistical Disclosure Control (version 1.0). Eurostat (CENEX SDC Project Deliverable), 2006.

[46] D. E. Huntington and C. S. Lyrintzis. Improvements to and limitations of latin hypercube sampling. Probabilistic Engineering Mechanics, 13(4):245–253, 1998.

[47] A. B. Kennickell. Multiple imputation and disclosure control: the case of the 1995 survey of consumer finances. In Record Linkage Techniques, pages 248–267, Washington DC, 1999. National Academy Press.

[48] A. B. Kennickell. Multiple imputation and disclosure protection: the case of the 1995 survey of consumer finances. In J. Domingo-Ferrer, editor, Statistical Data Protection, pages 248–267, Luxemburg, 1999. Office for Official Publications of the European Communities.

[49] J. J. Kim. A method for limiting disclosure in microdata based on random noise and transformation. In Proceedings of the Section on Survey Research Methods, pages 303–308, Alexandria VA, 1986. American Statistical Association.

[50] M. Laszlo and S. Mukherjee. Minimum spanning tree partitioning algorithm for microaggregation. IEEE Transactions on Knowledge and Data Engineering, 17(7):902–911, 2005.

[51] J. M. Mateo-Sanz and J. Domingo-Ferrer. A method for data-oriented multivariate microaggregation. In J. Domingo-Ferrer, editor, Statistical Data Protection, pages 89–99, Luxemburg, 1999. Office for Official Publications of the European Communities.

[52] A. Meyerson and R. Williams. General k-anonymization is hard. Technical Report 03-113, Carnegie Mellon School of Computer Science (USA), 2003.

[53] R. Moore. Controlled data swapping techniques for masking public use microdata sets, 1996. U. S. Bureau of the Census, Washington, DC (unpublished manuscript).

[54] K. Muralidhar, D. Batra, and P. J. Kirs. Accessibility, security and accuracy in statistical databases: the case for the multiplicative fixed data perturbation approach. Management Science, 41:1549–1564, 1995.

[55] A. Oganian and J. Domingo-Ferrer. On the complexity of optimal microaggregation for statistical disclosure control. Statistical Journal of the United Nations Economic Commission for Europe, 18(4):345–354, 2001.

[56] S. Polettini, L. Franconi, and J. Stander. Model based disclosure protection. In J. Domingo-Ferrer, editor, Inference Control in Statistical Databases, volume 2316 of Lecture Notes in Computer Science, pages 83–96, Berlin Heidelberg, 2002. Springer.

[57] T. J. Raghunathan, J. P. Reiter, and D. Rubin. Multiple imputation for statistical disclosure limitation. Journal of Official Statistics, 19(1):1–16, 2003.

[58] S. P. Reiss.
Practical data-swapping: the first steps. ACM Transactions on Database Systems, 9:20–37, 1984.

[59] S. P. Reiss, M. J. Post, and T. Dalenius. Non-reversible privacy transformations. In Proceedings of the ACM Symposium on Principles of Database Systems, pages 139–146, Los Angeles, CA, 1982. ACM.

[60] J. P. Reiter. Satisfying disclosure restrictions with synthetic data sets. Journal of Official Statistics, 18(4):531–544, 2002.

[61] J. P. Reiter. Inference for partially synthetic, public use microdata sets. Survey Methodology, 29:181–188, 2003.

[62] J. P. Reiter. Using CART to generate partially synthetic public use microdata, 2003. Duke University working paper.

[63] J. P. Reiter. Releasing multiply-imputed, synthetic public use microdata: An illustration and empirical study. Journal of the Royal Statistical Society, Series A, 168:185–205, 2005.

[64] J. P. Reiter. Significance tests for multi-component estimands from multiply-imputed, synthetic microdata. Journal of Statistical Planning and Inference, 131(2):365–377, 2005.

[65] D. B. Rubin. Discussion of statistical disclosure limitation. Journal of Official Statistics, 9(2):461–468, 1993.

[66] P. Samarati. Protecting respondents' identities in microdata release. IEEE Transactions on Knowledge and Data Engineering, 13(6):1010–1027, 2001.

[67] P. Samarati and L. Sweeney. Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression. Technical report, SRI International, 1998.

[68] G. Sande. Exact and approximate methods for data directed microaggregation in one or more dimensions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5):459–476, 2002.

[69] J. Schlörer. Disclosure from statistical databases: quantitative aspects of trackers. ACM Transactions on Database Systems, 5:467–492, 1980.

[70] F. Sebé, J. Domingo-Ferrer, J. M. Mateo-Sanz, and V. Torra. Post-masking optimization of the tradeoff between information loss and disclosure risk in masked microdata sets. In J. Domingo-Ferrer, editor, Inference Control in Statistical Databases, volume 2316 of Lecture Notes in Computer Science, pages 163–171, Berlin Heidelberg, 2002. Springer.

[71] A. C. Singh, F. Yu, and G. H. Dunteman. MASSC: A new data mask for limiting statistical information loss and disclosure. In H. Linden, J. Riecan, and L. Belsby, editors, Work Session on Statistical Data Confidentiality 2003, Monographs in Official Statistics, pages 373–394, Luxemburg, 2004. Eurostat.

[72] G. R. Sullivan. The Use of Added Error to Avoid Disclosure in Microdata Releases. PhD thesis, Iowa State University, 1989.

[73] L. Sweeney. Achieving k-anonymity privacy protection using generalization and suppression. International Journal of Uncertainty, Fuzziness and Knowledge Based Systems, 10(5):571–588, 2002.

[74] L. Sweeney. k-anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge Based Systems, 10(5):557–570, 2002.

[75] V. Torra. Microaggregation for categorical variables: a median based approach. In J. Domingo-Ferrer and V. Torra, editors, Privacy in Statistical Databases, volume 3050 of Lecture Notes in Computer Science, pages 162–174, Berlin Heidelberg, 2004. Springer.

[76] J. F. Traub, Y. Yemini, and H. Wozniakowski. The statistical security of a statistical database. ACM Transactions on Database Systems, 9:672–679, 1984.

[77] U.S.Privacy. U.S.
privacy regulations, 2004. http://www.media-awareness.ca/english/issues/privacy/us legislation privacy.cfm.

[78] L. Willenborg and T. DeWaal. Statistical Disclosure Control in Practice. Springer-Verlag, New York, 1996.

[79] L. Willenborg and T. DeWaal. Elements of Statistical Disclosure Control. Springer-Verlag, New York, 2001.

[80] W. E. Winkler. Re-identification methods for masked microdata. In J. Domingo-Ferrer and V. Torra, editors, Privacy in Statistical Databases, volume 3050 of Lecture Notes in Computer Science, pages 216–230, Berlin Heidelberg, 2004. Springer.

[81] W. E. Yancey, W. E. Winkler, and R. H. Creecy. Disclosure risk assessment in perturbative microdata protection. In J. Domingo-Ferrer, editor, Inference Control in Statistical Databases, volume 2316 of Lecture Notes in Computer Science, pages 135–152, Berlin Heidelberg, 2002. Springer.

Chapter 4

Measures of Anonymity

Suresh Venkatasubramanian
School of Computing, University of Utah
suresh@cs.utah.edu

Abstract
To design a privacy-preserving data publishing system, we must first quantify the very notion of privacy, or information loss. In the past few years, there has been a proliferation of measures of privacy, some based on statistical considerations, others based on Bayesian or information-theoretic notions of information, and even others designed around the limitations of bounded adversaries. In this chapter, we review the various approaches to capturing privacy. We will find that although one can define privacy from different standpoints, there are many structural similarities in the way different approaches have evolved. It will also become clear that the notions of privacy and utility (the useful information one can extract from published data) are intertwined in ways that are yet to be fully resolved.

Keywords: Measures of privacy, statistics, Bayes inference, information theory, cryptography.

4.1 Introduction

In this chapter, we survey the various approaches that have been proposed to measure privacy (and the loss of privacy). Since most privacy concerns (especially those related to health-care information [44]) are raised in the context of legal concerns, it is instructive to view privacy from a legal perspective, rather than from purely technical considerations.

It is beyond the scope of this survey to review the legal interpretations of privacy [11]. However, one essay on privacy that appears directly relevant (and has inspired at least one paper surveyed here) is the view of privacy in terms of access that others have to us and our information, presented by Ruth Gavison [23]. In her view, a general definition of privacy must be one that is measurable, of value, and actionable. The first property needs no explanation; the second means that the entity being considered private must be valuable; and the third property argues that from a legal perspective, only those losses of privacy are interesting that can be prosecuted. This survey, and much of the research on privacy, concerns itself with the measuring of privacy. The second property is implicit in most discussion of measures of privacy: authors propose basic data items that are valuable and must be protected (fields in a record, background knowledge about a distribution, and so on). The third aspect of privacy is of a legal nature and is not directly relevant to our discussion here.

4.1.1 What is Privacy?
To measure privacy, we must define it. This, in essence, is the hardest part of the problem of measuring privacy, and is the reason for the plethora of proposed measures. Once again, we turn to Gavison for some insight. In her paper, she argues that there are three inter-related kinds of privacy: secrecy, anonymity, and solitude. Secrecy concerns information that others may gather about us. Anonymity addresses how much "in the public gaze" we are, and solitude measures the degree to which others have physical access to us. From the perspective of protecting information, solitude relates to the physical protection of data, and is again beyond the purview of this article. Secrecy and anonymity are useful ways of thinking about privacy, and we will see that measures of privacy preservation can be viewed as falling mostly into one of these two categories.

If we think of privacy as secrecy (of our information), then a loss of privacy is leakage of that information. This can be measured through various means: the probability of a data item being accessed, the change in knowledge of an adversary upon seeing the data, and so on. If we think in terms of anonymity, then privacy leakage is measured in terms of the size of the blurring accompanying the release of data: the more the blurring, the more anonymous the data.

Privacy versus Utility. It would seem that the most effective way to preserve privacy of information would be to encrypt it. Users wishing to access the data could be given keys, and this would summarily solve all privacy issues. Unfortunately, this approach does not work in a data publishing scenario, which is the primary setting for much work on privacy preservation.

The key notion here is one of utility: the goal of privacy preservation measures is to secure access to confidential information while at the same time releasing aggregate information to the public. One common example used is that of the U.S. Census. The U.S. Census wishes to publish survey data from the census so that demographers and other public policy experts can analyze trends in the general population. On the other hand, they wish to avoid releasing information that could be used to infer facts about specific individuals; the case of the AOL search query release [34] indicates the dangers of releasing data without adequately anonymizing it.

It is this idea of utility that makes cryptographic approaches to privacy preservation problematic. As Dwork points out in her overview of differential privacy [16], a typical cryptographic scenario involves two communicating parties and an adversary attempting to eavesdrop. In the scenarios we consider, the adversary is the same as the recipient of the message, making security guarantees much harder to prove.

Privacy and utility are fundamentally in tension with each other. We can achieve perfect privacy by not releasing any data, but this solution has no utility. Thus, any discussion of privacy measures is incomplete without a corresponding discussion of utility measures. Traditionally, the two concepts have been measured using different yardsticks, and we are now beginning to see attempts to unify the two notions along a common axis of measurement.

A Note on Terminology. Various terms have been used in the literature to describe privacy and privacy loss. Anonymization is a popular term, often used to describe methods like k-anonymity and its successors.
Information loss is used by some of the information-theoretic methods, and privacy leakage is another common expression describing the loss of privacy. We will use these terms interchangeably.

4.1.2 Data Anonymization Methods

The measures of anonymity we discuss here are usually defined with respect to a particular data anonymization method. There are three primary methods in use today: random perturbation, generalization and suppression. In what follows, we discuss these methods.

Perhaps the most natural way of anonymizing numerical data is to perturb it. Rather than reporting a value x for an attribute, we report the value x̃ = x + r, where r is a random value drawn from an appropriate (usually bias-free) distribution. One must be careful with this approach, however; if the value r is chosen independently each time x is queried, then simple averaging will eliminate its effect. Since introducing bias would affect any statistical analysis one might wish to perform on the data, a preferred method is to fix the perturbations in advance.

If the attribute x has a domain other than R, then perturbation is more complex. As long as the data lies in a continuous metric space (like R^d for instance), a perturbation is well defined. If the data is categorical, however, other methods, such as deleting items and inserting other, randomly chosen items, must be employed. We will see more of such methods below.

It is often useful to distinguish between two kinds of perturbation. Input perturbation is the process of perturbing the source data itself, and returning correct answers to queries on this perturbed data. Output perturbation, on the other hand, perturbs the answers sent to a query, rather than modifying the input itself.

The other method for anonymizing data is generalization, which is often used in conjunction with suppression. Suppose the data domain possesses a natural hierarchical structure. For example, ZIP codes can be thought of as the leaves of a hierarchy, where 8411* is the parent of 84117, and 84* is an ancestor of 8411*, and so on. In the presence of such a hierarchy, attributes can be generalized by replacing their values with that of their (common) parent. Again returning to the ZIP code example, ZIP codes of the form 84117, 84118, 84120 might all be replaced by the generic ZIP 841*. The degree of perturbation can then be measured in terms of the height of the resulting generalization above the leaf values.

Data suppression, very simply, is the omission of data. For example, a set of database tuples might all have ZIP code fields of the form 84117 or 84118, with the exception of a few tuples that have a ZIP code field value of 90210. In this case, the outlier tuples can be suppressed in order to construct a valid and compact generalization. Another way of performing data suppression is to replace a field with a generic identifier for that field. In the above example, the ZIP code field value of 90210 might be replaced by a null value ⊥ZIP.

Another method of data anonymization, proposed by Zhang et al. [50], is to permute the data. Given a table consisting of sensitive and identifying attributes, their approach is to permute the projection of the table consisting of the sensitive attributes; the purpose of doing this is to retain the aggregate properties of the table, while destroying the link between identifying and sensitive attributes that could lead to a privacy leakage.
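The remark above that independently re-drawn noise can be averaged away, whereas perturbations fixed in advance cannot, is easy to see numerically; the toy values in the sketch below are purely illustrative and not drawn from any dataset discussed in this chapter.

```python
import numpy as np

rng = np.random.default_rng(0)
true_salary = 50_000.0

# Fresh noise on every query: averaging many answers recovers the true value.
fresh = np.array([true_salary + rng.normal(0, 5_000) for _ in range(10_000)])
print(round(fresh.mean()))   # close to 50000 -- the noise washes out

# Fixed perturbation: the noise is drawn once and reused for every query,
# so repeated querying reveals nothing beyond the single perturbed value.
fixed_offset = rng.normal(0, 5_000)
fixed = np.array([true_salary + fixed_offset for _ in range(10_000)])
print(round(fixed.mean()))   # equals the one perturbed value, not 50000
```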
4.1.3 A Classification of Methods

Broadly speaking, methods for measuring privacy can be divided into three distinct categories. Early work on statistical databases measured privacy in terms of the variance of key perturbed variables: the larger the variance, the better the privacy of the perturbed data. We refer to these approaches as statistical methods.

Much of the more recent work on privacy measures starts with the observation that statistical methods are unable to quantify the idea of background information that an adversary may possess. As a consequence, researchers have employed tools from information theory and Bayesian analysis to quantify more precisely notions of information transfer and loss. We will describe these methods under the general heading of probabilistic methods.

Almost in parallel with the development of probabilistic methods, some researchers have attacked the problem of privacy from a computational angle. In short, rather than relying on statistical or probabilistic estimates for the amount of information leaked, these measures start from the idea of a resource-bounded adversary, and measure privacy in terms of the amount of information accessible by such an adversary. This approach is reminiscent of cryptographic approaches, but for the reasons outlined above is substantially more difficult.

An Important Omission: Secure Multiparty Computation. One important technique for preserving data privacy is the approach from cryptography called secure multi-party computation (SMC). The simplest version of this framework is the so-called 'Millionaires Problem' [49]: two millionaires wish to know who is richer; however, they do not want to find out inadvertently any additional information about each other's wealth. How can they carry out such a conversation? In general, an SMC scenario is described by N clients, each of whom owns some private data, and a public function f(x1, ..., xN) that needs to be computed from the shared data without any of the clients revealing their private information.

Notice that in an SMC setting, the clients are trusted, and do not trust the central server to preserve their information (otherwise they could merely transmit the required data to the server). In all the privacy-preservation settings we will consider in this article, it is the server that is trusted, and queries to the server emanate from untrusted clients. We will not address SMC-based privacy methods further.

4.2 Statistical Measures of Anonymity

4.2.1 Query Restriction

Query restriction was one of the first methods for preserving anonymity in data [22, 25, 21, 40]. For a database of size N, and a fixed parameter k, all queries that returned either fewer than k or more than N − k records were rejected. Query restriction anticipates k-anonymity, in that the method for preserving anonymity is to return a large set of records for any query. Contrast this with data suppression; rather than deleting records, the procedure deletes queries. It was pointed out later [13, 12, 41, 10] that query restriction could be subverted by requesting a specific sequence of queries, and then combining them using simple Boolean operators, in a construction referred to as a tracker. Thus, this mechanism is not very effective.

4.2.2 Anonymity via Variance

Here, we start with randomly perturbed data x̃ = x + r, as described in Section 4.1.2.
Intuitively, the larger the perturbation, the more blurred, and thus more protected, the value is. Thus, we can measure anonymity by measuring the variance of the perturbed data. The larger the variance, the better the guarantee of anonymity, and thus one proposal by Duncan et al. [15] is to lower bound the variance for estimators of sensitive attributes. An alternative approach, used by Agrawal and Srikant [3], is to fix a confidence level and measure the length of the interval of values of the estimator that yields this confidence bound; the longer the interval, the more successful the anonymization.

Under this model, utility can be measured in a variety of ways. The Duncan et al. paper measures utility by combining the perturbation scheme with a query restriction method, and measuring the fraction of queries that are permitted after perturbation. Obviously, the larger the perturbation (measured by the variance σ²), the larger the fraction of queries that return sets of high cardinality. This presents a natural tradeoff between privacy (increased by increasing σ²) and utility (increased by increasing the fraction of permitted queries). The paper by Agrawal and Srikant implicitly measures utility in terms of how hard it is to reconstruct the original data distribution. They use many iterations of a Bayesian update procedure to perform this reconstruction; however, the reconstruction itself provides no guarantees (in terms of distance to the true data distribution).

4.2.3 Anonymity via Multiplicity

Perturbation-based privacy works by changing the values of data items. In generalization-based privacy, the idea is to "blur" the data via generalization. The hope here is that the blurred data set will continue to provide the statistical utility that the original data provided, while preventing access to individual tuples. The measure of privacy here is a combinatorial variant of the length-of-interval measure used in [3]. A database is said to be k-anonymous [42] if there is no query that can extract fewer than k records from it. This is achieved by aggregating tuples along a generalization hierarchy: for example, by aggregating ZIP codes up to the first three digits, and so on.

k-anonymity was first defined in the context of record linkage: can tuples from multiple databases be joined together to infer private information inaccessible from the individual sources? The k-anonymity requirement means such access cannot happen: since no query returns fewer than k records, no query can be used to isolate a single tuple containing the private information. As a method for blocking record linkage, k-anonymity is effective, and much research has gone into optimizing the computations, investigating the intrinsic hardness of computing it, and generalizing it to multiple dimensions.

4.3 Probabilistic Measures of Anonymity

Up to this point, an information leak has been defined as the revealing of specific data in a tuple. Often, though, information can be leaked even if the adversary does not gain access to a specific data item. Such attacks usually rely on knowing aggregate information about the (perturbed) source database, as well as the method of perturbation used when modifying the data. Suppose we attempt to anonymize an attribute X by perturbing it with a random value chosen uniformly from the interval [−1, 1] (see note 2).
Fixing a confidence level of 100%, and using the measure of privacy from [3], we infer that the privacy achieved by this perturbation is 2 (the length of the interval [−1, 1]). Suppose, however, that a distribution on the values of X is revealed: namely, X takes a value in the range [0, 1] with probability 1/2, and a value in the range [4, 5] with probability 1/2. In this case, no matter what the actual value of X is, an adversary can infer from the perturbed value X̃ which of the two intervals of length 1 the true value of X really lies in, reducing the effective privacy to at most 1.

Incorporating background information changes the focus of anonymity measurements. Rather than measuring the likelihood of some data being released, we now have to measure a far more nebulous quantity: the "amount of new information learned by an adversary" relative to the background. In order to do this, we need more precise notions of information leakage than the variance of a perturbed value. This analysis applies irrespective of whether we do anonymization based on random perturbation or generalization. We first consider measures of anonymization that are based on perturbation schemes, following this with an examination of measures based on generalization. In both settings, the measures are probabilistic: they compute functions of distributions defined on the data.

4.3.1 Measures Based on Random Perturbation

Using Mutual Information. The paper by Agrawal and Aggarwal [2] proposes the use of mutual information to measure leaked information. We can use the entropy H(A) to encode the amount of uncertainty (and therefore the degree of privacy) in a random variable A. H(A|B), the conditional entropy of A given B, can be interpreted as the amount of privacy "left" in A after B is revealed. Since entropy is usually expressed in terms of bits of information, we will use the expression 2^{H(A)} to represent the measure of privacy in A. Using this measure, the fraction of privacy leaked by an adversary who knows B can be written as

P(A|B) = 1 − 2^{H(A|B)} / 2^{H(A)} = 1 − 2^{−I(A;B)},

where I(A;B) = H(A) − H(A|B) is the mutual information between the random variables A and B. They also develop a notion of utility, measured by the statistical distance between the source distribution of the data and the perturbed distribution. They also demonstrate an EM-based method for reconstructing the maximum likelihood estimate of the source distribution, and show that it converges to the correct answer (they do not address the issue of rate of convergence).

Handling Categorical Values. The above schemes rely on the source data being numerical. For data mining applications, the relevant source data is usually categorical, consisting of collections of transactions, each transaction defined as a set of items. For example, in the typical market-basket setting, a transaction consists of a set of items purchased by a customer. Such sets are typically represented by binary characteristic vectors. The elementary datum that requires anonymity is membership: does item i belong to transaction t? The questions requiring utility, on the other hand, are of the form, "which patterns have reasonable support and confidence?" In such a setting, the only possible perturbation is to flip an item's membership in a transaction, but not so often as to change the answers to questions about patterns in any significant way. There are two ways of measuring privacy in this setting.
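Before turning to those two measures, the flip-based perturbation itself can be sketched. This is a simplified illustration in the spirit of the bit-flipping scheme of Rizvi and Haritsa [38] discussed next, not their exact formulation; the fixed flip probability p and the unbiased inversion step below are the only ingredients assumed here.

    import random

    def flip_bits(bits, p, seed=0):
        """Flip each membership bit independently with probability p."""
        rng = random.Random(seed)
        return [b ^ (rng.random() < p) for b in bits]

    def estimate_frequency(perturbed_bits, p):
        """Unbiased estimate of the true fraction of 1s: if f is the true
        frequency, the observed frequency is f*(1-p) + (1-f)*p, so invert."""
        observed = sum(perturbed_bits) / len(perturbed_bits)
        return (observed - p) / (1 - 2 * p)

    true_bits = [1] * 300 + [0] * 700          # item present in 30% of transactions
    noisy = flip_bits(true_bits, p=0.2)
    print(round(estimate_frequency(noisy, p=0.2), 3))   # close to 0.3

The inversion step also shows why utility degrades as p approaches 1/2: the denominator 1 − 2p vanishes, so the variance of the reconstructed frequency grows without bound.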
The approach taken by Evfimievski et al. [20] is to evaluate whether an anonymization scheme leaves clues for an adversary with high probability. Specifically, they define a privacy breach as one in which the probability of some property of the input data is high, conditioned on the output perturbed data having certain properties.

Definition 4.3.1 An itemset A causes a privacy breach of level ρ if for some item a ∈ A and some i ∈ 1...N we have P[a ∈ t_i | A ⊆ t′_i] ≥ ρ.

Here, the event "A ⊆ t′_i" is leaking information about the event "a ∈ t_i". Note that this measure is absolute, regardless of what the prior probability of a ∈ t_i might have been. The perturbation method is based on randomly sampling some items of the transaction t_i to keep, and buffering with elements a ∉ t_i chosen at random.

The second approach, taken by Rizvi and Haritsa [38], is to measure privacy in terms of the probability of correctly reconstructing the original bit, given a perturbed bit. This can be calculated using Bayes' Theorem, and is parametrized by the probability of flipping a bit (which they set to a constant p). Privacy is then achieved by setting p to a value that minimizes the reconstruction probability; the authors show that a wide range of values for p yields acceptable privacy thresholds.

Both papers then frame utility as the problem of reconstructing itemset frequencies accurately. [20] establishes the tradeoff between utility and privacy more precisely, in terms of the probabilities p[l → l′] = P[#(t′ ∩ A) = l′ | #(t ∩ A) = l]. For privacy, we have to ensure that (for example) if we fix an element a ∈ t, then the set of transactions t that do contain a is not overly represented in the modified itemset. Specifically, in terms of an average over the size of the tuple sets returned, we obtain a condition on the p[l → l′]. In essence, the probabilities p[l → l′] encode the tradeoff between utility (or ease of reconstruction) and privacy.

Measuring Transfer of Information. Both the above papers have the same weakness that plagued the original statistics-based anonymization works: they ignore the problem of the background knowledge attack. A related, and yet subtly different, problem is that ignoring the source data distribution may yield meaningless results. For example, suppose the probability of an item occurring in any particular transaction is very high. Then the probability of reconstructing its value correctly is also high, but this would not ordinarily be viewed as a leak of information. A more informative approach would be to measure the level of "surprise": namely, whether the probability P[a ∈ t_i] increases (or decreases) dramatically, conditioned on seeing the event A ⊆ t′_i. Notice that this idea is the motivation for [2]; in their paper, the mutual information I(A;B) measures the transfer of information between the source and anonymized data.

Evfimievski et al. [19], in a followup to [20], develop a slightly different notion of information transfer, motivated by the idea that mutual information is an "averaged" measure and that for privacy preservation, worst-case bounds are more relevant. Formally, information leakage is measured by estimating the change in probability of a property from source to distorted data. For example, given a property Q(X) of the data, they say that there is a privacy breach after perturbing the data by function R(X) if for some y,

P[Q(X)] ≤ ρ1 and P[Q(X) | R(X) = y] ≥ ρ2,

where ρ1 < ρ2. However, ensuring that this property holds is computationally intensive.
The authors show that a sufficient condition for guaranteeing no (ρ1, ρ2)-privacy breach is to bound the ratio of the probabilities of two different values x1, x2 being mapped to a particular y. Formally, they propose perturbation schemes such that

p[x1 → y] / p[x2 → y] ≤ γ.

Intuitively, this means that if we look back from y, there is no easy way of telling whether the source was x1 or x2. The formal relation to (ρ1, ρ2)-privacy is established via this intuition. Formally, we can rewrite

I(X;Y) = Σ_y p(y) KL(p(X | Y = y) || p(X)).

The function KL(p(X | Y = y) || p(X)) measures the transfer distance; it asks how different the induced distribution p(X | Y = y) is from the source distribution p(X). The greater the difference, the greater the privacy breach. The authors propose replacing the averaging in the above expression by a max, yielding a modified notion

I_w(X;Y) = max_y p(y) KL(p(X | Y = y) || p(X)).

They then show that a (ρ1, ρ2)-privacy breach yields a lower bound on the worst-case mutual information I_w(X;Y), which is what we would expect.

More General Perturbation Schemes. All of the above described perturbation schemes are local: perturbations are applied independently to data items. Kargupta et al. [27] showed that the lack of correlation between perturbations can be used to attack such a privacy-preserving mechanism. Their key idea is a spectral filtering method based on computing principal components of the data transformation matrix. Their results suggest that for more effective privacy preservation, one should consider more general perturbation schemes. It is not hard to see that a natural generalization of these perturbation schemes is a Markov-chain based approach, where an item x is perturbed to item y based on a transition probability p(y|x). FRAPP [4] is one such scheme based on this idea. The authors show that they can express the notion of a (ρ1, ρ2)-privacy breach in terms of properties of the Markov transition matrix. Moreover, they can express the utility of this scheme in terms of the condition number of the transition matrix.

4.3.2 Measures Based on Generalization

It is possible to mount a 'background knowledge' attack on k-anonymity. For example, it is possible that all the k records returned from a particular query share the same value of some attribute. Knowing that the desired tuple is one of the k tuples, we have thus extracted a value from this tuple without needing to isolate it.

The first approach to address this problem was the work on ℓ-diversity [32]. Here, the authors start with the now-familiar idea that the privacy measure should capture the change in the adversary's world-view upon seeing the data. However, they execute this idea with an approach that is absolute. They require that the distribution of sensitive values in an aggregate have high entropy (at least log ℓ). This subsumes k-anonymity, since we can think of the probability of leakage of a single tuple in k-anonymity as 1/k, and so the "entropy" of the aggregate is log k. Starting with this idea, they introduce variants of ℓ-diversity that are more relaxed about disclosure, or allow one to distinguish between positive and negative disclosure, or even allow for multi-attribute disclosure measurement. Concurrently published, the work on p-sensitive k-anonymity [43] attempts to do the same thing, but in a more limited way, by requiring at least p distinct sensitive values in each generalization block, instead of using entropy.
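As a concrete illustration of these block-level requirements, the following sketch checks entropy ℓ-diversity and the p-distinct-values condition for a single generalization block. It is an illustrative check only, not the algorithms of [32] or [43], and the use of base-2 logarithms and the example sensitive values are assumptions of this sketch.

    import math
    from collections import Counter

    def entropy_l_diverse(sensitive_values, l):
        """Entropy l-diversity: the entropy of the sensitive-value distribution
        within the block must be at least log2(l)."""
        counts = Counter(sensitive_values)
        n = len(sensitive_values)
        entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
        return entropy >= math.log2(l)

    def p_sensitive(sensitive_values, p):
        """p-sensitivity: the block must contain at least p distinct sensitive values."""
        return len(set(sensitive_values)) >= p

    block = ["flu", "flu", "cancer", "hepatitis"]       # sensitive values in one block
    print(entropy_l_diverse(block, l=2), p_sensitive(block, p=3))   # True True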
A variant of this idea was proposed by Wong et al. [47]; in their scheme, termed (α, k)-anonymity, the additional constraint imposed on the generalization is that the fractional frequency of each value in a generalization is no more than α. Note that this approach automatically lower bounds the entropy of the generalization by log(1/α).

Machanavajjhala et al. [32] make the point that it is difficult to model the adversary's background knowledge; they use this argument to justify the ℓ-diversity measure. One way to address this problem is to assume that the adversary has access to global statistics of the sensitive attribute in question. In this case, the goal is to make the sensitive attribute "blend in": its distribution in the generalization should mimic its distribution in the source data. This is the approach taken by Li, Li and the author [31]. They define a measure called t-closeness that requires that the "distance" between the distributions of a sensitive attribute in the generalized and original tables is at most t.

A natural distance measure to use would be the KL-distance from the generalized to the source distribution. However, for numerical attributes, the notion of closeness must incorporate the notion of a metric on the attribute. For example, suppose that a salary field in a table is generalized to have three distinct values (20000, 21000, 22000). One might reasonably argue that this generalization leaks more information than a generalization that has the three distinct values (20000, 50000, 80000). Computing the distance between two distributions whose underlying domains inhabit a metric space can be performed using the metric known as the earth-mover distance [39], or the Monge-Kantorovich transportation distance [24]. Formally, suppose we have two distributions p, q defined over the elements of a metric space (X, d). Then the earth-mover distance between p and q is

d_E(p, q) = inf_{P[x′|x]} Σ_{x,x′} d(x, x′) P[x′|x] p(x),

subject to the constraint Σ_x P[x′|x] p(x) = q(x′).

Intuitively, this distance is defined as the value that minimizes the transportation cost of transforming one distribution to the other, where transportation cost is measured in terms of the distance in the underlying metric space. Note that since any underlying metric can be used, this approach can be used to integrate numerical and categorical attributes, by imposing any suitable metric (based on domain generalization or other methods) on the categorical attributes.

The idea of extending the notion of diversity to numerical attributes was also considered by Zhang et al. [50]. In this paper, the notion of distance for numerical attributes is extended in a different way: the goal for the k-anonymous blocks is that the "diameter" of the range of sensitive attributes is larger than a parameter e. Such a generalization is said to be (k, e)-anonymous. Note that this condition makes utility difficult. If we relate this to the ℓ-diversity condition of having at least ℓ distinct values, this represents a natural generalization of the approach. As stated, however, the approach appears to require defining a total order on the domain of the attribute; this would prevent it from being used for higher-dimensional attribute sets. Another interesting feature of the Zhang et al.
method is that it considers the downstream problem of answering aggregate queries on an anonymized database, and argues that rather than performing generalization, it might be better to perform a permutation of the data. They show that this permutation-based anonymization can answer aggregate queries more accurately than generalization-based anonymization.

Anonymizing Inferences. In all of the above measures, the data being protected is an attribute of a record, or some distributional characteristic of the data. Another approach to anonymization is to protect the possible inferences that can be made from the data; this is akin to the approach taken by Evfimievski et al. [19, 20] for perturbation-based privacy. Wang et al. [45] investigate this idea in the context of generalization and suppression. A privacy template is an inference on the data, coupled with a confidence bound, and the requirement is that in the anonymized data, this inference not be valid with a confidence larger than the provided bound. In their paper, they present a scheme based on data suppression (equivalent to using a unit-height generalization hierarchy) to ensure that a given set of privacy templates can be preserved.

Clustering as k-anonymity. Viewing attributes as elements of a metric space and defining privacy accordingly has not been studied extensively. However, from the perspective of generalization, many papers [7, 30, 35] have pointed out that generalization along a domain generalization hierarchy is only one way of aggregating data. In fact, if we endow the attribute space with a metric, then the process of generalization can be viewed in general as a clustering problem on this metric space, where the appropriate measure of anonymity is applied to each cluster, rather than to each generalized group. Such an approach has the advantage of placing different kinds of attributes on an equal footing. When anonymizing categorical attributes, generalization proceeds along a generalization hierarchy, which can be interpreted as defining a tree metric. Numerical attributes are generalized along ranges, and t-closeness works with attributes in a general metric space. By lifting all such attributes to a general metric space, generalization can happen in a uniform manner, measured in terms of the diameters of the clusters. Strictly speaking, these methods do not introduce a new notion of privacy; however, they do extend the applicability of generalization-based privacy measures like k-anonymity and its successors.

Measuring Utility in Generalization-Based Anonymity. The original k-anonymity work defines the utility of a generalized table as follows. Each cell is the result of generalizing an attribute up a certain number of levels in a generalization hierarchy. In normalized form, the "height" of a generalization ranges from 0 if the original value is used, to 1 if a completely generalized value is used (in the scheme proposed, a value of 1 corresponds to value suppression, since that is the top level of all hierarchies). The precision of a generalization scheme is then 1 minus the average height of a generalization (measured over all cells). The precision is 1 if there is no generalization and 0 if all values are generalized.

Bayardo and Agrawal [5] define a different utility measure for k-anonymity. In their view, a tuple that inhabits a generalized equivalence class E of size |E| = j, j ≥ k, incurs a "cost" of j.
A tuple that is suppressed entirely incurs a cost of D, where D is the size of the entire database. Thus, the cost incurred by an anonymization is given by

C = Σ_{E : |E| ≥ k} |E|² + Σ_{E : |E| < k} D·|E|.

In the model of Dinur and Nissim [14], a database fails to be private if there is some adversary running in time t(n) that can succeed with high probability. In this model, adversaries are surprisingly strong. The authors show that even with almost-linear perturbation, an adversary permitted to run in exponential time can break privacy. Restricting the adversary to run in polynomial time helps, but only slightly; any perturbation E = o(√n) is not enough to preserve privacy, and this is tight.

Feasibility results are hard to prove in this model: as the authors point out, an adversary, with one query, can distinguish between the databases 1^n and 0^n if it has background knowledge that these are the only two choices. A perturbation of n/2 would be needed to hide the database contents. One way of circumventing this is to assume that the database itself is generated from some distribution, and that the adversary is required to reveal the value of a specific bit (say, the ith bit) after making an arbitrary number of queries, and after being given all bits of the database except the ith bit. In this setting, privacy is defined as the condition that the adversary's reconstruction probability is at most 1/2 + δ. In this setting, they show that a T(n)-perturbed database is private against all adversaries that run in time T(n).

Measuring Anonymity via Information Transfer. As before, in the case of probabilistic methods, we can reformulate the anonymity question in terms of information transfer: how much does the probability of a bit being 1 (or 0) change upon anonymization? Dwork and Nissim [18] explore this idea in the context of computationally bounded adversaries. Starting with a database d represented as a Boolean matrix and drawn from a distribution D, we can define the prior probability p^0_ij = P[d_ij = 1]. Once an adversary asks T queries to the anonymized database as above, and all other values of the database are provided, we can define the posterior probability p^T_ij of d_ij taking the value 1. The change in belief can be quantified by the expression

Δ = |c(p^T_ij) − c(p^0_ij)|, where c(x) = log(x/(1 − x))

is a monotonically increasing function of x. This is the simplified version of their formulation. In general, we can replace the event d_ij = 1 by the more general f(d_i1, d_i2, ..., d_ik) = 1, where f is some k-ary Boolean function. All the above definitions translate to this more general setting. We can now define (δ, T(n))-privacy as the condition that for all distributions over databases, all functions f, and all adversaries making T queries, the probability that the maximum change of belief is more than δ is negligibly small.

As with [14], the authors show a natural tradeoff between the degree of perturbation needed and the level of privacy achieved. Specifically, the authors show that a previously proposed algorithm, SuLQ [6], achieves (δ, T(n))-privacy with a perturbation E = O(√T(n) / δ). They then go on to show that under such conditions, it is possible to perform efficient and accurate data mining on the anonymized database to estimate probabilities of the form P[β|α], where α, β are two attributes.
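The change-of-belief quantity Δ is simple to compute once prior and posterior probabilities for a bit are available; the following minimal sketch uses hypothetical probability values purely for illustration.

    import math

    def log_odds(p):
        """c(x) = log(x / (1 - x)), the monotone map used to compare beliefs."""
        return math.log(p / (1 - p))

    def belief_change(prior, posterior):
        """Delta = |c(p_T) - c(p_0)|: how much the adversary's belief about a
        single bit moved after the T queries."""
        return abs(log_odds(posterior) - log_odds(prior))

    # Hypothetical numbers: the prior belief that d_ij = 1 was 0.5; after the
    # queries it has risen to 0.9. A (delta, T(n))-private scheme requires such
    # a change to exceed delta only with negligible probability.
    print(round(belief_change(0.5, 0.9), 3))    # about 2.197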
Indistinguishability. Although the above measures of privacy develop precise notions of information transfer with respect to a bounded adversary, they still require some notion of a distribution on the input databases, as well as a specific protocol followed by an adversary. To abstract the ideas underlying privacy further, Dwork et al. [17] formulate a definition of privacy inspired by Dalenius [16]: a database is private if anything learnable from it can be learned in the absence of the database. In order to do this, they distinguish between non-interactive privacy mechanisms, where the data publisher anonymizes the data and publishes it (input perturbation), and interactive mechanisms, in which the outputs to queries are perturbed (output perturbation). Dwork [16] shows that in a non-interactive setting, it is impossible to achieve privacy under this definition; in other words, it is always possible to design an adversary and an auxiliary information generator such that the adversary, combining the anonymized data and the auxiliary information, can effect a privacy breach far more often than an adversary lacking access to the database can.

In the interactive setting, we can think of the interaction between the database and the adversary as a transcript. The idea of indistinguishability is that if two databases are very similar, then their transcripts with respect to an adversary should also be similar. Intuitively, this means that if an individual adds their data to a database (causing a small change), the nominal loss in privacy is very small. The main consequence of this formulation is that it is possible to design perturbation schemes that depend only on the query functions and the error terms, and are independent of the database. Informally, the amount of perturbation required depends on the sensitivity of the query functions: the more the function can change when one input is perturbed slightly, the more perturbation the database must incur. The details of these procedures are quite technical; the reader is referred to [16, 17] for more details.

4.4.1 Anonymity via Isolation

Another approach to anonymization is taken by [8, 9]. The underlying principle here is isolation: a record is private if it cannot be singled out from its neighbors. Formally, they define an adversary as an algorithm that takes an anonymized database and some auxiliary information, and outputs a single point q. The adversary succeeds if a small ball around q does not contain too many points of the database; in this sense, the adversary has isolated some points of the database (see note 3). Under this definition of a privacy breach, they then develop methods for anonymizing a database. Like the papers above, they use a differential model of privacy: an anonymization is successful if the adversary, combining the anonymization with auxiliary information, can do no better at isolation than a weaker adversary with no access to the anonymized data.

One technical problem with the idea of isolation, which the authors acknowledge, is that it can be attacked in the same way that methods like k-anonymity are attacked. If the anonymization causes many points with similar characteristics to cluster together, then even though the adversary cannot isolate a single point, it can determine some special characteristics of the data from the clustering that might not have otherwise been inferred.
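Before turning to the conclusions, the idea of calibrating output perturbation to the sensitivity of a query, sketched informally above, can be illustrated as follows. This is a simplified sketch rather than the mechanism of [17]: Laplace-distributed noise is assumed as one common instantiation, and the parameter name epsilon is illustrative rather than the notation of the cited papers.

    import random

    def perturbed_query(database, query, sensitivity, epsilon, seed=None):
        """Interactive output perturbation: compute the true answer, then add
        noise whose magnitude grows with the query's sensitivity (the maximum
        change in the answer when a single record changes) and shrinks as the
        privacy parameter epsilon grows."""
        rng = random.Random(seed)
        true_answer = query(database)
        # The difference of two exponential variates with mean sensitivity/epsilon
        # is Laplace-distributed with that scale.
        scale = sensitivity / epsilon
        noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
        return true_answer + noise

    db = [0, 1, 1, 0, 1, 1, 1]                  # toy database of bits
    count_ones = lambda d: sum(d)                # a counting query; sensitivity 1
    print(perturbed_query(db, count_ones, sensitivity=1.0, epsilon=0.5, seed=7))

Note how the noise depends only on the query (through its sensitivity) and the desired privacy level, not on the database contents, which is exactly the database-independence property discussed above.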
4.5 Conclusions and New Directions

The evolution of measures of privacy, irrespective of the specific method of perturbation or class of measure, has proceeded along a standard path. The earliest measures are absolute in nature, defining an intuitive notion of privacy in terms of a measure of obfuscation. Further development occurs when the notion of background information is brought in, and this culminates in the idea of a change in adversarial information before and after the anonymized data is presented.

From the perspective of theoretical rigor, computational approaches to privacy are the most attractive. They rely on few to no modelling assumptions about adversaries, and their cryptographic flavor reinforces our belief in their overall reliability as measures of privacy. Although the actual privacy preservation methods proposed in this space are fairly simple, they do work from very simple models of the underlying database, and one question that so far remains unanswered is the degree to which these methods can be made practically effective when dealing with the intricacies of actual databases.

The most extensive attention has been paid to the probabilistic approaches to privacy measurements. k-anonymity and its successors have inspired numerous works that study not only variants of the basic measures, but systems for managing privacy, extensions to higher dimensional spaces, as well as better methods for publishing data tables. The challenge in dealing with methods deriving from k-anonymity is the veritable alphabet soup of approaches that have been proposed, all varying subtly in the nature of the assumptions used. The work by Wong et al. [46] illustrates the subtleties of modelling background information; their m-confidentiality measure attempts to model adversaries who exploit the desire of k-anonymizing schemes to generate a minimal anonymization. This kind of background information is very hard to formalize and argue rigorously about, even when we consider the general framework for analyzing background information proposed by Martin et al. [33].

4.5.1 New Directions

There are two recent directions in the area of privacy preservation measures that are quite interesting and merit further study. The first addresses the problem noted earlier: the imbalance in the study of utility versus privacy. The computational approaches to privacy preservation, starting with the work of Dinur and Nissim [14], provide formal tradeoffs between utility and privacy for bounded adversaries. The work of Kifer et al. [28] on injecting utility into privacy preservation allows for a more general measure of utility as a distance between distributions, and Rastogi et al. [37] examine the tradeoff between privacy and utility rigorously in the perturbation framework.

With a few exceptions, all of the above measures of privacy are global: they assume a worst-case (or average-case) measure of privacy over the entire input, or prove privacy guarantees that are independent of the specific instance of a database being anonymized. It is therefore natural to consider personalized privacy, where the privacy guarantee need only be accurate with respect to the specific instance being considered, or can be tuned depending on auxiliary inputs. The technique for anonymizing inferences developed in [45] can be viewed as such a scheme: the set of inferences needing protection is supplied as part of the input, and other inferences need not be protected.
In the context of k-anonymity, Xiao and Tao [48] propose a technique that takes as input user preferences about the level of generalization they desire for their sensitive attributes, and adapts the k-anonymity method to satisfy these preferences. The work on worst-case background information modelling by Martin et al. [33] assumes that the specific background knowledge possessed by an adversary is an input to the privacy-preservation algorithm. Recent work by Nissim et al. [36] revisits the indistinguishability measure [17] (which is oblivious of the specific database instance) by designing an instance-based property of the query function that they use to anonymize a given database.

Notes

1. ...and the expertise of the author!
2. This example is taken from [2].
3. This bears a strong resemblance to k-anonymity, but is more general.

References

[1] Proceedings of the 23rd International Conference on Data Engineering, ICDE 2007, April 15-20, 2007, The Marmara Hotel, Istanbul, Turkey (2007), IEEE.

[2] Agrawal, D., and Aggarwal, C. C. On the design and quantification of privacy preserving data mining algorithms. In Proceedings of the Twentieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (Santa Barbara, CA, 2001), pp. 247–255.

[3] Agrawal, R., and Srikant, R. Privacy preserving data mining. In Proceedings of the ACM SIGMOD Conference on Management of Data (Dallas, TX, May 2000), pp. 439–450.

[4] Agrawal, S., and Haritsa, J. R. FRAPP: A framework for high-accuracy privacy-preserving mining. In ICDE '05: Proceedings of the 21st International Conference on Data Engineering (Washington, DC, USA, 2005), IEEE Computer Society, pp. 193–204.

[5] Bayardo, Jr., R. J., and Agrawal, R. Data privacy through optimal k-anonymization. In ICDE (2005), IEEE Computer Society, pp. 217–228.

[6] Blum, A., Dwork, C., McSherry, F., and Nissim, K. Practical privacy: the SuLQ framework. In PODS '05: Proceedings of the Twenty-Fourth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (New York, NY, USA, 2005), ACM Press, pp. 128–138.

[7] Byun, J.-W., Kamra, A., Bertino, E., and Li, N. Efficient k-anonymization using clustering techniques. In DASFAA (2007), K. Ramamohanarao, P. R. Krishna, M. K. Mohania, and E. Nantajeewarawat, Eds., vol. 4443 of Lecture Notes in Computer Science, Springer, pp. 188–200.

[8] Chawla, S., Dwork, C., McSherry, F., Smith, A., and Wee, H. Toward privacy in public databases. In TCC (2005), J. Kilian, Ed., vol. 3378 of Lecture Notes in Computer Science, Springer, pp. 363–385.

[9] Chawla, S., Dwork, C., McSherry, F., and Talwar, K. On privacy-preserving histograms. In UAI (2005), AUAI Press.

[10] de Jonge, W. Compromising statistical databases responding to queries about means. ACM Trans. Database Syst. 8, 1 (1983), 60–80.

[11] DeCew, J. Privacy. In The Stanford Encyclopedia of Philosophy, E. N. Zalta, Ed. Fall 2006.

[12] Denning, D. E., Denning, P. J., and Schwartz, M. D. The tracker: A threat to statistical database security. ACM Trans. Database Syst. 4, 1 (1979), 76–96.

[13] Denning, D. E., and Schlörer, J. A fast procedure for finding a tracker in a statistical database. ACM Trans. Database Syst. 5, 1 (1980), 88–102.

[14] Dinur, I., and Nissim, K. Revealing information while preserving privacy. In PODS '03: Proceedings of the Twenty-Second ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (New York, NY, USA, 2003), ACM Press, pp. 202–210.

[15] Duncan, G. T., and Mukherjee, S.
Optimal disclosure limitation strategy in statistical databases: Deterring tracker attacks through additive noise. Journal of the American Statistical Association 95, 451 (2000), 720.

[16] Dwork, C. Differential privacy. In Proc. 33rd Intl. Conf. Automata, Languages and Programming (ICALP) (2006), pp. 1–12. Invited paper.

[17] Dwork, C., McSherry, F., Nissim, K., and Smith, A. Calibrating noise to sensitivity in private data analysis. In TCC (2006), S. Halevi and T. Rabin, Eds., vol. 3876 of Lecture Notes in Computer Science, Springer, pp. 265–284.

[18] Dwork, C., and Nissim, K. Privacy-preserving datamining on vertically partitioned databases. In CRYPTO (2004), M. K. Franklin, Ed., vol. 3152 of Lecture Notes in Computer Science, Springer, pp. 528–544.

[19] Evfimievski, A., Gehrke, J., and Srikant, R. Limiting privacy breaches in privacy preserving data mining. In Proceedings of the ACM SIGMOD/PODS Conference (San Diego, CA, June 2003), pp. 211–222.

[20] Evfimievski, A., Srikant, R., Agrawal, R., and Gehrke, J. Privacy preserving mining of association rules. In KDD '02: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (New York, NY, USA, 2002), ACM Press, pp. 217–228.

[21] Fellegi, I. P. On the question of statistical confidentiality. J. Am. Stat. Assoc. 67, 337 (1972), 7–18.

[22] Friedman, A. D., and Hoffman, L. J. Towards a fail-safe approach to secure databases. In Proc. IEEE Symp. Security and Privacy (1980).

[23] Gavison, R. Privacy and the limits of the law. The Yale Law Journal 89, 3 (January 1980), 421–471.

[24] Givens, C. R., and Shortt, R. M. A class of Wasserstein metrics for probability distributions. Michigan Math. J. 31 (1984), 231–240.

[25] Hoffman, L. J., and Miller, W. F. Getting a personal dossier from a statistical data bank. Datamation 16, 5 (1970), 74–75.

[26] Iyengar, V. S. Transforming data to satisfy privacy constraints. In KDD '02: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (New York, NY, USA, 2002), ACM Press, pp. 279–288.

[27] Kargupta, H., Datta, S., Wang, Q., and Sivakumar, K. On the privacy preserving properties of random data perturbation techniques. In Proceedings of the IEEE International Conference on Data Mining (Melbourne, FL, November 2003), p. 99.

[28] Kifer, D., and Gehrke, J. Injecting utility into anonymized datasets. In SIGMOD '06: Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data (New York, NY, USA, 2006), ACM Press, pp. 217–228.

[29] Koch, C., Gehrke, J., Garofalakis, M. N., Srivastava, D., Aberer, K., Deshpande, A., Florescu, D., Chan, C. Y., Ganti, V., Kanne, C.-C., Klas, W., and Neuhold, E. J., Eds. Proceedings of the 33rd International Conference on Very Large Data Bases, University of Vienna, Austria, September 23-27, 2007 (2007), ACM.

[30] LeFevre, K., DeWitt, D. J., and Ramakrishnan, R. Mondrian multidimensional k-anonymity. In ICDE '06: Proceedings of the 22nd International Conference on Data Engineering (Washington, DC, USA, 2006), IEEE Computer Society, p. 25.

[31] Li, N., Li, T., and Venkatasubramanian, S. t-closeness: Privacy beyond k-anonymity and ℓ-diversity. In ICDE [1] (2007).

[32] Machanavajjhala, A., Gehrke, J., Kifer, D., and Venkitasubramaniam, M. ℓ-diversity: Privacy beyond k-anonymity. In Proceedings of the 22nd International Conference on Data Engineering (ICDE'06) (2006), p. 24.
[33] Martin, D. J., Kifer, D., Machanavajjhala, A., Gehrke, J., and Halpern, J. Y. Worst-case background knowledge for privacy-preserving data publishing. In ICDE [1], pp. 126–135.

[34] Nakashima, E. AOL Search Queries Open Window Onto Users' Worlds. The Washington Post (August 17, 2006).

[35] Nergiz, M. E., and Clifton, C. Thoughts on k-anonymization. In ICDE Workshops (2006), R. S. Barga and X. Zhou, Eds., IEEE Computer Society, p. 96.

[36] Nissim, K., Raskhodnikova, S., and Smith, A. Smooth sensitivity and sampling in private data analysis. In STOC '07: Proceedings of the Thirty-Ninth Annual ACM Symposium on Theory of Computing (New York, NY, USA, 2007), ACM Press, pp. 75–84.

[37] Rastogi, V., Hong, S., and Suciu, D. The boundary between privacy and utility in data publishing. In Koch et al. [29], pp. 531–542.

[38] Rizvi, S. J., and Haritsa, J. R. Maintaining data privacy in association rule mining. In VLDB '2002: Proceedings of the 28th International Conference on Very Large Data Bases (2002), VLDB Endowment, pp. 682–693.

[39] Rubner, Y., Tomasi, C., and Guibas, L. J. The earth mover's distance as a metric for image retrieval. Int. J. Comput. Vision 40, 2 (2000), 99–121.

[40] Schlörer, J. Identification and retrieval of personal records from a statistical data bank. Methods Info. Med. 14, 1 (1975), 7–13.

[41] Schwartz, M. D., Denning, D. E., and Denning, P. J. Linear queries in statistical databases. ACM Trans. Database Syst. 4, 2 (1979), 156–167.

[42] Sweeney, L. Achieving k-anonymity privacy protection using generalization and suppression. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 10, 5 (2002), 571–588.

[43] Truta, T. M., and Vinay, B. Privacy protection: p-sensitive k-anonymity property. In ICDEW '06: Proceedings of the 22nd International Conference on Data Engineering Workshops (Washington, DC, USA, 2006), IEEE Computer Society, p. 94.

[44] U.S. Department of Health and Human Services. Office for Civil Rights - HIPAA. http://www.hhs.gov/ocr/hipaa/.

[45] Wang, K., Fung, B. C. M., and Yu, P. S. Handicapping attacker's confidence: an alternative to k-anonymization. Knowl. Inf. Syst. 11, 3 (2007), 345–368.

[46] Wong, R. C.-W., Fu, A. W.-C., Wang, K., and Pei, J. Minimality attack in privacy preserving data publishing. In Koch et al. [29], pp. 543–554.

[47] Wong, R. C.-W., Li, J., Fu, A. W.-C., and Wang, K. (α,k)-anonymity: an enhanced k-anonymity model for privacy preserving data publishing. In KDD '06: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (New York, NY, USA, 2006), ACM Press, pp. 754–759.

[48] Xiao, X., and Tao, Y. Personalized privacy preservation. In SIGMOD '06: Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data (New York, NY, USA, 2006), ACM Press, pp. 229–240.

[49] Yao, A. C. Protocols for secure computations. In Proc. IEEE Foundations of Computer Science (1982), pp. 160–164.

[50] Zhang, Q., Koudas, N., Srivastava, D., and Yu, T. Aggregate query answering on anonymized tables. In ICDE [1], pp. 116–125.

Chapter 5

k-Anonymous Data Mining: A Survey

V. Ciriani, S. De Capitani di Vimercati, S. Foresti, and P. Samarati
DTI - Università degli Studi di Milano
26013 Crema - Italy
{ciriani, decapita, foresti, samarati}@dti.unimi.it

Abstract

Data mining technology has attracted significant interest as a means of identifying patterns and trends from large collections of data.
It is, however, evident that the collection and analysis of data that include personal information may violate the privacy of the individuals to whom the information refers. Privacy protection in data mining is then becoming a crucial issue that has captured the attention of many researchers. In this chapter, we first describe the concept of k-anonymity and illustrate different approaches for its enforcement. We then discuss how the privacy requirements characterized by k-anonymity can be violated in data mining and introduce possible approaches to ensure the satisfaction of k-anonymity in data mining.

Keywords: k-anonymity, data mining, privacy.

5.1 Introduction

The amount of data being collected every day by private and public organizations is quickly increasing. In such a scenario, data mining techniques are becoming more and more important for assisting decision making processes and, more generally, for extracting hidden knowledge from massive data collections in the form of patterns, models, and trends that hold in the data collections. While not explicitly containing the original actual data, data mining results could potentially be exploited to infer information contained in the original data but not intended for release, thereby potentially breaching the privacy of the parties to whom the data refer. Effective application of data mining can take place only if proper guarantees are given that the privacy of the underlying data is not compromised. The concept of privacy preserving data mining has been proposed in response to these privacy concerns [6].

Privacy preserving data mining aims at providing a trade-off between sharing information for data mining analysis, on the one side, and protecting information to preserve the privacy of the involved parties on the other side. Several privacy preserving data mining approaches have been proposed, which usually protect data by modifying them to mask or erase the original sensitive data that should not be revealed [4, 6, 13]. These approaches are typically based on the concepts of: loss of privacy, measuring the capacity of estimating the original data from the modified data, and loss of information, measuring the loss of accuracy in the data. In general, the greater the privacy of the respondents to whom the data refer, the less accurate the result obtained by the miner, and vice versa. The main goal of these approaches is therefore to provide a trade-off between privacy and accuracy. Other approaches to privacy preserving data mining exploit cryptographic techniques for preventing information leakage [20, 30]. The main problem of cryptography-based techniques is, however, that they are usually computationally expensive.

Privacy preserving data mining techniques clearly depend on the definition of privacy, which captures what information is sensitive in the original data and should therefore be protected from either direct or indirect (via inference) disclosure. In this chapter, we consider a specific aspect of privacy that has been receiving considerable attention recently, and that is captured by the notion of k-anonymity [11, 26, 27]. k-anonymity is a property that models the protection of released data against possible re-identification of the respondents to which the data refer.
Intuitively, k-anonymity states that each release of data must be such that every combination of values of released attributes that are also externally available, and therefore exploitable for linking, can be indistinctly matched to at least k respondents. k-anonymous data mining has been recently introduced as an approach to ensuring privacy preservation when releasing data mining results. Very few, preliminary, attempts have been presented looking at different aspects of guaranteeing k-anonymity in data mining. We discuss possible threats to k-anonymity posed by data mining and sketch possible approaches to counteracting them, also briefly illustrating some preliminary results existing in the current literature. After recalling the concept of k-anonymity (Section 5.2) and some proposals for its enforcement (Section 5.3), we discuss possible threats to k-anonymity to which data mining results are exposed (Section 5.4). We then illustrate (Section 5.5) possible approaches combining k-anonymity and data mining, distinguishing them depending on whether k-anonymity is enforced directly on the private data (before mining) or on the mined data themselves (either as a post-mining sanitization process or by the mining process itself). For each of the two approaches (Sections 5.6 and 5.7, respectively) we discuss possible ways to capture k-anonymity violations, with the aim, on the one side, of defining when mined results respect k-anonymity of the original data and, on the other side, of identifying possible protection techniques for enforcing such a definition of privacy.

5.2 k-Anonymity

k-anonymity [11, 26, 27] is a property that captures the protection of released data against possible re-identification of the respondents to whom the released data refer. Consider a private table PT, where data have been de-identified by removing explicit identifiers (e.g., SSN and Name). However, values of other released attributes, such as ZIP, Date of birth, Marital status, and Sex, can also appear in some external tables jointly with the individual respondents' identities. If some combinations of values for these attributes are such that their occurrence is unique or rare, then parties observing the data can determine the identity of the respondent to which the data refer, or reduce the uncertainty to a limited set of respondents. k-anonymity demands that every tuple in the private table being released be indistinguishably related to no fewer than k respondents.

Since it seems impossible, or highly impractical and limiting, to make assumptions on which data are known to a potential attacker and can be used to (re-)identify respondents, k-anonymity takes a safe approach, requiring that, in the released table itself, the respondents be indistinguishable (within a given set of individuals) with respect to the set of attributes, called the quasi-identifier, that can be exploited for linking. In other words, k-anonymity requires that if a combination of values of quasi-identifying attributes appears in the table, then it appears with at least k occurrences.

To illustrate, consider a private table reporting, among other attributes, the marital status, the sex, the working hours of individuals, and whether they suffer from hypertension. Assume attributes Marital status, Sex, and Hours are the attributes jointly constituting the quasi-identifier. Figure 5.1 is a simplified representation of the projection of the private table over the quasi-identifier.
The representation has been simplified by collapsing tuples with the same quasi-identifying values into a single tuple. The numbers at the right-hand side of the table report, for each tuple, the number of actual occurrences, also specifying how many of these occurrences have values Y and N, respectively, for attribute Hypertension. For simplicity, in the following we use such a simplified table as our table PT.

    Marital status   Sex   Hours   #tuples (Hyp. values)
    divorced         M     35      2  (0Y, 2N)
    divorced         M     40      17 (16Y, 1N)
    divorced         F     35      2  (0Y, 2N)
    married          M     35      10 (8Y, 2N)
    married          F     50      9  (2Y, 7N)
    single           M     40      26 (6Y, 20N)

    Figure 5.1. Simplified representation of a private table

The private table PT in Figure 5.1 guarantees k-anonymity only for k ≤ 2. In fact, the table has only two occurrences of divorced (fe)males working 35 hours. If such a situation is satisfied in a particular correlated external table as well, the uncertainty about the identity of such respondents can be reduced to two specific individuals. In other words, a data recipient can infer that any information appearing in the table for such divorced (fe)males working 35 hours actually pertains to one of two specific individuals.

It is worth pointing out a simple but important observation (to which we will come back later in the chapter): if a tuple has k occurrences, then any of its sub-tuples must have at least k occurrences. In other words, the existence of k occurrences of any sub-tuple is a necessary (not sufficient) condition for having k occurrences of a super-tuple. For instance, with reference to our example, k-anonymity over quasi-identifier {Marital status, Sex, Hours} requires that each value of the individual attributes, as well as of any sub-tuple corresponding to a combination of them, appears with at least k occurrences. This observation will be exploited later in the chapter to assess the non-satisfaction of a k-anonymity constraint for a table based on the fact that a sub-tuple of the quasi-identifier appears with fewer than k occurrences. Again with reference to our example, the observation that there are only two tuples referring to divorced females allows us to assert that the table will certainly not satisfy k-anonymity for k > 2 (since the two occurrences will remain at most two when adding attribute Hours).

Two main techniques have been proposed for enforcing k-anonymity on a private table: generalization and suppression, both enjoying the property of preserving the truthfulness of the data.

Generalization consists in replacing attribute values with a generalized version of them. Generalization is based on a domain generalization hierarchy and a corresponding value generalization hierarchy on the values in the domains. Typically, the domain generalization hierarchy is a total order and the corresponding value generalization hierarchy a tree, where the parent/child relationship represents the direct generalization/specialization relationship. Figure 5.2 illustrates an example of possible domain and value generalization hierarchies for the quasi-identifying attributes of our example.

Generalization can be applied at the level of a single cell (substituting the cell value with a generalized version of it) or at the level of an attribute (generalizing all the cells in the corresponding column).
It is easy to see how generalization can enforce k-anonymity: values that were different in the private table can be generalized to a same value, whose number of occurrences would be the sum of the number of occurrences of the values that have been generalized to it. The same reasoning extends to tuples.

    Figure 5.2. An example of domain and value generalization hierarchies:
    (a) Marital status: M2 = {any marital status}, M1 = {been married, never married}, M0 = {married, divorced, single};
        married and divorced generalize to been married, single generalizes to never married, and both generalize to any marital status.
    (b) Sex: S1 = {any sex}, S0 = {F, M}; F and M generalize to any sex.
    (c) Hours: H2 = {[1, 100)}, H1 = {[1, 40), [40, 100)}, H0 = {35, 40, 50};
        35 generalizes to [1, 40), 40 and 50 generalize to [40, 100), and both intervals generalize to [1, 100).

Figure 5.11(d) reports the result of a generalization over attribute Sex on the table in Figure 5.1, which resulted, in particular, in divorced people working 35 hours being collapsed to the same tuple {divorced, any sex, 35}, with 4 occurrences. The table in Figure 5.11(d) satisfies k-anonymity for any k ≤ 4 (since there are no fewer than 4 respondents for each combination of values of quasi-identifying attributes). Note that 4-anonymity could be guaranteed also by generalizing only (to any sex) the sex value of divorced people (males and females) working 35 hours, while leaving the other tuples unaltered, since for all the other tuples not satisfying this condition there are already at least 4 occurrences in the private table. This cell generalization approach has the advantage of avoiding generalizing all values in a column when generalizing only a subset of them suffices to guarantee k-anonymity. It has, however, the disadvantage of not preserving the homogeneity of the values appearing in the same column.

Suppression consists in protecting sensitive information by removing it. Suppression, which can be applied at the level of a single cell, an entire tuple, or an entire column, allows reducing the amount of generalization to be enforced to achieve k-anonymity. Intuitively, if a limited number of outliers would force a large amount of generalization to satisfy a k-anonymity constraint, then such outliers can be removed from the table, thus allowing satisfaction of k-anonymity with less generalization (and therefore reducing the loss of information).

Figure 5.3 summarizes the different combinations of generalization and suppression at different granularity levels (including combinations where one of the two techniques is not adopted), which correspond to different approaches and solutions to the k-anonymity problem [11].

    Generalization \ Suppression   Tuple                    Attribute                Cell             None
    Attribute                      AG_TS                    AG_AS ≡ AG               AG_CS            AG ≡ AG_AS
    Cell                           CG_TS (not applicable)   CG_AS (not applicable)   CG_CS ≡ CG       CG ≡ CG_CS
    None                           TS                       AS                       CS               not interesting

    Figure 5.3. Classification of k-anonymity techniques [11]

It is interesting to note that the application of generalization and suppression at the same granularity level is equivalent to the application of generalization only (AG_AS ≡ AG and CG_CS ≡ CG), since suppression can be modeled as a generalization to the top element in the value generalization hierarchy. Combinations CG_TS (cell generalization, tuple suppression) and CG_AS (cell generalization, attribute suppression) are not applicable since the application of generalization at the cell level implies the application of suppression at that level too.
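Before moving to the algorithms, the definitions of this section can be made concrete with a small sketch that counts quasi-identifier occurrences in the table of Figure 5.1 and applies the attribute generalization of Sex used in the example above. The tuple encoding and function names are illustrative only.

    from collections import Counter

    # The simplified private table of Figure 5.1: (Marital status, Sex, Hours, count)
    PT = [("divorced", "M", 35, 2), ("divorced", "M", 40, 17), ("divorced", "F", 35, 2),
          ("married", "M", 35, 10), ("married", "F", 50, 9), ("single", "M", 40, 26)]

    def is_k_anonymous(rows, k):
        """k-anonymity over the quasi-identifier: every combination of
        quasi-identifying values must occur at least k times."""
        counts = Counter()
        for marital, sex, hours, n in rows:
            counts[(marital, sex, hours)] += n
        return all(c >= k for c in counts.values())

    def generalize_sex(rows):
        """Attribute generalization of Sex to 'any sex' (one step up, S0 -> S1)."""
        return [(marital, "any sex", hours, n) for marital, sex, hours, n in rows]

    print(is_k_anonymous(PT, 4))                   # False: only 2 divorced males at 35 hours
    print(is_k_anonymous(generalize_sex(PT), 4))   # True: the two 35-hour divorced groups merge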
5.3 Algorithms for Enforcing k-Anonymity The application of generalization and suppression to a private table PT produces less precise (more general) and less complete (some values are sup- pressed) tables that provide protection of the respondents’ identities. It is im- portant to maintain under control, and minimize, the information loss (in terms of loss of precision and completeness) caused by generalization and suppres- sion. Different definitions of minimality have been proposed in the literature and the problem of finding minimal k-anonymous tables, with attribute gener- alization and tuple suppression, has been proved to be computationally hard [2, 3, 22]. Within a given definition of minimality, more generalized tables, all ensur- ing minimal information loss, may exist. While existing approaches typically aim at returning any of such solutions, different criteria could be devised ac- cording to which a solution should be preferred over the others. This aspect is particularly important in data mining, where there is the need to maximize the usefulness of the data with respect to the goal of the data mining process k-Anonymous Data Mining: A Survey 111 (see Section 5.6). We now describe some algorithms proposed in literature for producing k-anonymous tables. Samarati’s Algorithms. The first algorithm for AG TS (i.e., generalization over quasi-identifier attributes and tuple suppression) was proposed in con- junction with the definition of k-anonymity [26]. Since the algorithm operates on a set of attributes, the definition of domain generalization hierarchy is ex- tended to refer to tuples of domains. The domain generalization hierarchy of a domain tuple is a lattice, where each vertex represents a generalized table that is obtained by generalizing the involved attributes according to the corre- sponding domain tuple and by suppressing a certain number of tuples to fulfill the k-anonymity constraint. Figure 5.4 illustrates an example of domain gen- eralization hierarchy obtained by considering Marital status and Sex as quasi-identifying attributes, that is, by considering the domain tuple M0,S0. Each path in the hierarchy corresponds to a generalization strategy according to which the original private table PT can be generalized. The main goal of the algorithm is to find a k-minimal generalization that suppresses less tuples. Therefore, given a threshold MaxSup specifying the maximum number of tu- ples that can be suppressed, the algorithm has to compute a generalization that satisfies k-anonymity within the MaxSup constraint. Since going up in the hi- erarchy the number of tuples that must be removed to guarantee k-anonymity decreases, the algorithm performs a binary search on the hierarchy. Let h be the height of the hierarchy. The algorithm first evaluates all the solutions at height h/2. If there is at least a k-anonymous table that satisfies the MaxSup threshold, the algorithm checks solutions at height h/4; otherwise it evalu- ates solutions at height 3h/4, and so on, until it finds the lowest height where there is a solution that satisfies the k-anonymity constraint. As an example, consider the private table in Figure 5.1 with QI={Marital status, Sex}, the domain and value generalization hierarchies in Figure 5.2, and the gener- alization hierarchy in Figure 5.4. Suppose also that k =4and MaxSup=1. 
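Before continuing with the example, the binary search just described can be sketched schematically. The lattice of candidate generalizations at a given height and the check of k-anonymity within the MaxSup threshold are assumed to be provided by hypothetical helpers (solutions_at_height and satisfies_within_maxsup), which are not part of the chapter; the example below then instantiates this search on the hierarchy of Figure 5.4 with k = 4 and MaxSup = 1.

```python
def lowest_satisfying_height(hierarchy_height, k, max_sup,
                             solutions_at_height, satisfies_within_maxsup):
    """Binary search over the heights of the domain generalization
    hierarchy for the lowest height containing a generalization that is
    k-anonymous after suppressing at most max_sup tuples (sketch only)."""
    low, high = 0, hierarchy_height
    best = None
    while low <= high:
        mid = (low + high) // 2
        good = [sol for sol in solutions_at_height(mid)
                if satisfies_within_maxsup(sol, k, max_sup)]
        if good:
            best, high = good, mid - 1   # a solution exists: try lower heights
        else:
            low = mid + 1                # no solution: more generalization needed
    return best
```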
The algorithm first evaluates solutions at height 3/2,thatis,M0,S1 and M2,S1 M1,S1 <xwill belong to one of the resulting regions, while all points with d ≤ x will belong to the other region. Note that this splitting operation is allowed only if there are more than k points within any region. The algorithm terminates when there are no more splitting operations allowed. The tuples within a given region are then generalized to a unique tuple of summary statistics for the considered region. For each quasi- identifying attribute, a summary statistic may simply be a static value (e.g., the average value) or the pair of maximum and minimum values for the attribute in the region. As an example, consider the private table PT in Figure 5.1 and suppose that QI = {Marital status, Sex} and k =10. Figure 5.8(a) illustrates the two dimensional representation of the table for the Mari- tal status and Sex quasi-identifying attributes, where the number asso- ciated with each point corresponds to the occurrences of the quasi-identifier value in PT. Suppose to perform a split operation on the Marital status dimension. The resulting two regions illustrated in Figure 5.8(b) are 10- anonymous. The bottom region can be further partitioned along the Sex dimension, as represented in Figure 5.8(c). Another splitting operation along the Marital status dimension can be performed on the region containing the points that correspond to the quasi-identifying values married,M and divorced,M. Figure 5.8(d) illustrates the final solution. The experimental results [19] show that the Mondrian multidimensional method obtains good solutions for the k-anonymity problem, also compared with k-Optimize and Incognito. Approximation Algorithms. Since the majority of the exact algorithms proposed in literature have computational time exponential in the number of the attributes composing the quasi-identifier, approximation algorithms have been also proposed. Approximation algorithms for CS and CG have been 116 Privacy-Preserving Data Mining: Models and Algorithms MF divorced married single 26 10 19 9 2 (a) MF divorced married single 26 10 19 9 2 (b) MF divorced married single 26 10 19 9 2 (c) MF divorced married single 26 10 19 9 2 (d) Figure 5.8. Spatial representation (a) and possible partitioning (b)-(d) of the table in Figure 5.1 presented, both for general and specific values of k (e.g., 1.5-approximation1 for 2-anonymity, and 2-approximation for 3-anonymity [3]). The first approximation algorithm for CS was proposed by Meyerson and Williams [22] and guarantees a O(k log(k))-approximation. The best-known approximation algorithm for CS is described in [2] and guarantees a O(k)- approximate solution. The algorithm constructs a complete weighted graph from the original private table PT. Each vertex in the graph corresponds to a tuple in PT, and the edges are weighted with the number of different attribute values between the two tuples represented by extreme vertices. The algorithm then constructs, starting from the graph, a forest composed of trees containing at least k vertices, which represents the clustering for k-anonymization. Some cells in the vertices are suppressed to obtain that all the tuples in the same tree have the same quasi-identifier value. 
The cost of a vertex is evaluated as the number of cells suppressed, and the cost of a tree is the sum of the weights of 1In a minimization framework, a p-approximation algorithm guarantees that the cost C of its solution is such that C/C∗ ≤ p,whereC∗ is the cost of an optimal solution [17]. -Anonymous Data Mining: A Survey 117 its vertices. The cost of the final solution is equal to the sum of the costs of its trees. In constructing the forest, the algorithm limits the maximum number of vertices in a tree to be 3k−3. Partitions with more than 3k−3 elements are de- composed, without increasing the total solution cost. The construction of trees with no more than 3k − 3 vertices guarantees a O(k)-approximate solution. An approximation algorithm for CG is described in [3] as a direct exten- sion of the approximation algorithm for CS presented in [2]. For taking into account the generalization hierarchies, each edge has a weight that is computed as follows. Given two tuples i and j and an attribute a, the generalization cost hi,j(a) associated with a is the lowest level of the value generalization hierar- chy of a such that tuples i and j have the same generalized value for a.The weight w(e) of the edge e =(i, j) is therefore w(e)=Σahi,j(a)/la,wherela is the number of levels in the value generalization hierarchy of a. The solution of this algorithm is guaranteed to be a O(k)-approximation. Besides algorithms that compute k-anonymized tables for any value of k, ad-hoc algorithms for specific values of k have also been proposed. For in- stance, to find better results for Boolean attributes, in the case where k =2or k =3, an ad-hoc approach has been provided in [3]. The algorithm for k =2 exploits the minimum-weight [1, 2]-factor built on the graph constructed for the 2-anonymity. The [1, 2]-factor for graph G is a spanning subgraph of G built using only vertices with no more than 2 outgoing edges. Such a subgraph is a vertex-disjoint collection of edges and pairs of adjacent vertices and can be computed in polynomial time. Each component in the subgraph is treated as a cluster, and a 2-anonymized table is obtained by suppressing each cell, for which the vectors in the cluster differ in value. This procedure is a 1.5- approximation algorithm. The approximation algorithm for k =3is similar and guarantees a 2-approximation solution. 5.4 k-Anonymity Threats from Data Mining Data mining techniques allow the extraction of information from large col- lections of data. Data mined information, even if not explicitly including the original data, is built on them and can therefore allow inferences on origi- nal data to be withdrawn, possibly putting privacy constraints imposed on the original data at risk. This observation holds also for k-anonymity. The desire to ensure k-anonymity of the data in the collection may therefore require to impose restrictions on the possible output of the data mining process. In this section, we discuss possible threats to k-anonymity that can arise from per- forming mining on a collection of data maintained in a private table PT subject to k-anonymity constraints. We discuss the problems for the two main classes of data mining techniques, namely association rule mining and classification mining. k 118 Privacy-Preserving Data Mining: Models and Algorithms 5.4.1 Association Rules The classical association rule mining operates on a set of transactions, each composed of a set of items, and produce association rules of the form X → Y,whereX and Y are sets of items. 
Intuitively, rule X → Y expresses the fact that transactions that contain items X tend to also contain items Y. Each rule has a support and a confidence, in the form of percentage. The support expresses the percentage of transactions that contain both X and Y, while the confidence expresses the percentage of transactions, among those containing X, that also contain Y. Since the goal is to find common patterns, typically only those rules that have support and confidence greater than some predefined thresholds are considered of interest [5, 28, 31]. Translating association rule mining over a private table PT on which k- anonymity should be enforced, we consider the values appearing in the table as items, and the tuples reporting respondents’ information as transactions. For simplicity, we assume here that the domains of the attributes are disjoint. Also, we assume support and confidence to be expressed in absolute values (in contrast to percentage). The reason for this assumption, which is consistent with the approaches in the literature, is that k-anonymity itself is expressed in terms of absolute numbers. Note, however, that this does not imply that the release itself will be made in terms of absolute values. Association rule mining over a private table PT allows then the extrac- tion of rules expressing combination of values common to different respon- dents. For instance, with reference to the private table in Figure 5.1, rule {divorced}→{M} with support 19, and confidence 19 21 states that 19 tuples in the table refer to divorced males, and among the 21 tuples referring to divorced people 19 of them are male. If the quasi-identifier of table PT contains both at- tributes Marital status and Sex, it is easy to see that such a rule violates any k-anonymity for k>19, since it reflects the existence of 19 respondents who are divorced male (being Marital status and Sex included in the quasi-identifier, this implies that no more than 19 indistinguishable tuples can exist for divorced male respondents). Less trivially, the rule above violates also k-anonymity for any k>2, since it reflects the existence of 2 respondents who are divorced and not male; again, being Marital status and Sex included in the quasi-identifier, this implies that no more than 2 indistinguishable tuples can exist for non male divorced respondents. 5.4.2 Classification Mining In classification mining, a set of database tuples, acting as a training sam- ple, are analyzed to produce a model of the data that can be used as a predictive classification method for classifying new data into classes. Goal of the classi- fication process is to build a model that can be used to further classify tuples k-Anonymous Data Mining: A Survey 119 being inserted and that represents a descriptive understanding of the table con- tent [25]. One of the most popular classification mining techniques is represented by decision trees, defined as follows. Each internal node of a decision tree is as- sociated with an attribute on which the classification is defined (excluding the classifying attributes, which in our example is Hypertension). Each out- going edge is associated with a split condition representing how the data in the training sample are partitioned at that tree node. The form of a split condition depends on the type of the attribute. For instance, for a numerical attribute A, the split condition may be of the form A ≤ v,wherev is a possible value for A. 
Each node contains information about the number of samples at that node and how they are distributed among the different class values. As an example, the private table PT in Figure 5.1 can be used as a learning set to build a decision tree for predicting if people are likely to suffer from hypertension problems, based on their marital status, if they are male, and on their working hours, if they are female. A possible decision tree for such a case, performing the classification based on some values appearing in quasi-identifier attributes, is illustrated in Figure 5.9. The quasi-identifier attributes correspond to internal (splitting) nodes in the tree, edges are labeled with (a subset of) attribute values instead of reporting the complete split condition, and nodes simply contain the number of respondents classified by the node values, distinguishing between people suffering (Y) and not suffering (N) from hypertension.

Figure 5.9. An example of decision tree. The root splits on Sex (32 Y, 34 N). The M branch splits on Marital status (30 Y, 25 N), with leaves married (8 Y, 2 N), divorced (16 Y, 3 N), and single (6 Y, 20 N). The F branch splits on Hours (2 Y, 9 N), with leaves 35 (0 Y, 2 N) and 50 (2 Y, 7 N).

While the decision tree does not directly release the data of the private table, it indeed allows inferences on them. For instance, Figure 5.9 reports the existence of 2 females working 35 hours (node reachable from path F,35). Again, since Sex and Hours belong to the quasi-identifier, this information reflects the existence of no more than two respondents for such occurrences of values, thus violating k-anonymity for any k > 2. Like for association rules, threats can also be possible by combining classifications given by different nodes along the same path. For instance, considering the decision tree in Figure 5.9, the combined release of the nodes reachable from paths F (with 11 occurrences) and F,50 (with 9 occurrences) allows inferring that there are 2 female respondents in PT who do not work 50 hours per week.

5.5 k-Anonymity in Data Mining

Section 5.4 has illustrated how data mining results can compromise the k-anonymity of a private table, even if the table itself is not released. Since proper privacy guarantees are a must for enabling information sharing, it is then important to devise solutions ensuring that data mining does not open the door to possible privacy violations. With particular reference to k-anonymity, we must ensure that k-anonymity for the original table PT be not violated. There are two possible approaches to guarantee k-anonymity in data mining.

Anonymize-and-Mine: anonymize the private table PT and perform mining on its k-anonymous version.

Mine-and-Anonymize: perform mining on the private table PT and anonymize the result. This approach can be performed by executing the two steps independently or in combination.

Figure 5.10 provides a graphical illustration of these approaches, reporting, for the Mine-and-Anonymize approach, the two different cases: one step or two steps.

Figure 5.10. Different approaches for combining k-anonymity and data mining. Anonymize-and-Mine: PT --anonymize--> PTk --mine--> MDk. Mine-and-Anonymize, two steps: PT --mine--> MD --anonymize--> MDk; one step: PT --anonymized mining--> MDk.

In the figure, boxes represent data, while arcs represent processes producing data from data.
The different data boxes are: PT, the private table; PTk, an anonymized version of PT; MD, a result of a data mining process (without any consideration of k-anonymity constraints); and MDk, a result of a data mining process that respects the k-anonymity constraint for the private table PT. Dashed lines for boxes and arcs denote data and processes, respectively, reserved to the data holder, while continuous lines denote data and processes that can be viewed and executed by other parties (as their visibility and execu- tion does not violate the k-anonymity for PT). Let us then discuss the two approaches more in details and their trade-offs between applicability and efficiency of the process on the one side, and utility of data on the other side. Anonymize-and-Mine (AM) This approach consists in applying a k- anonymity algorithm on the original private table PT and releasing then a table PTk that is a k-anonymized version of PT. Data mining is performed, by the data holder or even external parties, on PTk.The advantage of such an approach is that it allows the decoupling of data protection from mining, giving a double benefit. First, it guarantees that data mining is safe: since data mining is executed on PTk (and not on PT), by definition the data mining results cannot violate k-anonymity for PT. Second, it allows data mining to be executed by others than the data holder, enabling different data mining processes and different uses of the data. This is convenient, for example, when the data holder may not know a priori how the recipient may analyze and classify the data. Moreover, the recipient may have application-specific data min- ing algorithms and she may want to directly define parameters (e.g., accuracy and interpretability) and decide the mining method only af- ter examining the data. On the other hand, the possible disadvantages of performing mining on anonymized data is that mining operates on less specialized and complete data, therefore usefulness and significance of the mining results can be compromised. Since classical k-anonymity approaches aim at satisfying k-anonymity minimizing information loss (i.e., minimizing the amount of generalization and suppression adopted), a k-anonymity algorithm may produce a result that is not suited for min- ing purposes. As a result, classical k-anonymity algorithms may hide information that is highly useful for data mining purposes. Particular care must then be taken in the k-anonymization process to ensure maxi- mal utility of the k-anonymous table PTk with respect to the goals of the data mining process that has to be executed. In particular, the aim of k- anonymity algorithms operating on data intended for data mining should not be the mere minimization of information loss, but the optimization of a measure suitable for data mining purposes. A further limitation of 122 Privacy-Preserving Data Mining: Models and Algorithms the Anonymize-and-Mine approach is that it is not applicable when the input data can be accessed only once (e.g., when the data source is a stream). Also, it may be overall less efficient, since the anonymization process may be quite expensive with respect to the mining one, espe- cially in case of sparse and large databases [1]. Therefore, performing k-anonymity before data mining is likely to be more expensive than do- ing the contrary. 
Mine-and-Anonymize (MA) This approach consists in mining original non- k-anonymous data, performing data mining on the original table PT, and then applying an anonymization process on the data mining result. Data mining can then be performed by the data holder only, and only the sanitized data mining results (MDk) are released to other parties. The definition of k-anonymity must then be adapted to the output of the data mining phase. Intuitively, no inference should be possible on the mined data allowing violating k-anonymity for the original table PT. This does not mean that the table PT must be k-anonymous, but that if it was not, it should not be known and the effect of its non being k-anonymous be not visible in the mined results. In the Mine-and-Anonymize approach, k-anonymity constraints can be taken into consideration after data min- ing is complete (two-step Mine-and-Anonymize) or within the mining process itself (one-step Mine-and-Anonymize). In two-step Mine-and- Anonymize the result needs to be sanitized removing from MD all data that would compromise k-anonymity for PT.Inone-step Mine-and- Anonymize the data mining algorithm needs to be modified so to en- sure that only results that would not compromise k-anonymity for PT are computed (MDk). The two possible implementations (one step vs two steps) provide different trade-offs between applicability and effi- ciency: two-step Mine-and-Anonymize does not require any modifica- tion to the mining process and therefore can use any data mining tool available (provided that results are then anonymized); one-step Mine- and-Anonymize requires instead to redesign data mining algorithms and tools to directly enforce k-anonymity, combining the two steps can how- ever result in a more efficient process giving then performance advan- tages. Summarizing, the main drawback of Mine-and-Anonymize is that it requires mining to be executed only by the data holder (or parties au- thorized to access the private table PT). This may therefore impact ap- plicability. The main advantages are efficiency of the mining process and quality of the results: performing mining before, or together with, anonymization can in fact result more efficient and allow to keep data distortion under control to the goal of maximizing the usefulness of the data. k-Anonymous Data Mining: A Survey 123 5.6 Anonymize-and-Mine The main objective of classical k-anonymity techniques is the minimiza- tion of information loss. Since a private table may have more than one mini- mal k-anonymous generalization, different preference criteria can be applied in choosing a minimal generalization, such as minimum absolute distance, min- imum relative distance, maximum distribution, or minimum suppression [26]. In fact, the strategies behind heuristics for k-anonymization can be typically based on preference criteria or even user policies (e.g., the discourage of the generalization of some given attributes). In the context of data mining, the main goal is retaining useful information for data mining, while determining a k-anonymization that protects the respon- dents against linking attacks. However, it is necessary to define k-anonymity algorithms that guarantee data usefulness for subsequent mining operations. A possible solution to this problem is the use of existing k-anonymizing algo- rithms, choosing the maximization of the usefulness of the data for classifica- tion as a preference criteria. 
Recently, two approaches that anonymize data before mining have been presented for classification (e.g., decision trees): a top-down [16] and a bottom-up [29] technique. These two techniques aim at releasing a k-anonymous table T(A1,...,Am, class) for modeling classification of attribute class considering the quasi-identifier QI = {A1,...,Am}. k-anonymity is achieved with cell generalization and cell suppression (CG_CS), that is, different cells of the same attribute may have values belonging to different generalized domains. The aim of preserving anonymity for classification is then to satisfy the k-anonymity constraint while preserving the classification structure in the data.

The top-down approach starts from a table containing the most general values for all attributes and tries to refine (i.e., specialize) some values. For instance, the table in Figure 5.11(a) represents a completely generalized table for the table in Figure 5.1. The bottom-up approach starts from a private table and tries to generalize the attributes until the k-anonymity constraint is satisfied.

In the top-down technique a refinement is performed only if it has some suitable properties for guaranteeing both anonymity and good classification. For this purpose, a selection criterion is described for guiding the top-down refinement process to heuristically maximize the classification goal. The refinement has two opposite effects: it increases the information of the table for classification and it decreases its anonymity. The algorithm is guided by the functions InfoGain(v) and AnonyLoss(v) measuring the information gain and the anonymity loss, respectively, where v is the attribute value (cell) candidate for refinement. A good candidate v is such that InfoGain(v) is large, and AnonyLoss(v) is small. Thus, the selection criterion for choosing the candidate v to be refined maximizes function Score(v) = InfoGain(v) / (AnonyLoss(v) + 1). Function Score(v) is computed for each value v of the attributes in the table. The value with the highest score is then specialized to its children in the value generalization hierarchy. An attribute value v, candidate for specialization, is considered useful to obtain a good classification if the frequencies of the class values are not uniformly distributed for the specialized values of v. The entropy of a value in a table measures the dominance of the majority: the more dominating the majority value in the class is, the smaller the entropy is. InfoGain(v) then measures the reduction of entropy after refining v (for a formal definition of InfoGain(v) see [16]). A good candidate is a value v that reduces the entropy of the table.

Marital status      Sex      Hours    #tuples (Hyp. values)
any marital status  any sex  [1,100)  66 (32Y, 34N)
(a) Step 1: the most general table

been married        any sex  [1,100)  40 (26Y, 14N)
never married       any sex  [1,100)  26 (6Y, 20N)
(b) Step 2

divorced            any sex  [1,100)  21 (16Y, 5N)
married             any sex  [1,100)  19 (10Y, 9N)
never married       any sex  [1,100)  26 (6Y, 20N)
(c) Step 3

divorced            any sex  35       4 (0Y, 4N)
divorced            any sex  40       17 (16Y, 1N)
married             any sex  35       10 (8Y, 2N)
married             any sex  50       9 (2Y, 7N)
single              any sex  40       26 (6Y, 20N)
(d) Final table (after 7 steps)

Figure 5.11. An example of top-down anonymization for the private table in Figure 5.1

For instance, with reference to the private table in Figure 5.1 and its generalized version in Figure 5.11(a), InfoGain(any marital status) is high since for been married we have 14 N and 26 Y, with a difference of 12, and for never married we have 20 N and 6 Y, with a difference of 14 (see Figure 5.11(b)). On the contrary, InfoGain([1, 100)) is low since for [1, 40) we have 8 Y and 6 N, with a difference of 2, and for [40, 100) we have 24 Y and 28 N, with a difference of 4. Thus Marital status is more useful for classification than Hours.
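To make the selection criterion concrete, the following Python sketch computes an entropy-based InfoGain for the two candidate refinements above and the corresponding Score. It is only an illustration (the precise definitions are in [16]); AnonyLoss is taken here as the drop in the anonymity degree, a notion made precise just below, and the function names are ours.

```python
from math import log2

def entropy(counts):
    """Entropy of a class distribution given as a list of counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def info_gain(parent, children):
    """Reduction of class entropy when a value is specialized into its
    children; parent and children carry (Y, N) counts."""
    n = sum(parent)
    return entropy(parent) - sum(sum(ch) / n * entropy(ch) for ch in children)

def score(gain, anony_loss):
    """Selection criterion of the top-down approach."""
    return gain / (anony_loss + 1)

# Refining "any marital status" (Figure 5.11(a) -> Figure 5.11(b)):
g_marital = info_gain((32, 34), [(26, 14), (6, 20)])
# Refining "[1,100)" into [1,40) and [40,100):
g_hours = info_gain((32, 34), [(8, 6), (24, 28)])
print(round(g_marital, 3), round(g_hours, 3))   # marital status gains more

# Anonymity degree drops from 66 to 26 when refining any marital status.
print(round(score(g_marital, 66 - 26), 4))      # Score(any marital status)
```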
Let us define the anonymity degree of a table as the maximum k for which the table is k-anonymous. The loss of anonymity, defined as AnonyLoss(v), is the difference between the degrees of anonymity of the table before and after refining v. For instance, the degrees of the tables in Figures 5.11(b) and 5.11(c) are 26 (tuples containing: never married, any sex, [1,100)) and 19 (tuples containing: married, any sex, [1,100)), respectively. Since the table in Figure 5.11(c) is obtained by refining the value been married of the table in Figure 5.11(b), AnonyLoss(been married) is 7. The algorithm terminates when any further refinement would violate the k-anonymity constraint.

Example 5.1 Consider the private table in Figure 5.1, and the value generalization hierarchies in Figure 5.2. Let us suppose QI = {Marital status, Sex, Hours} and k = 4. The algorithm starts from the most generalized table in Figure 5.11(a), and computes the scores: Score(any marital status), Score(any sex), and Score([1, 100)). Since the maximum score corresponds to value any marital status, this value is refined, producing the table in Figure 5.11(b). The remaining tables computed by the algorithm are shown in Figures 5.11(c) and 5.11(d). Figure 5.11(d) illustrates the final table since the only possible refinement (any sex to M and F) violates 4-anonymity. Note that the final table is 4-anonymous with respect to QI = {Marital status, Sex, Hours}.

The bottom-up approach is the dual of the top-down approach. Starting from the private table, the objective of the bottom-up approach is to generalize the values in the table to determine a k-anonymous table preserving good qualities for classification and minimizing information loss. The effect of generalization is thus measured by a function involving anonymity gain (instead of anonymity loss) and information loss.

Note that, since these methods compute a minimal k-anonymous table suitable for classification with respect to class and QI, the computed table PTk is optimized only if classification is performed using the entire set QI. Otherwise, the obtained table PTk could be too general. For instance, consider the table in Figure 5.1: the table in Figure 5.11(d) is a 4-anonymization for it considering QI = {Marital status, Sex, Hours}. If classification is to be done with respect to a subset QI′ = {Marital status, Sex} of QI, such a table would be too general. As a matter of fact, a 4-anonymization for PT with respect to QI′ can be obtained from PT by simply generalizing divorced and married to been married. This latter generalization would generalize only 40 cells, instead of the 66 cells (M and F to any sex) generalized in the table in Figure 5.11(d).

5.7 Mine-and-Anonymize

The Mine-and-Anonymize approach performs mining on the original table PT. Anonymity constraints must therefore be enforced with respect to the mined results to be returned.
Regardless of whether the approach is executed in one or two steps (see Section 5.5), the problem to be solved is to translate k- anonymity constraints for PT over the mined results. Intuitively, the mined re- sults should not allow anybody to infer the existence of sets of quasi-identifier values that have less than k occurrences in the private table PT. Let us then discuss what this implies for association rules and for decision trees. 5.7.1 Enforcing k-Anonymity on Association Rules To discuss k-anonymity for association rules it is useful to distinguish the two different phases of association rule mining: 1 find all combinations of items whose support (i.e., the number of joint occurrences in the records) is greater than a minimum threshold σ (fre- quent itemsets mining); 2 use the frequent itemsets to generate the desired rules. The consideration of these two phases conveniently allows expressing k- anonymity constraints with respect to observable itemsets instead of associa- tion rules. Intuitively, k-anonymity for PT is satisfied if the observable itemsets do not allow inferring (the existence of) sets of quasi-identifier values that have less than k occurrences in the private table. It is trivial to see that any itemset X that includes only values on quasi-identifier attributes and with a support lower than k is clearly unsafe. In fact, the information given by the itemset corresponds to stating that there are less than k respondents with occurrences of values as in X, thus violating k-anonymity. Besides trivial itemsets such as this, also the combination of itemsets with support greater than or equal to k can breach k-anonymity. As an example, consider the private table in Figure 5.1, where the quasi- identifier is {Marital status, Sex, Hours} and suppose 3-anonymity must be guaranteed. All itemsets with support lower than 3 clearly violate the constraint. For instance, itemset {divorced,F} with support 2, which holds in the table, cannot be released. Figure 5.12 illustrates some examples of item- sets with support greater than or equal to 19 (assuming lower supports are not of interest). While one may think that releasing these itemsets guarantees any k-anonymity for k ≤ 19, it is not so. Indeed, the combination of the two item- sets {divorced,M}, with support 19, and {divorced}, with support 21, k-Anonymous Data Mining: A Survey 127 Itemset Support {∅} 66 {M} 55 {M, 40} 43 {single, M, 40} 26 {divorced} 21 {divorced, M} 19 {married} 19 Figure 5.12. Frequent itemsets extracted from the table in Figure 5.1 clearly violates it. In fact, from their combination we can infer the existence of two tuples in the private table for which the condition ‘Marital status = divorced ∧¬(Sex = M)’ is satisfied. Being Marital status and Sex included in the quasi-identifier, this implies that no more than 2 indistinguish- able tuples can exist for divorced non male respondents, thus violating k- anonymity for k>2. In particular, since Sex can assume only two values, the two itemsets above imply the existence of (not released) itemset {divorced, F} with support 2. Note that, although both itemsets ({divorced}, 21) and ({divorced,M}, 19) cannot be released, there is no reason to suppress both, since each of them individually taken is safe. 
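The two violations just illustrated (an itemset over quasi-identifier values with support below k, and a pair of itemsets whose supports differ by fewer than k occurrences) can be checked mechanically. A minimal sketch over the supports of Figure 5.12, with k = 3 as in the example; only the immediate-superset case discussed above is checked, while the general pattern-based condition is formalized in the next section, and the helper name is ours.

```python
from itertools import combinations

# Supports of the released itemsets over quasi-identifier values (Figure 5.12).
SUPPORT = {
    frozenset(): 66, frozenset({"M"}): 55, frozenset({"M", "40"}): 43,
    frozenset({"single", "M", "40"}): 26, frozenset({"divorced"}): 21,
    frozenset({"divorced", "M"}): 19, frozenset({"married"}): 19,
}

def violations(support, k):
    """Flag itemsets with support below k, and pairs X ⊂ Y (Y = X plus
    one item) whose support difference reveals fewer than k respondents."""
    unsafe = [(set(x), s) for x, s in support.items() if 0 < s < k]
    channels = []
    for a, b in combinations(support, 2):
        small, big = (a, b) if len(a) < len(b) else (b, a)
        if small < big and len(big - small) == 1:
            diff = support[small] - support[big]
            if 0 < diff < k:        # pattern "small and not (big - small)"
                channels.append((set(small), set(big), diff))
    return unsafe, channels

unsafe, channels = violations(SUPPORT, k=3)
print(unsafe)      # []: all released supports are at least 19
print(channels)    # [({'divorced'}, {'divorced', 'M'}, 2)]
```

The flagged pair is exactly the {divorced} / {divorced, M} combination discussed above, which reveals the existence of only two divorced non-male respondents.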
The consideration of inferences such as those, and of possible solutions for suppressing itemsets to block the inferences while maximizing the utility of the released information, bring some resembling with the primary and secondary suppression operations in statistical data release [12]. It is also important to note that suppression is not the only option that can be applied to sanitize a set of itemsets so that no unsafe inferences violating k-anonymity are possible. Alternative approaches can be investigated, including adapting classical sta- tistical protection strategies [12, 14]. For instance, itemsets can be combined, essentially providing a result that is equivalent to operating on generalized (in contrast to specific) data. Another possible approach consists in introducing noise in the result, for example, modifying the support of itemsets in such a way that their combination never allows inferring itemsets (or patterns of them) with support lower than the specified k. A first investigation of translating the k-anonymity property of a private table on itemsets has been carried out in [7–9] with reference to private ta- bles where all attributes are defined on binary domains. The identification of unsafe itemsets bases on the concept of pattern, which is a boolean for- mula of items, and on the following observation. Let X and X ∪{Ai} be two itemsets. The support of pattern X ∧¬Ai can be obtained by subtract- ing the support of itemset X ∪{Ai} from the support of X. By generalizing this observation, we can conclude that given two itemsets X = {Ax1 ...Axn } 128 Privacy-Preserving Data Mining: Models and Algorithms and Y = {Ax1 ...Axn ,Ay1 ...Aym }, with X ⊂ Y, the support of pattern Ax1 ∧ ...∧ Axn ∧¬Ay1 ∧ ...∧¬Aym (i.e., the number of tuples in the table containing X but not Y − X) can be inferred from the support of X,Y,and all itemsets Z such that X ⊂ Z ⊂ Y. This observation allows stating that a set of itemsets satisfies k-anonymity only if all itemsets, as well as the patterns derivable from them, have support greater than or equal to k. As an example, consider the private table PT in Figure 5.13(a), where all attributes can assume two distinct values. This table can be trans- formed into the binary table T in Figure 5.13(b), where A corresponds to ‘Marital status = been married’, B corresponds to ‘Sex = M’, and C corresponds to ‘Hours = [40,100)’. Figure 5.14 reports the lattice of all itemsets derivable from T together with their support. Assume that all item- sets with support greater than or equal to the threshold σ =40, represented in Figure 5.15(a), are of interest, and that k =10. The itemsets in Figure 5.15(a) present two inference channels. The first inference is obtained through itemsets X1 = {C} with support 52, and Y1 = {BC} with support 43. According to Marital status Sex Hours #tuples been married M [1-40) 12 been married M [40-100) 17 been married F [1-40) 2 been married F [40-100) 9 never married M [40-100) 26 (a) PT ABC#tuples 11012 11117 1002 1019 01126 (b) T Figure 5.13. An example of binary table ABC yyyyyyy EEEEEEE 17 AB EEEEEEE29 AC yyyyyyy EEEEEEE26 BC yyyyyyy 43 A DDDDDDDD40 B 55 C yyyyyyyy 52 ∅ 66 Figure 5.14. Itemsets extracted from the table in Figure 5.13(b) k-Anonymous Data Mining: A Survey 129 BC ~~~~~~~ 43 A ;;;;;;40 B 55 C  52 ∅ 66 (a) BC ~~~~~~~ 43 A ;;;;;;40 B 55 C  62 ∅ 86 (b) Figure 5.15. 
Itemsets with support at least equal to 40 (a) and corresponding anonymized itemsets (b) the observation previously mentioned, since X1 ⊂ Y1, we can infer that pattern C ∧¬B has support 52 − 43 = 9. The second inference channel is obtained through itemsets X2 ={∅} with support 66, Y2 = {BC} with support 43, and all itemsets Z such that X2 ⊂ Z ⊂ Y2, that is, itemsets {B} with support 55, and {C} with support 52. The support of pattern ¬B ∧¬C can then be ob- tained by applying again the observation previously mentioned. Indeed, from {BC} and {B} we infer pattern B ∧¬C with support 55−43 = 12, and from {BC} and {C} we infer pattern ¬B ∧ C with support 52 − 43 = 9.Since the support of itemset {∅} corresponds to the total number of tuples in the bi- nary table, the support of ¬B ∧¬C is computed by subtracting the support of B ∧¬C(12), ¬B ∧ C(9), and B ∧ C(43) from the support of {∅},thatis, 66−12−9−43 = 2. The result is that release of the itemsets in Figure 5.15(a) would not satisfy k-anonymity for any k>2. In [9] the authors present an algorithm for detecting inference channels that is based on a classical data mining solution for concisely representing all frequent itemsets (closed itemsets [24]) and on the definition of maximal inference channels. In the same work, the authors propose to block possi- ble inference channels violating k-anonymity by modifying the support of in- volved itemsets. In particular, an inference channel due to a pair of itemsets X = {Ax1 ...Axn } and Y = {Ax1 ...Axn ,Ay1 ...Aym } is blocked by in- creasing the support of X by k. In addition, to avoid contradictions among the released itemsets, also the support of all subsets of X is increased by k.Forin- stance, with respect to the previous two inference channels, since k is equal to 10, the support of itemset {C} is increased by 10 and the support of {∅} is in- creased by 20, because {∅} is involved in the two channels. Figure 5.15(b) illustrates the resulting anonymized itemsets. Another possible strategy for blocking channels consists in decreasing the support of the involved itemsets to zero. Note that this corresponds basically to removing some tuples in the original table. 130 Privacy-Preserving Data Mining: Models and Algorithms 5.7.2 Enforcing k-Anonymity on Decision Trees Like for association rules, a decision tree satisfies k-anonymity for the pri- vate table PT from which the tree has been built if no information in the tree allows inferring quasi-identifier values that have less than k occurrences in the private table PT. Again, like for association rules, k-anonymity breaches can be caused by individual pieces of information or by combination of appar- ently anonymous information. In the following, we briefly discuss the problem distinguishing two cases depending on whether the decision tree reports fre- quencies information for the internal nodes also or for the leaves only. Let us first consider the case where the tree reports frequencies informa- tion for all the nodes in the tree. An example of such a tree is reported in Figure 5.9. With a reasoning similar to that followed for itemsets, given a k, all nodes with a number of occurrences lower than k are unsafe as they breach k-anonymity. For instance, the fourth leaf (reachable through path F,35) is unsafe for any k-anonymity higher than 2. 
Again, with a reasoning simi- lar to that followed for itemsets, also combinations of nodes that allow infer- ring patterns of tuples containing quasi-identifying attributes with a number of occurrences lower than k breach k-anonymity for the given k. For instance, nodes corresponding to paths F and to F,50, which taken individually would appear to satisfy any k-anonymity constraint for k ≤ 9, considered in combination would violate any k-anonymity for k>2 since their com- bination allows inferring that there are no more than two tuples in the table referring to females working a number of hours different from 50. It is inter- esting to draw a relationship between decision trees and itemsets. In particular, any node in the tree corresponds to an itemset dictated by the path to reach the node. For instance, with reference to the tree in Figure 5.9, the nodes corre- spond to itemsets: {},{M},{M,married},{M,divorced},{M,single}, {F},{F,35},{F,40},{F,50}, where the support of each itemset is the sum of the YsandNs in the corresponding node. This observation can be exploited for translating approaches for sanitizing itemsets for the sanitization of deci- sion trees (or viceversa). With respect to blocking inference channels, different approaches can be used to anonymize decision trees, including suppression of unsafe nodes as well as other nodes as needed to block combinations breaching anonymity (secondary suppression). To illustrate, suppose that 3-anonymity is to be guaranteed. Figure 5.16 reports a 3-anonymized version of the tree in Figure 5.9. Here, besides suppressing node F,35, its sibling F,50 has been suppressed to block the inference channel described above. Let us now consider the case where the tree reports frequencies information only for the leaf nodes. Again, there is an analogy with the itemset problem with the additional consideration that, in this case, itemsets are such that none k-Anonymous Data Mining: A Survey 131 Sex 32 Y 34 N M ÑÑÔÔÔÔÔÔÔ F 66666666 Marital status 30 Y 25 N married ÑÑÓÓÓÓÓÓÓ divorced  single ======= 2 Y 9 N 8 Y 2 N 16 Y 3 N 6 Y 20 N Figure 5.16. 3-anonymous version of the tree of Figure 5.9 of them is a subset of another one. It is therefore quite interesting to note that the set of patterns of tuples identified by the tree nodes directly corresponds to a generalized version of the private table PT, where some values are sup- pressed (CG ). This property derives from the fact that, in this case, every tuple in PT satisfies exactly one pattern (path to a leaf). To illustrate, consider the de- cision tree in Figure 5.17, obtained from the tree in Figure 5.9 by suppressing occurrences in non-leaf nodes. Each leaf in the tree corresponds to a general- ized tuple reporting the value given by the path (for attributes appearing in the path). The number of occurrences of such a generalized tuple is reported in the leaf. If a quasi-identifier attribute does not appear along the path, then its value is set to ∗. As a particular case, if every path in the tree contains all the quasi- identifier attributes and puts conditions on specific values, the generalization coincides with the private table PT. For instance, Figure 5.18 reports the table containing tuple patterns that can be derived from the tree in Figure 5.17, and which corresponds to a generalization of the original private table PT in Fig- ure 5.1. 
The relationship between trees and generalized tables is very important as it allows us to express the protection enjoyed of a decision tree in terms of the generalized table corresponding to it, with the advantage of possibly ex- ploiting classical k-anonymization approaches referred to the private table. In particular, this observation allows us to identify as unsafe all and only those nodes corresponding to tuples whose number of occurrences is lower than k. In other words, in this case (unlike for the case where frequencies of internal nodes values are reported) there is no risk that combination of nodes, each with occurrences higher than or equal to k, can breach k-anonymity. Again, different strategies can be applied to protect decision trees in this case, including exploiting the correspondence just withdrawn, translating on 132 Privacy-Preserving Data Mining: Models and Algorithms Sex M uulllllllllllllll F ''OOOOOOOOOOOO Marital status married }}{{{{{{{{ divorced single ""DDDDDDDD Hours 35 ÑÑÓÓÓÓÓÓÓ 50 ;;;;;;; 8 Y 2 N 16 Y 3 N 6 Y 20 N 0 Y 2 N 2 Y 7 N Figure 5.17. Suppression of occurrences in non-leaf nodes in the tree in Figure 5.9 Marital status Sex Hours #tuples (Hyp. values) divorced M ∗ 19 (16Y, 3N) ∗ F35 2(0Y,2N) married M ∗ 10 (8Y, 2N) ∗ F50 9(2Y,7N) single M ∗ 26 (6Y, 20N) Figure 5.18. Table inferred from the decision tree in Figure 5.17 Sex M ||xxxxxxxxx F 7777777 Marital status been married  single !!DDDDDDDDD 2 Y 9 N 24 Y 5 N 6 Y 20 N Figure 5.19. 11-anonymous version of the tree in Figure 5.17 the tree the generalization and suppression operations that could be executed on the private table. To illustrate, consider the tree in Figure 5.17, the cor- responding generalized table is in Figure 5.18, which clearly violates any k- anonymity for k>2. Figure 5.19 illustrates a sanitized version of the tree for guaranteeing 11-anonymity obtained by suppressing the splitting node Hours and combining nodes M,married and M,divorced into a single node. Note how the two operations have a correspondence with reference to the start- ing table in Figure 5.18 with an attribute generalization over Hours and a cell generalization over Marital status, respectively. Figure 5.20 illustrates the table corresponding to the tree in Figure 5.19. The problem of sanitizing decision trees has been studied in the literature by Friedman et al. [15, 16], who proposed a method for directly building a k-Anonymous Data Mining: A Survey 133 Marital status Sex Hours #tuples (Hyp. values) been married M ∗ 29 (24Y, 5N) ∗ F ∗ 11 (2Y, 9N) single M ∗ 26 (6Y, 20N) Figure 5.20. Table inferred from the decision tree in Figure 5.19 k-anonymous decision tree from a private table PT. The proposed algorithm is basically an improvement of the classical decision tree building algorithm, combining mining and anonymization in a single process. At initialization time, the decision tree is composed of a unique root node, representing all the tuples in PT. At each step, the algorithm inserts a new splitting node in the tree, by choosing the attribute in the quasi-identifier that is more useful for classification purposes, and updates the tree accordingly. If the tree obtained is non-k-anonymous, then the node insertion is rolled back. The algorithm stops when no node can be inserted without violating k-anonymity, or when the clas- sification obtained is considered satisfactory. 
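The tree-to-table correspondence discussed above can be made concrete with a few lines of code: each leaf path becomes a generalized tuple, with ∗ for quasi-identifier attributes that do not appear along the path. This is only a sketch; the encoding of paths and the flagging threshold (k = 3) are ours.

```python
QI = ("Marital status", "Sex", "Hours")

# Leaves of the tree in Figure 5.17: (conditions along the path, (Y, N)).
LEAVES = [
    ({"Sex": "M", "Marital status": "married"},  (8, 2)),
    ({"Sex": "M", "Marital status": "divorced"}, (16, 3)),
    ({"Sex": "M", "Marital status": "single"},   (6, 20)),
    ({"Sex": "F", "Hours": "35"},                (0, 2)),
    ({"Sex": "F", "Hours": "50"},                (2, 7)),
]

def generalized_table(leaves, qi):
    """Turn each leaf path into a generalized tuple, as in Figure 5.18."""
    rows = []
    for conditions, (y, n) in leaves:
        values = [conditions.get(attr, "*") for attr in qi]
        rows.append((values, y + n, y, n))
    return rows

for values, total, y, n in generalized_table(LEAVES, QI):
    flag = "  <- violates 3-anonymity" if total < 3 else ""
    print(values, f"{total} ({y}Y, {n}N){flag}")
# The row ['*', 'F', '35'] with 2 occurrences is flagged, matching the
# discussion of the table in Figure 5.18.
```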
5.8 Conclusions A main challenge in data mining is to enable the legitimate usage and shar- ing of mined information while at the same time guaranteeing proper pro- tection of the original sensitive data. In this chapter, we have discussed how k-anonymity can be combined with data mining for protecting the identity of the respondents to whom the data being mined refer. We have described the possible threats to k-anonymity that can arise from performing mining on a collection of data and characterized two main approaches to combine k- anonymity in data mining. We have also discussed different methods that can be used for detecting k-anonymity violations and consequently eliminate them in association rule mining and classification mining. k-anonymous data mining is however a recent research area and many is- sues are still to be investigated such as: the combination of k-anonymity with other possible data mining techniques; the investigation of new approaches for detecting and blocking k-anonymity violations; and the extension of current approaches to protect the released data mining results against attribute, in con- trast to identity, disclosure [21]. Acknowledgements This work was supported in part by the European Union under contract IST- 2002-507591, by the Italian Ministry of Research Fund for Basic Research (FIRB) under project “RBNE05FKZ2”, and by the Italian MIUR under project 2006099978. 134 Privacy-Preserving Data Mining: Models and Algorithms References [1] Charu C. Aggarwal. On k-anonymity and the curse of dimensionality. In Proc. of the 31th VLDB Conference, Trondheim, Norway, September 2005. [2] Gagan Aggarwal, Tomas Feder, Krishnaram Kenthapadi, Rajeev Mot- wani, Rina Panigrahy, Dilys Thomas, and An Zhu. Anonymizing ta- bles. In Proc. of the 10th International Conference on Database Theory (ICDT’05), Edinburgh, Scotland, January 2005. [3] Gagan Aggarwal, Tomas Feder, Krishnaram Kenthapadi, Rajeev Mot- wani, Rina Panigrahy, Dilys Thomas, and An Zhu. Approximation al- gorithms for k-anonymity. Journal of Privacy Technology, November 2005. [4] Dakshi Agrawal and Charu C. Aggarwal. On the design and quantifica- tion of privacy preserving data mining algorithms. In Proc. of the 20th ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, Santa Barbara, California, June 2001. [5] Rakesh Agrawal and Ramakrishnan Srikant. Fast algorithms for mining association rules. In Proc. of the 20th VLDB Conference, Santiago, Chile, September 1994. [6] Rakesh Agrawal and Ramakrishnan Srikant. Privacy-preserving data mining. In Proc. of the ACM SIGMOD Conference on Management of Data, Dallas, Texas, May 2000. [7] Maurizio Atzori, Francesco Bonchi, Fosca Giannotti, and Dino Pe- dreschi. Blocking anonymity threats raised by frequent itemset mining. In Proc. of the 5th IEEE International Conference on Data Mining (ICDM 2005), Houston, Texas, November 2005. [8] Maurizio Atzori, Francesco Bonchi, Fosca Giannotti, and Dino Pe- dreschi. k-anonymous patterns. In Proc. of the 9th European Confer- ence on Principles and Practice of Knowledge Discovery in Databases (PKDD), Porto, Portugal, October 2005. [9] Maurizio Atzori, Francesco Bonchi, Fosca Giannotti, and Dino Pe- dreschi. Anonymity preserving pattern discovery. VLDB Journal,No- vember 2006. [10] Roberto J. Bayardo and Rakesh Agrawal. Data privacy through optimal k-anonymization. In Proc. of the International Conference on Data En- gineering (ICDE’05), Tokyo, Japan, April 2005. 
[11] Valentina Ciriani, Sabrina De Capitani di Vimercati, Sara Foresti, and Pierangela Samarati. k-anonymity. In T. Yu and S. Jajodia, editors, Se- curity in Decentralized Data Management. Springer, Berlin Heidelberg, 2007. k-Anonymous Data Mining: A Survey 135 [12] Valentina Ciriani, Sabrina De Capitani di Vimercati, Sara Foresti, and Pierangela Samarati. Microdata protection. In T. Yu and S. Jajodia, edi- tors, Security in Decentralized Data Management. Springer, Berlin Hei- delberg, 2007. [13] Alexandre Evfimievski, Ramakrishnan Srikant, Rakesh Agrawal, and Jo- hannes Gehrke. Privacy preserving mining of association rules. In Proc. of the 8th ACM SIGKDD International Conference on Knowledge Dis- covery and Data Mining, Edmonton, Alberta, Canada, July 2002. [14] Federal Committee on Statistical Methodology. Statistical policy work- ing paper 22, May 1994. Report on Statistical Disclosure Limitation Methodology. [15] Arik Friedman, Assaf Schuster, and Ran Wolff. Providing k-anonymity in data mining. VLDB Journal. Forthcoming. [16] Benjamin C.M. Fung, Ke Wang, and Philip S. Yu. Anonymizing classi- fication data for privacy preservation. IEEE Transactions on Knowledge and Data Engineering, 19(5):711–725, May 2007. [17] Michael R. Garey and David S. Johnson Computers and Intractability. W. H. Freeman & Co., New York, NY, USA, 1979. [18] Kristen LeFevre, David J. DeWitt, and Raghu Ramakrishnan. Incognito: efficient full-domain k-anonymity. In Proc. of the ACM SIGMOD Con- ference on Management of Data, Baltimore, Maryland, June 2005. [19] Kristen LeFevre, David J. DeWitt, and Raghu Ramakrishnan. Mondrian multidimensional k-anonymity. In Proc. of the International Conference on Data Engineering (ICDE’06), Atlanta, Georgia, April 2006. [20] Yehuda Lindell and Benny Pinkas. Privacy preserving data mining. Jour- nal of Cryptology, 15(3):177–206, June 2002. [21] Ashwin Machanavajjhala, Johannes Gehrke, and Daniel Kifer. -density: Privacy beyond k-anonymity. In Proc. of the International Conference on Data Engineering (ICDE’06), Atlanta, Georgia, April 2006. [22] Adam Meyerson and Ryan Williams On the complexity of optimal k- anonymity. In Proc. of the 23rd ACM SIGMOD-SIGACT-SIGART Sym- posium on Principles of Database Systems, Paris, France, June 2004. [23] Hyoungmin Park and Kyuseok Shim. Approximate algorithms for k- anonymity. In Proc. of the ACM SIGMOD Conference on Management of Data, Beijing, China, June 2007. [24] Nicolas Pasquier, Yves Bastide, Rafik Taouil, and Lotfi Lakhal. Discov- ering frequent closed itemsets for association rules. In Proc. of the 7th International Conference on Database Theory (ICDT ’99), Jerusalem, Is- rael, January 1999. 136 Privacy-Preserving Data Mining: Models and Algorithms [25] Rajeev Rastogi and Kyuseok Shim. PUBLIC: A decision tree classifier that integrates building and pruning. In Proc. of the 24th VLDB Confer- ence, New York, September 1998. [26] Pierangela Samarati. Protecting respondents’ identities in microdata release. IEEE Transactions on Knowledge and Data Engineering, 13(6):1010–1027, November 2001. [27] Pierangela Samarati and Latanya Sweeney. Generalizing data to provide anonymity when disclosing information (abstract). In Proc. of the 17th ACM-SIGMOD-SIGACT-SIGART Symposium on the Principles of Data- base Systems, page 188, Seattle, WA, 1998. [28] Ramakrishnan Srikant and Rakesh Agrawal. Mining generalized associ- ation rules. In Proc. of the 21th VLDB Conference, Zurich, Switzerland, September 1995. [29] Ke Wang, Philip S. 
Yu, and Sourav Chakraborty. Bottom-up generaliza- tion: A data mining solution to privacy protection. In Proc. of the 4th IEEE International Conference on Data Mining (ICDM 2004), Brighton, UK, November 2004. [30] Zhiqiang Yang, Sheng Zhong, and Rebecca N. Wright. Privacy- preserving classification of customer data without loss of accuracy. In Proc. of the 5th SIAM International Conference on Data Mining,New- port Beach, California, April 2005. [31] Mohammed J. Zaki and Ching-Jui Hsiao. Charm: An efficient algorithm for closed itemset mining. In Proc. of the 2nd SIAM International Con- ference on Data Mining, Arlington, Virginia, April 2002. Chapter 6 A Survey of Randomization Methods for Privacy-Preserving Data Mining Charu C. Aggarwal IBM T. J. Watson Research Center Hawthorne, NY 10532 charu@us.ibm.com Philip S. Yu University of Illinois at Chicago Chicago, IL 60607 psyu@us.ibm.com Abstract A well known method for privacy-preserving data mining is that of random- ization. In randomization, we add noise to the data so that the behavior of the individual records is masked. However, the aggregate behavior of the data dis- tribution can be reconstructed by subtracting out the noise from the data. The reconstructed distribution is often sufficient for a variety of data mining tasks such as classification. In this chapter, we will provide a survey of the random- ization method for privacy-preserving data mining. Keywords: Randomization, privacy quantification, perturbation. 6.1 Introduction In the randomization method, we add noise to the data in order to mask the values of the records. The noise added is sufficiently large so that the individ- ual values of the records can no longer be recovered. However, the probabil- ity distribution of the aggregate data can be recovered and subsequently used for privacy-preservation purposes. The earliest work on randomization may be found in [16, 12], in which it has been used in order to eliminate evasive an- swer bias. In [3] it has been shown how the reconstructed distributions may be 138 Privacy-Preserving Data Mining: Models and Algorithms used for data mining. The specific problem which has been discussed in [3] is that of classification, though the approach can be easily extended to a variety of other problems such as association rule mining [8, 24]. The method of randomization can be described as follows. Consider a set of data records denoted by X = {x1 ...xN}. For record xi ∈ X,weadd a noise component which is drawn from the probability distribution fY(y). These noise components are drawn independently, and are denoted y1 ...yN. Thus, the new set of distorted records are denoted by x1 + y1 ...xN + yN. We denote this new set of records by z1 ...zN. In general, it is assumed that the variance of the added noise is large enough, so that the original record values cannot be easily guessed from the distorted data. Thus, the original records cannot be recovered, but the distribution of the original records can be recovered. We note that the addition of X and Y creates a new distribu- tion Z. We know N instantiations of this new distribution, and can therefore estimate it approximately. Furthermore, since the distribution of Y is publicly known, we can estimate the distribution obtained by subtracting Y from Z. In a later section, we will discuss more accurate strategies for distribution es- timation. Furthermore, the above-mentioned technique is an additive strategy for randomization. 
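As a small illustration of the additive scheme just described (illustrative only; the data values and the noise distribution are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(42)

# Original sensitive values x_1 ... x_N (e.g., ages).
x = rng.normal(40.0, 10.0, size=10_000)

# Independent noise y_1 ... y_N drawn from a publicly known distribution,
# with variance large enough to mask individual values.
y = rng.normal(0.0, 25.0, size=x.size)

# Only the distorted records z_i = x_i + y_i leave the data collector.
z = x + y

# Individual values are masked, but aggregate behavior survives:
print(round(z.mean(), 1))            # close to 40, since E[Y] = 0
print(round(z.var() - 25.0**2, 1))   # roughly Var(X) = 100
```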
In the multiplicative strategy, the records are multiplied by random vectors to produce the final representation of the data. Thus, this approach uses a random-projection style of transformation to achieve privacy preservation. The resulting data can be reconstructed within a certain variance, depending upon the number of components of the multiplicative perturbation.

We note that methods such as randomization add or multiply the noise to the records in a data-independent way. In other methods such as k-anonymity [25], the overall behavior of the records is leveraged in the anonymization process. The data-independence of randomization is very useful from a practical point of view, since it means that the randomization can be performed at data-collection time. Thus, a trusted server is not required (as in k-anonymization) in order to perform the transformations on the records. This is a key advantage of randomization methods, though it comes at the expense of providing no guarantees against re-identification of the data in the presence of public information. Another key property of the randomization method is that the original records are not used after the transformation. Rather, the data mining algorithms use aggregate distributions of the data in order to perform the mining process.

This chapter is organized as follows. In the next section, we will discuss a number of reconstruction methods for randomization. We will also discuss the issue of optimality and utility of randomization methods. In Section 3, we will discuss a number of applications of randomization, and show how the approach can be used for applications such as classification and association rule mining. In Section 4, we will discuss issues surrounding the quantification of privacy in privacy-preserving data mining algorithms. In Section 5, we will discuss a number of adversarial attacks on the randomization method. In Section 6, we discuss applications of the randomization method to the case of time series data. In Section 7, we discuss the method of multiplicative perturbations and its applications to a variety of data mining algorithms. The conclusions and summary are presented in Section 8.

6.2 Reconstruction Methods for Randomization

In this section, we will discuss reconstruction algorithms for the randomization method. We note that the perturbed data distribution Z can be obtained by adding the distributions of the original data X and the perturbation Y. Therefore, we have:

Z = X + Y
X = Z - Y

We note that only the distribution of Y is known explicitly. The distribution of X is unknown, and N instantiations of the probability distribution Z are known. These N instantiations can be used to construct an estimate of the probability distribution Z. When the value of N is large, this estimate can be quite accurate. Once Z is known, we can subtract Y from it in order to obtain the probability distribution of X. For modest values of N, however, the errors in the estimation of Z can be quite large, and these errors may be magnified by the subtraction of Y. Therefore, a more indirect method is desirable in order to estimate the probability distribution of X. A pair of closely related iterative methods has been discussed in [3, 5] for approximating the corresponding probability distributions.
The method in [3] uses Bayes' rule for distribution approximation, whereas that in [5] uses the EM method. In this section, we will describe both methods, beginning with the method in [3].

6.2.1 The Bayes Reconstruction Method

Let \hat{f}_X and \hat{F}_X denote the estimated density function and cumulative distribution function of the reconstructed distribution. Then, we can use the Bayes formula in order to derive an estimate of \hat{F}_X, using the first observed value z_1:

\hat{F}_X(a) = \int_{-\infty}^{a} f_{X_1}(w \mid X_1 + Y_1 = z_1)\, dw   (6.1)

We can expand the above expression using the Bayes rule (in conjunction with the independence of the random variables Y and X) in order to obtain the following expression for \hat{F}_X(a):

\hat{F}_X(a) = \frac{\int_{-\infty}^{a} f_Y(z_1 - w)\, f_X(w)\, dw}{\int_{-\infty}^{\infty} f_Y(z_1 - w)\, f_X(w)\, dw}   (6.2)

We note that the above expression for \hat{F}_X(a) was derived using a single observation z_1. In practice, the average over the multiple observations z_1 ... z_N is used in order to construct the estimated cumulative distribution:

\hat{F}_X(a) = \frac{1}{N} \sum_{i=1}^{N} \frac{\int_{-\infty}^{a} f_Y(z_i - w)\, f_X(w)\, dw}{\int_{-\infty}^{\infty} f_Y(z_i - w)\, f_X(w)\, dw}   (6.3)

The corresponding density can be obtained by differentiating \hat{F}_X(a). This differentiation removes the integral sign from the numerator and instantiates w to a. Therefore, we have:

\hat{f}_X(a) = \frac{1}{N} \sum_{i=1}^{N} \frac{f_Y(z_i - a)\, f_X(a)}{\int_{-\infty}^{\infty} f_Y(z_i - w)\, f_X(w)\, dw}   (6.4)

We note that it is not straightforward to compute \hat{f}_X(\cdot) from the above equation, since the unknown density f_X appears on the right-hand side. This suggests an iterative method: we start off by setting \hat{f}_X to the uniform distribution, and iteratively update it using the equation above. The algorithm for computing \hat{f}_X(a) for a particular value of a is as follows:

  Set \hat{f}_X to be the uniform distribution;
  repeat
    update \hat{f}_X(a) = \frac{1}{N} \sum_{i=1}^{N} \frac{f_Y(z_i - a)\, \hat{f}_X(a)}{\int_{-\infty}^{\infty} f_Y(z_i - w)\, \hat{f}_X(w)\, dw}
  until convergence

We note that we cannot compute the value of \hat{f}_X(a) over all (infinitely many) values of a in a continuous domain. Therefore, we partition the domain of X into a number of intervals [l_1, u_1] ... [l_n, u_n], and assume that the function is uniform over each interval. For each interval [l_i, u_i], the value of a in the above equation is picked to be (l_i + u_i)/2. Thus, in each iteration, we use n different values of a, one for each interval. The density values on the right-hand side are likewise computed at the corresponding interval midpoints.

The algorithm is terminated when the distribution does not change significantly over successive iterations. A chi-square test was used to compare the two distributions: the implementation in [3] terminated the algorithm when the difference between successive estimates fell within 1% of the threshold of the chi-square test. While this algorithm is known to perform effectively in practice, the work in [3] does not prove that it converges. In [5], an Expectation Maximization (EM) algorithm has been proposed which converges to a provably optimal solution. It is also shown in [5] that the Bayes algorithm of [3] is in fact an approximation of this EM algorithm.
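The following is a compact sketch of this iterative update (a simplified illustration rather than the implementation of [3]; the grid resolution, stopping rule, toy data, and uniform noise are assumptions):

```python
import numpy as np

def bayes_reconstruct(z, f_y, grid, n_iter=200, tol=1e-6):
    """Iterative Bayes reconstruction of f_X from perturbed values z = x + y.

    z    : observed perturbed values
    f_y  : callable, publicly known noise density
    grid : midpoints of the discretization intervals of the X-domain
    Returns the estimated density of X on the grid.
    """
    f_x = np.full(grid.shape, 1.0 / (grid[-1] - grid[0]))   # start uniform
    width = grid[1] - grid[0]
    for _ in range(n_iter):
        # denominator: estimate of the integral  f_Y(z_i - w) f_X(w) dw
        denom = (f_y(z[:, None] - grid[None, :]) * f_x[None, :]).sum(axis=1) * width
        denom = np.maximum(denom, 1e-300)
        # Bayes update, averaged over all observations z_i
        new = (f_y(z[:, None] - grid[None, :]) * f_x[None, :] / denom[:, None]).mean(axis=0)
        new /= new.sum() * width                              # renormalize to a density
        if np.abs(new - f_x).max() < tol:
            f_x = new
            break
        f_x = new
    return f_x

# toy usage with uniform noise on [-1, 1]
rng = np.random.default_rng(1)
x = rng.choice([0.25, 4.75], size=5000) + rng.uniform(-0.25, 0.25, 5000)
z = x + rng.uniform(-1.0, 1.0, size=x.shape)
f_y = lambda t: ((t >= -1.0) & (t <= 1.0)) * 0.5
grid = np.linspace(-1.5, 6.5, 160)
est = bayes_reconstruct(z, f_y, grid)
```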
This close relationship to the EM algorithm is one of the reasons why the Bayes method proposed in [3] is so robust in practice.

6.2.2 The EM Reconstruction Method

In this subsection, we will discuss the EM algorithm for distribution reconstruction. Since the function f_X(x) is defined over a continuous domain, we need to parameterize and discretize it for the purpose of any numerical estimation method. We assume that the data domain Omega_X can be discretized into K intervals Omega_1 ... Omega_K, where the union of Omega_1 ... Omega_K equals Omega_X. Let m_i = m(Omega_i) be the length of the interval Omega_i. We assume that f_X(x) is constant over Omega_i, with density value theta_i. Such a form restricts f_X(x) to a class parameterized by the finite set of parameters Theta = {theta_1, theta_2, ..., theta_K}. In order to explicitly denote the parametric dependence of the density function on Theta, we use the notation f_{X;Theta}(x) for the density function of X. Therefore, we have

f_{X;\Theta}(x) = \sum_{i=1}^{K} \theta_i I_{\Omega_i}(x),

where I_{Omega_i}(x) = 1 if x belongs to Omega_i and 0 otherwise. Since f_{X;Theta}(x) is a density, it follows that \sum_{i=1}^{K} \theta_i m(\Omega_i) = 1. By choosing K large enough, density functions of this form can approximate any density function with arbitrary precision.

After this parameterization, the algorithm proceeds to estimate Theta, and thereby determine \hat{f}_{X;\hat{\Theta}}(x). Let \hat{\Theta} = {\hat{\theta}_1, \hat{\theta}_2, ..., \hat{\theta}_K} be the estimate of these parameters produced by the reconstruction algorithm. Given a set of observations Z = z, we would ideally like to find the maximum-likelihood (ML) estimate \Theta_{ML} = argmax_{\Theta} \ln f_{Z;\Theta}(z). The ML estimate has many attractive properties, such as consistency, asymptotic unbiasedness, and asymptotic minimum variance among unbiased estimates. However, it is not always possible to find \Theta_{ML} directly, and this turns out to be the case for the f_{Z;\Theta}(z) given above. In order to achieve this goal, a reconstruction algorithm is derived which fits into the broad framework of Expectation Maximization (EM) algorithms. The algorithm proceeds as if a more comprehensive set of data, say D = d, were observable, and maximizes \ln f_{D;\Theta}(d) over all values of Theta (M-step). Since d is in fact unavailable, it replaces \ln f_{D;\Theta}(d) by its conditional expected value given Z = z and the current estimate of Theta (E-step). The choice of D is made so that the E-step and the M-step are easy to compute. Here, X = x is used as the more comprehensive set of data; as shown below, this choice results in a computationally efficient algorithm. More formally, we define a Q-function as follows:

Q(\Theta, \hat{\Theta}) = E\big[\, \ln f_{X;\Theta}(X) \mid Z = z;\, \hat{\Theta} \,\big]   (6.5)

Thus, Q(Theta, \hat{\Theta}) is the expected value of \ln f_{X;\Theta}(X) computed with respect to f_{X|Z=z;\hat{\Theta}}, the density of X given Z = z and parameter vector \hat{\Theta}. After the initialization of Theta to a nominal value Theta_0, the EM algorithm iterates over the following two steps:

  1. E-step: Compute Q(Theta, Theta_k).
  2. M-step: Update Theta_{k+1} = argmax_{Theta} Q(Theta, Theta_k).

The above discussion provides the general framework of EM algorithms; the actual details of the E-step and the M-step require a problem-specific derivation. Similarly, the precise convergence properties of an EM algorithm are sensitive to the problem and its corresponding derivation. Below, we present the EM algorithm for the reconstruction problem and note that the resulting algorithm has desirable convergence properties. The values of Q(Theta, \hat{\Theta}) during the E-step and the M-step of the reconstruction algorithm are derived in [5].
Theorem 6.1 The value of Q(Theta, \hat{\Theta}) during the E-step of the reconstruction algorithm is given by

Q(\Theta, \hat{\Theta}) = \sum_{i=1}^{K} \psi_i(z; \hat{\Theta}) \ln \theta_i,  where  \psi_i(z; \hat{\Theta}) = \hat{\theta}_i \sum_{j=1}^{N} \frac{\Pr(Y \in z_j - \Omega_i)}{f_{Z;\hat{\Theta}}(z_j)}.

The next result gives the value of Theta that maximizes Q(Theta, \hat{\Theta}).

Theorem 6.2 The value of Theta which maximizes Q(Theta, \hat{\Theta}) during the M-step of the reconstruction algorithm is given by

\theta_i = \frac{\psi_i(z; \hat{\Theta})}{m_i N},  where  \psi_i(z; \hat{\Theta}) = \hat{\theta}_i \sum_{j=1}^{N} \frac{\Pr(Y \in z_j - \Omega_i)}{f_{Z;\hat{\Theta}}(z_j)}.

We are now in a position to describe the EM algorithm for the reconstruction problem:

  1. Initialize theta_i^{(0)} = 1/K for i = 1, 2, ..., K; set k = 0.
  2. Update Theta as follows: theta_i^{(k+1)} = psi_i(z; Theta_k) / (m_i N).
  3. Set k = k + 1.
  4. If the termination criterion is not met, return to Step 2.

One key observation is that the EM algorithm is actually a refined version of the Bayes method discussed in [3]. The key difference between the two methods lies in how the approximation of the values within an interval is treated. While the Bayes method uses the crude estimate of the interval midpoint, the EM algorithm is more refined about it. While the Bayes method has not been shown to provably converge, it has been observed to always converge empirically. On the other hand, the results below show that the EM algorithm converges to a provably optimal solution. The close relationship between the two methods is the reason why the Bayes method is empirically observed to converge to an approximately optimal solution. The termination criterion for this method is based on how much Theta_k has changed since the last iteration. It has been shown in [5] that the EM algorithm converges to the true distribution of the random variable X. We summarize the result as follows:

Theorem 6.3 The EM sequence {Theta^{(k)}} for the reconstruction algorithm converges to the unique maximum likelihood estimate Theta_{ML}.

The above result leads to the following desirable property of the EM algorithm.

Observation 6.2.1 When the number of data observations is very large, the EM algorithm provides zero information loss.

This is because, as the number of observations increases, Theta_{ML} converges to the true parameter vector Theta. Therefore, the original and estimated distributions become the same (subject to the discretization needed for any numerical estimation algorithm), resulting in zero information loss.
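A toy sketch of this update loop, under the piecewise-constant parameterization described above, is given below; the interval grid, the noise CDF, and the synthetic data are assumptions, and the code is an illustration rather than the implementation of [5]:

```python
import numpy as np

def em_reconstruct(z, F_y, edges, n_iter=500, tol=1e-8):
    """EM reconstruction of a piecewise-constant f_X from z = x + y.

    z     : observed perturbed values
    F_y   : callable, CDF of the publicly known noise Y
    edges : interval boundaries Omega_1 ... Omega_K of the X-domain
    Returns the estimated constant density theta_i on each interval.
    """
    K, N = len(edges) - 1, len(z)
    m = np.diff(edges)                                # interval lengths m_i
    theta = np.full(K, 1.0 / (edges[-1] - edges[0]))  # uniform initialization
    # P[j, i] = Pr(Y in z_j - Omega_i) = F_y(z_j - l_i) - F_y(z_j - u_i)
    P = F_y(z[:, None] - edges[None, :-1]) - F_y(z[:, None] - edges[None, 1:])
    for _ in range(n_iter):
        f_z = P @ theta                               # f_{Z;Theta}(z_j)
        psi = theta * (P / np.maximum(f_z, 1e-300)[:, None]).sum(axis=0)   # E-step
        new = psi / (m * N)                           # M-step
        if np.abs(new - theta).max() < tol:
            theta = new
            break
        theta = new
    return theta

# toy usage: Y uniform on [-1, 1]
rng = np.random.default_rng(2)
x = rng.choice([0.5, 4.5], size=5000) + rng.uniform(-0.5, 0.5, 5000)
z = x + rng.uniform(-1.0, 1.0, size=x.shape)
F_y = lambda t: np.clip((t + 1.0) / 2.0, 0.0, 1.0)
edges = np.linspace(0.0, 5.0, 51)
theta = em_reconstruct(z, F_y, edges)
```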
6.2.3 Utility and Optimality of Randomization Models

We note that the use of different perturbing distributions results in different levels of effectiveness of the randomization scheme. A key issue is how the randomization may be performed in order to optimize the tradeoff between privacy and accuracy. Clearly, providing a higher level of accuracy for the same privacy level is desirable from the point of view of maintaining greater utility of the randomized data. In order to achieve this goal, the work in [30] defines a randomization scheme in which the noise added to a given observation depends upon the value of the underlying data record as well as a user-defined parameter. Thus, in this case, the noise is conditional on the value of the record itself. This is a more general and flexible model of the randomization process. We note that this approach still does not depend upon the behavior of the other records, and can therefore be performed at data-collection time. Methods are defined in [30] for reconstructing the data under this kind of randomization. The reconstruction methods proposed in [30] are designed with the use of kernel estimators or iterative EM methods. In [30], a number of information-loss and interval metrics are used to quantify the tradeoff between privacy and optimality. The approach explores the issue of optimizing the information loss within a privacy constraint, or optimizing the privacy within an information-loss constraint. A number of simulations are presented in [30] to illustrate the effectiveness of the approach.

6.3 Applications of Randomization

The randomization method has been extended to a variety of data mining problems. In [3], it was discussed how to use the approach for classification. A number of other techniques [29, 30] have also been proposed which seem to work well over a variety of different classifiers. Techniques have also been proposed for privacy-preserving methods of improving the effectiveness of classifiers. For example, the work in [10] proposes methods for privacy-preserving boosting of classifiers. Methods for privacy-preserving mining of association rules have been proposed in [8, 24]. The problem of association rules is especially challenging because of the discrete nature of the attributes corresponding to the presence or absence of items. In order to deal with this issue, the randomization technique needs to be modified slightly. Instead of adding quantitative noise, random items are dropped or included with a certain probability. The perturbed transactions are then used for aggregate association rule mining. This technique has been shown to be extremely effective in [8]. The randomization approach has also been extended to other applications such as OLAP [4] and SVD-based collaborative filtering [22]. We will discuss many of these techniques below.

We note that a variety of other randomization schemes exist for privacy-preserving data mining. The above-mentioned scheme uses a single perturbing distribution in order to perform the randomization over the entire data. The randomization scheme can be tailored much more effectively by using mixture models [30]. The work in [30] shows that this approach has a number of optimality properties in terms of the quality of the perturbation.

6.3.1 Privacy-Preserving Classification with Randomization

A number of methods have been proposed for privacy-preserving classification with randomization. In [3], a method has been discussed for decision tree classification with the use of the aggregate distributions reconstructed from the randomized data. The key idea is to construct the distributions separately for the different classes. Then, the splitting condition for the decision tree uses the relative presence of the different classes, which is derived from the aggregate distributions. It has been shown in [3] that such an approach can be used to design very effective classifiers. Since the probabilistic behavior is encoded in the aggregate data distributions, it can also be used to construct a naive Bayes classifier. In such a classifier [29], the approach of randomized response with partial hiding is used in order to perform the classification. It has been shown in [29] that this approach is effective both empirically and analytically.
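To make the use of reconstructed class-conditional distributions concrete, the following hedged sketch classifies a value by comparing (assumed, pre-reconstructed) per-class densities; the class names, priors, and densities are invented, and neither the decision-tree construction of [3] nor the randomized-response scheme of [29] is reproduced here:

```python
import numpy as np

def classify_from_distributions(value, class_densities, class_priors, grid):
    """Assign a label by comparing reconstructed class-conditional densities.

    class_densities : dict label -> density of the attribute for that class,
                      evaluated on `grid` (e.g., the output of a reconstruction
                      procedure such as the Bayes/EM methods of Section 6.2).
    """
    idx = np.abs(grid - value).argmin()
    scores = {c: class_priors[c] * class_densities[c][idx] for c in class_densities}
    return max(scores, key=scores.get)

# toy usage with two classes whose reconstructed densities are assumed given
grid = np.linspace(0.0, 10.0, 200)
dens = {
    "low_risk":  np.exp(-0.5 * ((grid - 3.0) / 1.0) ** 2) / np.sqrt(2 * np.pi),
    "high_risk": np.exp(-0.5 * ((grid - 7.0) / 1.5) ** 2) / (1.5 * np.sqrt(2 * np.pi)),
}
priors = {"low_risk": 0.7, "high_risk": 0.3}
print(classify_from_distributions(6.5, dens, priors, grid))   # -> "high_risk"
```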
6.3.2 Privacy-Preserving OLAP

In [4], a randomization algorithm for distributed privacy-preserving OLAP is discussed. In this approach, each client independently perturbs their data before sending it to a centralized server. The technique uses local perturbation, in which the perturbation added to an element depends upon its initial value. A variety of reconstruction techniques are discussed in order to respond to different kinds of queries. The key in such queries is to develop effective algorithms for estimating the counts of different subcubes in the data. Such queries are typical in most OLAP applications. The approach has been shown in [4] to satisfy a number of privacy-breach guarantees.

The method in [4] uses an interesting technique called retention replacement perturbation. In retention replacement perturbation, each element from column j is retained with probability p_j, or replaced with an element drawn from a selected probability density function. It has been shown in [4] that approximate probabilistic reconstructability is possible when at least a certain number of rows are present in the data. Methods have also been devised in [4] to express the estimated query results on the original table as a function of the query results on the perturbed table, and to reconstruct the original distribution, single-column aggregates, and multiple-column aggregates.

Techniques have also been devised in [4] for the perturbation of categorical data sets. In this case, the retention-replacement approach needs to be modified appropriately: an element which is not retained is replaced with a random element of the domain.

6.3.3 Collaborative Filtering

A variety of privacy-preserving collaborative filtering techniques have been discussed in [22, 23]. The collaborative filtering problem arises in the context of electronic commerce, when users choose to leave quantitative feedback (or ratings) about the products which they may like. In the collaborative filtering problem, we wish to predict the ratings of products for a particular user with the use of the ratings of users with similar profiles. Such predictions are useful for making recommendations that the user may like. In [23], a correlation-based collaborative filtering technique with randomization was proposed. In [22], an SVD-based collaborative filtering method was proposed using randomized perturbation techniques. Since the collaborative filtering technique is inherently one in which ratings from multiple users are incorporated, a client-server mechanism is used in order to perform the perturbation. The broad approach of the SVD-based collaborative filtering technique is as follows:

- The server decides on the nature (e.g., uniform or Gaussian) of the perturbing distribution along with the corresponding parameters. These parameters are transmitted to each user.
- Each user computes the mean and z-numbers for their ratings. The entries which are not rated are substituted with the mean for the corresponding ratings and a z-number of 0.
- Each user then adds random noise to all the ratings, and sends the disguised ratings to the server.
- The server receives the ratings from the different users and uses SVD on the disguised matrix in order to make predictions.
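A minimal sketch of the client-side disguise step of this protocol is shown below; the uniform noise and its scale are assumptions standing in for the server-announced distribution, and the function name is hypothetical:

```python
import numpy as np

def disguise_ratings(ratings, noise_scale, rng):
    """Client-side step of the randomized SVD-based collaborative filtering
    protocol sketched above: unrated items are filled with the user's mean
    (z-number 0), and noise from the server-announced distribution is added.

    ratings : 1-D array with np.nan for unrated items
    """
    rated = ~np.isnan(ratings)
    mean = ratings[rated].mean()
    std = ratings[rated].std() or 1.0
    z = np.where(rated, (ratings - mean) / std, 0.0)   # z-numbers; 0 for unrated
    noise = rng.uniform(-noise_scale, noise_scale, size=z.shape)
    return z + noise                                   # disguised vector sent to the server

rng = np.random.default_rng(3)
user = np.array([5.0, np.nan, 3.0, 1.0, np.nan])
print(disguise_ratings(user, noise_scale=1.0, rng=rng))
```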
6.4 The Privacy-Information Loss Tradeoff

The quantity used to measure privacy should indicate how closely the original value of an attribute can be estimated. The work in [3] uses a measure that defines privacy as follows: if the original value can be estimated with c% confidence to lie in the interval [alpha_1, alpha_2], then the interval width (alpha_2 - alpha_1) defines the amount of privacy at the c% confidence level. For example, if the perturbing additive is uniformly distributed in an interval of width 2*alpha, then alpha is the amount of privacy at confidence level 50% and 2*alpha is the amount of privacy at confidence level 100%. However, this simple method of determining privacy can be subtly incomplete in some situations. This is best explained by the following example.

Example 6.4 Consider an attribute X with the density function f_X(x) given by:

f_X(x) = 0.5 for 0 <= x <= 1;  0.5 for 4 <= x <= 5;  0 otherwise.

Assume that the perturbing additive Y is distributed uniformly between [-1, 1]. Then, according to the measure proposed in [3], the amount of privacy is 2 at confidence level 100%.

However, after performing the perturbation and the subsequent reconstruction, the density function f_X(x) will be approximately revealed. Let us assume for a moment that a large amount of data is available, so that the distribution function is revealed to a high degree of accuracy. Since the (distribution of the) perturbing additive is publicly known, the two pieces of information can be combined to determine that if Z is in [-1, 2], then X is in [0, 1], whereas if Z is in [3, 6], then X is in [4, 5]. Thus, in each case, the value of X can be localized to an interval of length 1. This means that the actual amount of privacy offered by the perturbing additive Y is at most 1 at confidence level 100%. We use the qualifier 'at most' since X can often be localized to an interval of length less than one. For example, if the value of Z happens to be -0.5, then the value of X can be localized to the even smaller interval [0, 0.5].

This example illustrates that the method suggested in [3] does not take into account the distribution of the original data. In other words, the (aggregate) reconstruction of the attribute distribution also provides a certain level of knowledge which can be used to guess a data value to a higher level of accuracy. To accurately quantify privacy, we need a method which takes such side information into account.

A key privacy measure [5] is based on the differential entropy of a random variable. The differential entropy h(A) of a random variable A is defined as follows:

h(A) = -\int_{\Omega_A} f_A(a) \log_2 f_A(a)\, da   (6.6)

where Omega_A is the domain of A. It is well known that h(A) is a measure of the uncertainty inherent in the value of A [26]. It can easily be seen that for a random variable U distributed uniformly between 0 and a, h(U) = log_2(a); for a = 1, h(U) = 0.

In [5], it was proposed that 2^{h(A)} is a measure of the privacy inherent in the random variable A. This value is denoted by Pi(A). Thus, a random variable U distributed uniformly between 0 and a has privacy Pi(U) = 2^{log_2(a)} = a. For a general random variable A, Pi(A) denotes the length of the interval over which a uniformly distributed random variable has the same uncertainty as A.

Given a random variable B, the conditional differential entropy of A is defined as follows:

h(A|B) = -\int_{\Omega_{A,B}} f_{A,B}(a, b) \log_2 f_{A|B=b}(a)\, da\, db   (6.7)

Thus, the average conditional privacy of A given B is Pi(A|B) = 2^{h(A|B)}. This motivates the following metric P(A|B) for the conditional privacy loss of A, given B:

P(A|B) = 1 - \Pi(A|B)/\Pi(A) = 1 - 2^{h(A|B)}/2^{h(A)} = 1 - 2^{-I(A;B)},
where I(A;B) = h(A) - h(A|B) = h(B) - h(B|A) is the mutual information between the random variables A and B. Clearly, P(A|B) is the fraction of the privacy of A which is lost by revealing B.

As an illustration, let us reconsider Example 6.4 given above. In this case, the differential entropy of X is given by:

h(X) = -\int_{\Omega_X} f_X(x) \log_2 f_X(x)\, dx = -\int_{0}^{1} 0.5 \log_2 0.5\, dx - \int_{4}^{5} 0.5 \log_2 0.5\, dx = 1

Thus, the privacy of X is Pi(X) = 2^1 = 2. In other words, X has as much privacy as a random variable distributed uniformly in an interval of length 2. The density function of the perturbed value Z is given by f_Z(z) = \int_{-\infty}^{\infty} f_X(v) f_Y(z - v)\, dv. Using f_Z(z), we can compute the differential entropy h(Z) of Z; it turns out that h(Z) = 9/4. Therefore, we have:

I(X;Z) = h(Z) - h(Z|X) = 9/4 - h(Y) = 9/4 - 1 = 5/4

Here, the second equality h(Z|X) = h(Y) follows from the fact that X and Y are independent and Z = X + Y. Thus, the fraction of privacy lost in this case is P(X|Z) = 1 - 2^{-5/4} = 0.5796. Therefore, after revealing Z, X has privacy Pi(X|Z) = Pi(X) * (1 - P(X|Z)) = 2 * (1.0 - 0.5796) = 0.8408. This value is less than 1, since X can be localized to an interval of length less than one for many values of Z.

Given the perturbed values z_1, z_2, ..., z_N, it is in general not possible to reconstruct the original density function f_X(x) with arbitrary precision. The greater the variance of the perturbation, the lower the precision in estimating f_X(x). This constitutes the classic tradeoff between privacy and information loss. We refer to the lack of precision in estimating f_X(x) as information loss. Clearly, a lack of precision in estimating the true distribution will degrade the accuracy of any application that the distribution is used for. The work in [3] uses an application-dependent approach to measure the information loss. For example, for a classification problem, the inaccuracy in distribution reconstruction is measured by examining its effect on the misclassification rate. The work in [5] uses a more direct approach. Let \hat{f}_X(x) denote the density function of X as estimated by a reconstruction algorithm. The metric I(f_X, \hat{f}_X) measures the information loss incurred by a reconstruction algorithm in estimating f_X(x):

I(f_X, \hat{f}_X) = \frac{1}{2}\, E\left[ \int_{\Omega_X} \left| f_X(x) - \hat{f}_X(x) \right| dx \right]   (6.8)

Thus, the proposed metric equals half the expected value of the L1-norm between the original distribution f_X(x) and its estimate \hat{f}_X(x). The information loss I(f_X, \hat{f}_X) lies between 0 and 1; I(f_X, \hat{f}_X) = 0 implies perfect reconstruction of f_X(x), and I(f_X, \hat{f}_X) = 1 implies that there is no overlap between f_X(x) and its estimate \hat{f}_X(x) (see Figure 6.1).

[Figure 6.1. Illustration of the information loss metric: the estimated distribution is somewhat shifted from the original distribution. The information loss is the amount of mismatch between the two curves in terms of area; it is equal to half the sum of the areas of the non-overlapping regions, and also equal to 1 minus the area shared by both curves.]

The proposed metric is universal in the sense that it can be applied to any reconstruction algorithm, since it depends only on the original density f_X(x) and its estimate \hat{f}_X(x).
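As a small numerical illustration of this metric, the following sketch approximates I(f_X, \hat{f}_X) on a grid for a single reconstruction (the metric in [5] additionally takes an expectation over the randomness of the reconstruction); the densities used here are assumptions:

```python
import numpy as np

def information_loss(f_true, f_est, grid):
    """Universal information-loss metric of [5]: half the L1 distance between
    the original density and its reconstruction, approximated on a grid.
    Returns a value in [0, 1]: 0 for perfect reconstruction, 1 for no overlap.
    """
    width = grid[1] - grid[0]
    return 0.5 * np.sum(np.abs(f_true - f_est)) * width

# toy usage: the estimate is a slightly shifted copy of the true density
grid = np.linspace(-5.0, 5.0, 1001)
f_true = np.exp(-0.5 * grid ** 2) / np.sqrt(2 * np.pi)
f_est = np.exp(-0.5 * (grid - 0.5) ** 2) / np.sqrt(2 * np.pi)
print(information_loss(f_true, f_true, grid))  # 0.0 (perfect reconstruction)
print(information_loss(f_true, f_est, grid))   # positive, well below 1
```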
We advocate the use of a universal metric since it is independent of the particular data mining task at hand, and therefore facilitates absolute comparisons between disparate reconstruction algorithms.

6.5 Vulnerabilities of the Randomization Method

In the earlier section on privacy quantification, we illustrated an example in which the reconstructed distribution of the data can be used to reduce the privacy of the underlying data records. In general, a systematic approach can be used to do this in multi-dimensional data sets with the use of spectral filtering or PCA-based techniques [11, 14]. The broad idea in techniques such as PCA [11] is that the correlation structure of the original data can be estimated fairly accurately (in larger data sets) even after noise addition. This is because the noise is added to each dimension independently, and therefore does not affect the expected covariance between different pairs of attributes. Only the variances of the attributes are affected, and the change in variance can be estimated accurately from the public information about the perturbing distribution. To understand this point, consider the case when the noise variable Y_1 is added to the first column X_1, and the noise variable Y_2 is added to the second column X_2. Then, we have:

covariance(X_1 + Y_1, X_2 + Y_2) = covariance(X_1, X_2)
variance(X_1 + Y_1) = variance(X_1) + variance(Y_1)

Both results can be derived by expanding the expressions and using the fact that the covariance of either of {X_1, X_2} with either of {Y_1, Y_2} is zero, and that covariance(Y_1, Y_2) = 0. This is because the noise is assumed to be added independently to each dimension, so the covariance of Y_1 and Y_2 with each other or with the original data columns is zero. Furthermore, the variances of Y_1 and Y_2 are known, since the corresponding distributions are publicly known. This means that the covariance matrix of the original data can be derived from the covariance matrix of the perturbed data by simply modifying the diagonal entries.

Once the covariance matrix of the original data has been estimated, one can try to remove the noise from the data in such a way that the result fits the aggregate correlation structure of the data. For example, the data is expected to be distributed along the eigenvectors of this covariance matrix, with the variance along each eigenvector given by the corresponding eigenvalue. Since real data usually shows considerable skew in the eigenvalue structure, it is often the case that an entire data set of a few hundred dimensions can be captured in a subspace spanned by fewer than 20 to 30 eigenvectors. In such cases, data points which deviate significantly from this much lower-dimensional subspace can be projected back onto it in order to approximate the original data. It has been shown in [11] that such an approach can reconstruct the data quite accurately. Furthermore, we note that the accuracy of this kind of approach increases with the size of the data set, and with the gap between the intrinsic dimensionality and the full dimensionality of the data set. A related method in [14] uses spectral filtering in order to reconstruct the data accurately. It has been shown that such techniques can reduce the privacy of the perturbation process significantly, since the noise removal results in values which are fairly close to their original values [11, 14].
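The following simplified sketch illustrates this style of PCA-based filtering attack; the data generator, dimensions, and known noise variance are assumptions, and the spectral filtering of [11, 14] is considerably more refined than this fragment:

```python
import numpy as np

rng = np.random.default_rng(4)

# Low-intrinsic-dimensional data: 3 latent factors embedded in 30 dimensions.
n, d, k = 5000, 30, 3
latent = rng.normal(size=(n, k))
basis = rng.normal(size=(k, d))
X = latent @ basis
Z = X + rng.normal(scale=1.0, size=X.shape)          # independent additive noise

# Attacker's side: estimate the covariance of X from Z by subtracting the
# (publicly known) noise variance from the diagonal, then project the
# perturbed records onto the dominant eigenvectors.
cov_x_est = np.cov(Z, rowvar=False) - np.eye(d) * 1.0
eigvals, eigvecs = np.linalg.eigh(cov_x_est)
top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]      # top-k principal directions
X_rec = (Z - Z.mean(0)) @ top @ top.T + Z.mean(0)    # PCA-filtered reconstruction

print("noise std   :", np.std(Z - X))                # about 1.0
print("residual std:", np.std(X_rec - X))            # substantially smaller
```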
Such spectral approaches are particularly effective in cases where the intrinsic dimensionality of the data is much lower than its full dimensionality. It has been shown in [11] that adding the noise along the eigenvectors of the data is safer from the point of view of privacy preservation, because the discrepancy between the behavior of individual randomized points and the correlation structure of the data can then no longer be exploited for reconstruction. Some other discussions on limiting breaches of privacy in the randomization method may be found in [7].

A second kind of adversarial attack makes use of public information [1]. While the PCA approach is good for value reconstruction, it does not say much about identifying the subject of a record. Both value reconstruction and subject identification are required in adversarial attacks. For this purpose, it is possible to use public data in order to try to determine the identity of the subject. Consider a record X = (x_1 ... x_d), which is perturbed to Z = (z_1 ... z_d). Since the distribution of the perturbations is known, we can use a maximum-likelihood fit of the potential perturbation of Z to a public record. Consider the public record W = (w_1 ... w_d). The potential perturbation of Z with respect to W is given by (Z - W) = (z_1 - w_1 ... z_d - w_d), and each of these values (z_i - w_i) should fit the distribution f_Y(y). The corresponding log-likelihood fit is \sum_{i=1}^{d} \log f_Y(z_i - w_i). The higher the log-likelihood fit, the greater the probability that the record W corresponds to X. If it is known that the public data set always includes X, then the maximum-likelihood fit can identify the correct record with a high degree of certainty, especially in cases where d is large.

Another result in [1] suggests that the choice of the perturbing distribution can have significant effects on the privacy of the underlying data. For example, the use of uniform perturbations is experimentally shown to be more effective in the low-dimensional case, whereas Gaussian perturbations are more effective in the high-dimensional case. The work in [1] characterizes the amount of perturbation required for a particular dimensionality with each kind of perturbing distribution: for Gaussian perturbations, the standard deviation needs to increase with the square root of the implicit dimensionality, while for uniform perturbations, the standard deviation needs to increase at least linearly with the implicit dimensionality. In either case, both kinds of perturbations tend to become ineffective with increasing dimensionality.
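A minimal sketch of this maximum-likelihood linkage attack is given below; the Gaussian noise, the synthetic public database, and the function name are assumptions made for illustration:

```python
import numpy as np

def best_matching_public_record(z, public_records, log_f_y):
    """Maximum-likelihood linkage attack sketched above: score each public
    record w by the log-likelihood that z - w was drawn from the known
    noise density f_Y, and return the best-scoring candidate.
    """
    scores = [log_f_y(z - w).sum() for w in public_records]
    return int(np.argmax(scores)), scores

# toy usage with Gaussian noise of known standard deviation sigma
sigma = 1.0
log_f_y = lambda t: -0.5 * (t / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))
rng = np.random.default_rng(5)
d = 30
public = rng.normal(size=(100, d))                   # public database W
x = public[42]                                       # the victim is in the public data
z = x + rng.normal(scale=sigma, size=x.shape)        # released perturbed record
print(best_matching_public_record(z, public, log_f_y)[0])   # typically prints 42
```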
6.6 Randomization of Time Series Data Streams

The randomization approach is particularly well suited to privacy-preserving data mining of streams, since the noise added to a given record is independent of the rest of the data. However, streams provide a particularly vulnerable target for adversarial attacks with the use of PCA-based techniques [11], because of the large volume of data available for analysis. In addition, there are typically auto-correlations among the different components of a series. Such auto-correlations can also be used for reconstruction purposes.

In [28], an interesting randomization technique has been proposed which uses the correlations and auto-correlations among the different time series when deciding the noise to be added to any particular value. The key idea for the case of correlated noise is to use an idea similar to that of [11]: principal component analysis is used to determine the directions in which the second-order correlations are zero. These principal components are the eigenvectors of the covariance matrix of the data. The noise is then added along these principal components (or eigenvectors) rather than along the original axes. This makes it extremely difficult to reconstruct the data using correlation analysis. This approach is effective for the case of correlations across multiple streams, but not for auto-correlations within a single stream. In the case of dynamic auto-correlations, we are dealing with correlations within a single stream at different local time instants. Such correlations can also be removed by treating a window of the stream at a time, and performing the principal component analysis on all the components of the window. Thus, we are using essentially the same idea, except that we use multiple time instants of the same stream to construct the covariance matrix. The two ideas can in fact be combined when there are both correlations and auto-correlations, by using multiple time instants from all streams in order to create one covariance matrix. This will also capture correlations between different streams at slightly displaced time instants. Such situations are referred to as lag correlations, and are quite common in data streams, where slight changes in one stream precede changes in another because of a common cause.

In many cases, the directions of correlation may change over time. If a static approach is used for randomization, then changes in the correlation structure will result in a risk of the data becoming exposed over time, once the principal components have changed sufficiently. Therefore, the technique in [28] is designed to dynamically adjust the directions of correlation as more and more points from the data stream are received. It has been shown in [28] that such an approach is more robust, since the noise correlates with the stream behavior, and it is more difficult to mount effective adversarial attacks based on correlation analysis.
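The following fragment sketches the basic idea of correlating the noise with the stream behavior by sampling it in the principal-component basis of the current window; it is a simplification of [28], which additionally tracks the principal directions incrementally and handles auto- and lag correlations through windowed embedding:

```python
import numpy as np

def correlated_noise(window, scale, rng):
    """Generate noise whose correlation structure tracks the current stream
    window (a simplified sketch in the spirit of [28]): noise is sampled in
    the window's principal-component basis, scaled by the per-component
    spread, and mapped back to the original space.
    """
    centered = window - window.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    std_along_pc = np.sqrt(np.maximum(eigvals, 0.0))
    noise_pc = rng.normal(size=window.shape) * (scale * std_along_pc)
    return noise_pc @ eigvecs.T                       # back to the original axes

# toy usage: two strongly correlated streams, perturbed one window at a time
rng = np.random.default_rng(6)
t = np.arange(200)
window = np.column_stack([np.sin(t / 10.0),
                          np.sin(t / 10.0) + 0.05 * rng.normal(size=t.size)])
perturbed = window + correlated_noise(window, scale=0.3, rng=rng)
```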
6.7 Multiplicative Noise for Randomization

The most common method of randomization is that of additive perturbation. However, multiplicative perturbations can also be used to good effect for privacy-preserving data mining. Many of these techniques derive their roots from the work of [13], which shows how to use multi-dimensional projections in order to reduce the dimensionality of the data. This technique preserves the inter-record distances approximately, and therefore the transformed records can be used in conjunction with a variety of distance-intensive data mining applications. In particular, the approach is discussed in detail in [20, 21], where it is shown how to use the method for privacy-preserving clustering. The technique can also be applied to the problem of classification, as discussed in [28]. We note that both clustering and classification are locality-specific problems, and are therefore particularly well suited to the multiplicative perturbation technique.

One key difference between the use of additive and multiplicative perturbations is that in the former case we can reconstruct only aggregate distributions, whereas in the latter case more record-specific information (e.g., distances) is preserved. Therefore, the latter technique is often friendlier to different kinds of data mining techniques.

Multiplicative perturbations can also be used for distributed privacy-preserving data mining; details can be found in [17]. In [17], a number of key assumptions have also been discussed which ensure that privacy is preserved. These assumptions concern the level of privacy when the attacker knows partial characteristics of the algorithm used to perform the transformation, or other statistics associated with the transformation. The effects of using special kinds of data (e.g., boolean data) are also discussed. A number of techniques for multiplicative perturbation in the context of masking census data may be found in [15]. A variation on this theme may be implemented with the use of distance-preserving Fourier transforms, which work effectively for a variety of cases [19].

6.7.1 Vulnerabilities of Multiplicative Randomization

As in the case of additive perturbations, multiplicative perturbations are not entirely safe from adversarial attacks. In general, if the attacker has no prior knowledge of the data, then it is relatively difficult to attack the privacy of the transformation. However, with some prior knowledge, two kinds of attacks are possible [18]:

- Known input-output attack: In this case, the attacker knows some linearly independent collection of records, along with their corresponding perturbed versions. In such cases, linear algebra techniques can be used to reverse-engineer the privacy-preserving transformation (a small sketch is given after this list). The number of records required depends upon the dimensionality of the data and the available records. The probability of a privacy breach for a given sample size is characterized in [18].

- Known sample attack: In this case, the attacker has a collection of independent data samples from the same distribution from which the original data was drawn. In such cases, principal component analysis can be used in order to reconstruct the behavior of the original data. One can then try to determine how the current random projection of the data relates to this principal component structure. This can provide an approximate idea of the corresponding geometric transformation.

One observation is that both of the above-mentioned techniques require many more samples (or much more background knowledge) to work effectively in the high-dimensional case. Thus, random projection techniques should generally be used for high-dimensional data, and only a smaller number of projections should be retained in order to preserve privacy. Thus, as with the additive perturbation technique, the multiplicative technique is not completely secure from attacks. A key research direction is to use a combination of additive and multiplicative perturbation techniques in order to construct more robust privacy-preservation techniques.
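The sketch below illustrates the known input-output attack in the simplest full-rank case, where the number of known linearly independent records equals the dimensionality; the data and dimensions are assumptions, and [18] analyzes the more general setting:

```python
import numpy as np

rng = np.random.default_rng(7)
d = 5
R, _ = np.linalg.qr(rng.normal(size=(d, d)))          # secret orthonormal perturbation

X_known = rng.normal(size=(d, d))                     # d linearly independent known inputs
Y_known = R @ X_known                                 # their observed perturbed versions

# Known input-output attack: with d independent pairs the transformation is
# fully determined, so the attacker solves R_hat @ X_known = Y_known.
R_hat = Y_known @ np.linalg.inv(X_known)
print(np.allclose(R_hat, R))                          # True

# Any other perturbed record can now be inverted exactly.
x_secret = rng.normal(size=d)
print(np.allclose(np.linalg.solve(R_hat, R @ x_secret), x_secret))   # True
```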
6.7.2 Sketch-Based Randomization

A closely related approach to the use of multiplicative perturbations is sketch-based randomization. In sketch-based randomization [2], sketches are used to construct the randomization from the data set. We note that sketches are a special case of multiplicative perturbation techniques, in the sense that the individual components of the multiplicative vector are drawn from {-1, +1}. Sketches are particularly useful for sparse data such as text or binary data, in which most components are zero and only a few components are non-zero. Furthermore, sketches are designed in such a way that many aggregate properties, such as the dot product, can be estimated very accurately from a small number of sketch components. Since text and market basket data are both high-dimensional, the use of random projections is particularly effective from the point of view of resisting adversarial attacks. In [2], it has been shown how the method of sketches can be used in order to perform effective privacy-preserving data mining of text and market basket data.

It is possible to use sketches to create a scheme which is similar to randomization in the sense that the transformation of a given record can be performed at data-collection time. It is possible to control the anonymization in such a way that the absolute variance of the randomization scheme is preserved. If desired, it is also possible to use sketches to add noise so that records cannot easily be distinguished from their k nearest neighbors. This is a model similar to the k-anonymity model, but it comes at the expense of using a trusted server for anonymization.

6.8 Conclusions and Summary

In this chapter, we discussed the randomization method for privacy-preserving data mining. We discussed a number of different reconstruction algorithms for randomization, such as the Bayes method and the EM technique. The EM reconstruction algorithm also exhibits a number of optimality properties with respect to its convergence to the maximum likelihood estimate of the data distribution. We also discussed a number of variants of the perturbation technique, such as the method of multiplicative perturbations. A number of applications of the randomization method were discussed over a variety of data mining problems.

References

[1] Aggarwal C. C.: On Randomization, Public Information and the Curse of Dimensionality. ICDE Conference, 2007.
[2] Aggarwal C. C., Yu P. S.: On Privacy-Preservation of Text and Sparse Binary Data with Sketches. SIAM Conference on Data Mining, 2007.
[3] Agrawal R., Srikant R.: Privacy-Preserving Data Mining. Proceedings of the ACM SIGMOD Conference, 2000.
[4] Agrawal R., Srikant R., Thomas D.: Privacy-Preserving OLAP. Proceedings of the ACM SIGMOD Conference, 2005.
[5] Agrawal D., Aggarwal C. C.: On the Design and Quantification of Privacy-Preserving Data Mining Algorithms. ACM PODS Conference, 2002.
[6] Chen K., Liu L.: Privacy-preserving data classification with rotation perturbation. ICDM Conference, 2005.
[7] Evfimievski A., Gehrke J., Srikant R.: Limiting Privacy Breaches in Privacy Preserving Data Mining. ACM PODS Conference, 2003.
[8] Evfimievski A., Srikant R., Agrawal R., Gehrke J.: Privacy-Preserving Mining of Association Rules. ACM KDD Conference, 2002.
[9] Fienberg S., McIntyre J.: Data Swapping: Variations on a Theme by Dalenius and Reiss. Technical Report, National Institute of Statistical Sciences, 2003.
[10] Gambs S., Kegl B., Aimeur E.: Privacy-Preserving Boosting. Knowledge Discovery and Data Mining Journal, to appear.
[11] Huang Z., Du W., Chen B.: Deriving Private Information from Randomized Data. ACM SIGMOD Conference, pp. 37–48, 2005.
[12] Warner S. L.: Randomized Response: A survey technique for eliminating evasive answer bias. Journal of the American Statistical Association, 60(309):63–69, March 1965.
[13] Johnson W., Lindenstrauss J.: Extensions of Lipschitz Mappings into a Hilbert Space. Contemporary Mathematics, vol. 26, pp. 189–206, 1984.
[14] Kargupta H., Datta S., Wang Q., Sivakumar K.: On the Privacy Preserving Properties of Random Data Perturbation Techniques. ICDM Conference, pp. 99–106, 2003.
[15] Kim J., Winkler W.: Multiplicative Noise for Masking Continuous Data. Technical Report Statistics 2003-01, Statistical Research Division, US Bureau of the Census, Washington D.C., April 2003.
[16] Liew C. K., Choi U. J., Liew C. J.: A data distortion by probability distribution. ACM TODS, 10(3):395–411, 1985.
[17] Liu K., Kargupta H., Ryan J.: Random Projection Based Multiplicative Data Perturbation for Privacy Preserving Distributed Data Mining. IEEE Transactions on Knowledge and Data Engineering, 18(1), 2006.
[18] Liu K., Giannella C., Kargupta H.: An Attacker's View of Distance Preserving Maps for Privacy-Preserving Data Mining. PKDD Conference, 2006.
[19] Mukherjee S., Chen Z., Gangopadhyay S.: A privacy-preserving technique for Euclidean distance-based mining algorithms using Fourier based transforms. VLDB Journal, 2006.
[20] Oliveira S. R. M., Zaiane O.: Privacy Preserving Clustering by Data Transformation. Proc. 18th Brazilian Symposium on Databases, pp. 304–318, October 2003.
[21] Oliveira S. R. M., Zaiane O.: Data Perturbation by Rotation for Privacy-Preserving Clustering. Technical Report TR04-17, Department of Computing Science, University of Alberta, Edmonton, AB, Canada, August 2004.
[22] Polat H., Du W.: SVD-based collaborative filtering with privacy. ACM SAC Symposium, 2005.
[23] Polat H., Du W.: Privacy-preserving collaborative filtering with randomized perturbation techniques. ICDM Conference, 2003.
[24] Rizvi S., Haritsa J.: Maintaining Data Privacy in Association Rule Mining. VLDB Conference, 2002.
[25] Samarati P.: Protecting Respondents' Identities in Microdata Release. IEEE Transactions on Knowledge and Data Engineering, 13(6):1010–1027, 2001.
[26] Shannon C. E.: The Mathematical Theory of Communication. University of Illinois Press, 1949.
[27] Silverman B. W.: Density Estimation for Statistics and Data Analysis. Chapman and Hall, 1986.
[28] Li F., Sun J., Papadimitriou S., Mihaila G., Stanoi I.: Hiding in the Crowd: Privacy Preservation on Evolving Streams through Correlation Tracking. ICDE Conference, 2007.
[29] Zhang P., Tong Y., Tang S., Yang D.: Privacy-Preserving Naive Bayes Classifier. Lecture Notes in Computer Science, Vol. 3584, 2005.
[30] Zhu Y., Liu L.: Optimal Randomization for Privacy-Preserving Data Mining. ACM KDD Conference, 2004.

Chapter 7

A Survey of Multiplicative Perturbation for Privacy-Preserving Data Mining

Keke Chen
College of Computing, Georgia Institute of Technology
kekechen@cc.gatech.edu

Ling Liu
College of Computing, Georgia Institute of Technology
lingliu@cc.gatech.edu

Abstract: The major challenge of data perturbation is to achieve the desired balance between the level of privacy guarantee and the level of data utility. Data privacy and data utility are commonly considered as a pair of conflicting requirements in privacy-preserving data mining systems and applications.
Multiplicative perturbation algorithms aim at improving data privacy while maintaining the desired level of data utility by selectively preserving the mining task and model specific information during the data perturbation process. By preserving the task and model specific information, a set of "transformation-invariant data mining models" can be applied to the perturbed data directly, achieving the required model accuracy. Often a multiplicative perturbation algorithm may find multiple data transformations that preserve the required data utility, so the next major challenge is to find a good transformation that also provides a satisfactory level of privacy guarantee. In this chapter, we review three representative multiplicative perturbation methods: rotation perturbation, projection perturbation, and geometric perturbation, and discuss the technical issues and research challenges. We first describe the mining task and model specific information for a class of data mining models, and the transformations that can (approximately) preserve this information. We then discuss the design of appropriate privacy evaluation models for multiplicative perturbations, and give an overview of how the privacy evaluation model is used to measure the level of privacy guarantee in the context of different types of attacks.

Keywords: Multiplicative perturbation, random projection, sketches.

7.1 Introduction

Data perturbation refers to a data transformation process typically performed by the data owners before publishing their data. The goal of performing such a data transformation is two-fold. On one hand, the data owners want to change the data in a certain way in order to disguise the sensitive information contained in the published datasets; on the other hand, the data owners want the transformation to best preserve those domain-specific data properties that are critical for building meaningful data mining models, thus maintaining mining-task-specific data utility of the published datasets.

Data perturbation techniques are among the most popular models for privacy-preserving data mining. They are especially useful for applications where data owners want to participate in cooperative mining but at the same time want to prevent the leakage of privacy-sensitive information in their published datasets. Typical examples include publishing microdata for research purposes or outsourcing the data to third-party data mining service providers. Several perturbation techniques have been proposed to date [4, 1, 8, 3, 13, 14, 26, 35], among which the most popular is the randomization approach, which focuses on single-dimensional perturbation and assumes independence between data columns [4, 13]. Only recently has the data management community seen development of multi-dimensional data perturbation techniques, such as the condensation approach using the k-nearest neighbor (kNN) method [1], multi-dimensional k-anonymization using kd-trees [24], and the multiplicative data perturbation techniques [31, 8, 28, 9]. Compared to single-column-based data perturbation techniques, which assume data columns to be independent and focus on developing single-dimensional perturbation techniques, multi-dimensional data perturbation aims at perturbing the data while preserving the multi-dimensional information with respect to inter-column dependency and distribution. In this chapter, we will discuss multiplicative data perturbations.
This category includes three particular types of perturbation techniques: rotation perturbation, projection perturbation, and geometric perturbation. Compared to other multi-dimensional data perturbation methods, these perturbations exhibit unique properties for privacy-preserving data classification and data clustering. They all preserve (or approximately preserve) distances or inner products, which are important to many classification and clustering models. As a result, classification and clustering models built on data perturbed through multiplicative data perturbation show accuracy similar to those built on the original data. The main challenge for multiplicative data perturbations is thus how to maximize the desired data privacy. In contrast, many other data perturbation techniques focus on seeking a better trade-off between the level of data utility and accuracy preserved and the level of data privacy guaranteed.

7.1.1 Data Privacy vs. Data Utility

Perturbation techniques are often evaluated with two basic metrics: the level of privacy guarantee and the level of model-specific data utility preserved, which is often measured by the loss of accuracy for data classification and data clustering. An ultimate goal for all data perturbation algorithms is to optimize the data transformation process by maximizing both the data privacy and the data utility achieved. However, the two metrics typically represent conflicting goals in many existing perturbation techniques [4, 3, 12, 1].

Data privacy is commonly measured by the difficulty of estimating the original data from the perturbed data. Given a data perturbation technique, the higher the level of difficulty in estimating the original values from the perturbed data, the higher the level of data privacy the technique supports. In [4], the variance of the added random noise is used as the measure of this difficulty, as traditionally done in statistical data distortion [23]. However, recent research [12, 3] reveals that the variance of the noise is not an effective indicator for random noise addition. In addition, [22] shows that the level of data privacy guaranteed is also bounded by the types of attacks that can reconstruct the original data from the perturbed data and the noise distribution. k-Anonymization is another popular way of measuring the level of privacy, originally proposed for relational databases [34]: the original data record can only be estimated to within a k-record group, assuming that each record in the k-record group is equally protected. However, a recent study [29] shows that the privacy evaluation of k-anonymized records is far more complicated than this simple k-anonymization assumption.

Data utility typically refers to the amount of mining-task/model-specific critical information preserved about the dataset after perturbation. Different data mining tasks, such as classification versus association rule mining, or different models for the same task, such as the decision tree model versus the k-nearest-neighbor (kNN) classifier for classification, typically utilize different sets of data properties. For example, the task of building decision trees primarily concerns the column distributions. Hence, the quality of preserving the column distribution should be the key data utility to be maintained in perturbation techniques for the decision tree model, as shown in the randomization approach [4].
In comparison, the kNN model relies heavily on the distance relationship, which is quite different from the column distribution. Furthermore, such task/model-specific information is often multidimensional: many classification models concern multidimensional information rather than single-column distributions. Multi-dimensional perturbation techniques that focus on preserving the model-specific multidimensional information will be more effective for these models.

It is also interesting to note that the data privacy metric and the data utility metric are often contradictory rather than complementary in many existing data perturbation techniques [4, 3, 12, 1]. Typically, data perturbation algorithms that aim at maximizing the level of data privacy have to bear higher information loss. The intrinsic correlation between data privacy and data utility raises a number of important issues regarding how to find the right balance between the two measures. In summary, we identify three important design principles for multiplicative data perturbations. First, preserving the mining task and model-specific data properties is critical for providing better quality guarantees on both privacy and model accuracy. Second, it is beneficial if data perturbation can effectively preserve the task/model-specific data utility information and avoid the need to develop special mining algorithms that operate on the perturbed data, as random noise addition requires. Third and most importantly, if one can develop a data perturbation technique that does not induce any loss of mining-task/model-specific data utility, this will enable us to focus on optimizing perturbation algorithms by maximizing the level of data privacy against attacks, which ultimately leads to better overall quality of both data privacy and data utility.

7.1.2 Outline

In the remainder of the chapter, we will first give the definition of multiplicative perturbation in Section 7.2. Specifically, we categorize multiplicative perturbations into three categories: rotation perturbation, projection perturbation, and geometric perturbation. Rotation perturbation is often criticized as not being resilient to attacks, while geometric perturbation is a direct enhancement of rotation perturbation which adds more components, such as translation perturbation and noise addition, to the original rotation perturbation. Both rotation perturbation and geometric perturbation keep the dimensionality of the dataset unchanged, while projection perturbation reduces the dimensionality and thus incurs more error in distance or inner product calculations.

One of the unique features that distinguishes multiplicative perturbations from other perturbations is that they provide strong guarantees on data utility in terms of data classification and clustering. Since many data mining models utilize distances or inner products, as long as such information is preserved, models trained on perturbed data will have accuracy similar to those trained on the original data. In Section 7.3, we define transformation-invariant classifiers and clustering models, the representative models to which multiplicative perturbations are applied.

Evaluation of the privacy guarantee is an important component in the analysis of multiplicative perturbation. In Section 7.4, we review a set of privacy metrics specifically designed for multiplicative perturbations.
We argue that in multidimensional perturbation the values of multiple columns should be perturbed together, and the evaluation metrics should be unified across all columns. We also describe a general framework for privacy evaluation of multiplicative data perturbation that incorporates attack analysis. We argue that attack analysis is a necessary step in order to accurately evaluate the privacy guarantee of any particular perturbation. In Section 7.5, we review a selection of known attacks on multiplicative perturbations, organized by the level of the attacker's knowledge about the original dataset. By incorporating attack analysis under the general framework of privacy evaluation, a randomized perturbation optimization is developed and described in Section 7.5.5.

7.2 Definition of Multiplicative Perturbation

We first describe the notation used in this chapter, and then describe three categories of multiplicative perturbations and their basic characteristics.

7.2.1 Notations

In privacy-preserving data mining, either a portion of or the entire dataset will be perturbed and then exported. For example, in classification the training data is exported and the testing data might be exported too, while in clustering the entire data for clustering is exported. Suppose that $X$ is the exported dataset consisting of $N$ data rows (records) and $d$ columns (attributes, or dimensions). For presentation convenience, we use $X_{d\times N}$, $X = [x_1 \ldots x_N]$, to denote the dataset, where a column $x_i$ ($1 \le i \le N$) is a data tuple, representing a vector in the real space $\mathbb{R}^d$. In classification, each data tuple $x_i$ also belongs to a predefined class, which is indicated by the class label attribute $y_i$. The class label can be nominal (or continuous for regression) and is public, i.e., privacy-insensitive. For clarity of presentation, we can also consider $X$ to be a sample dataset drawn from the $d$-dimensional random vector $\mathbf{X} = [X_1, X_2, \ldots, X_d]^T$. As a convention, we use bold lower case to represent vectors, bold upper case to represent random variables, and upper case to represent matrices or datasets.

7.2.2 Rotation Perturbation

This category does not cover traditional "rotations" only; literally, it includes all orthonormal perturbations. A rotation perturbation is defined as the following function $G(X)$:

$G(X) = RX$

The matrix $R_{d\times d}$ is an orthonormal matrix [32], which has the following properties. Let $R^T$ represent the transpose of $R$, $r_{ij}$ represent the $(i,j)$ element of $R$, and $I$ be the identity matrix. The rows and columns of $R$ are orthonormal, i.e., for any column $j$, $\sum_{i=1}^{d} r_{ij}^2 = 1$, and for any two columns $j$ and $k$, $j \ne k$, $\sum_{i=1}^{d} r_{ij} r_{ik} = 0$. A similar property holds for the rows. This definition implies that

$R^T R = R R^T = I$

It also implies that by changing the order of the rows or columns of an orthogonal matrix, the resulting matrix is still orthogonal. A random orthonormal matrix can be efficiently generated following the Haar distribution [33].

A key feature of rotation transformation is that it preserves the Euclidean distance of multi-dimensional points. Let $x^T$ represent the transpose of vector $x$, and $\|x\| = \sqrt{x^T x}$ represent the length of a vector $x$. By the definition of the rotation matrix, we have

$\|Rx\| = \|x\|$

Similarly, the inner product is also invariant to rotation. Let $\langle x, y\rangle = x^T y$ represent the inner product of $x$ and $y$. We have

$\langle Rx, Ry\rangle = x^T R^T R y = \langle x, y\rangle$

In general, rotation also preserves geometric shapes such as hyperplanes and hyper-curved surfaces in the multidimensional space [7].
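To make the rotation perturbation concrete, the following is a minimal NumPy sketch (not taken from the chapter or from [7, 8]): it samples an approximately Haar-distributed orthonormal matrix via QR decomposition of a Gaussian matrix, applies $G(X) = RX$, and numerically checks that a pairwise distance and an inner product are preserved. The dataset, dimensions, and variable names are illustrative assumptions.

```python
import numpy as np

def random_orthonormal(d, rng):
    """Sample a d x d orthonormal matrix.

    QR decomposition of a Gaussian matrix, with a sign correction taken from
    the diagonal of the R factor, is a standard way to draw approximately
    from the Haar distribution over orthogonal matrices.
    """
    A = rng.standard_normal((d, d))
    Q, R = np.linalg.qr(A)
    return Q * np.sign(np.diag(R))   # fix the sign of each column

rng = np.random.default_rng(0)
d, N = 5, 100
X = rng.standard_normal((d, N))      # columns are data tuples x_1 ... x_N

R = random_orthonormal(d, rng)
Y = R @ X                            # rotation perturbation G(X) = R X

# Distance and inner product are (numerically) invariant under R.
orig_dist = np.linalg.norm(X[:, 0] - X[:, 1])
pert_dist = np.linalg.norm(Y[:, 0] - Y[:, 1])
print(abs(orig_dist - pert_dist) < 1e-10)                    # True
print(abs(X[:, 0] @ X[:, 1] - Y[:, 0] @ Y[:, 1]) < 1e-10)    # True
```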
We observe that since many classifiers look for geometric decision boundaries, such as hyperplanes and hyper-surfaces, rotation transformation preserves the most critical information for many classification models. There are two ways to apply rotation perturbation: we can either apply it to the whole dataset $X$ [8], or group the columns into pairs and apply different rotation perturbations to different pairs of columns [31].

7.2.3 Projection Perturbation

Projection perturbation refers to the technique of projecting a set of data points from a high-dimensional space to a randomly chosen lower-dimensional subspace. Let $P_{k\times d}$ be a projection matrix; then

$G(X) = PX$

Why can this also be used for perturbation? The rationale is based on the Johnson-Lindenstrauss Lemma [21].

Theorem 1 For any $0 < \epsilon < 1$ and any integer $n$, let $k$ be a positive integer such that $k \ge \frac{4\ln n}{\epsilon^2/2 - \epsilon^3/3}$. Then, for any set $S$ of $n$ data points in the $d$-dimensional space $\mathbb{R}^d$, there is a map $f: \mathbb{R}^d \to \mathbb{R}^k$ such that, for all $x, x' \in S$,

$(1-\epsilon)\|x - x'\|^2 \le \|f(x) - f(x')\|^2 \le (1+\epsilon)\|x - x'\|^2$

where $\|\cdot\|$ denotes the vector 2-norm.

This lemma shows that any set of $n$ points in $d$-dimensional Euclidean space can be embedded into an $O(\log n / \epsilon^2)$-dimensional space such that the pairwise distance between any two points is maintained with small error. With large $n$ (a large dataset) and small $\epsilon$ (high accuracy in distance preservation), the required dimensionality might be large and may not be practical for the perturbation purpose. Furthermore, although this lemma implies that we can always find one good projection that approximately preserves distances for a particular dataset, the geometric decision boundary might still be distorted and thus the model accuracy is reduced. Due to the different distributions of datasets and the particular properties of data mining models, it is challenging to develop an algorithm that can find random projections that preserve model accuracy well for any given dataset.

In [28], a method for generating the random projection matrix is used. The process can be briefly described as follows. Let $P$ be the projection matrix. Each entry $r_{i,j}$ of $P$ is independently and identically chosen from some distribution with mean zero and variance $\sigma^2$. A row-wise projection is defined as

$G(X) = \frac{1}{\sqrt{k}\,\sigma} PX$

Let $x$ and $y$ be two points in the original space, and $u$ and $v$ be their projections. The statistical properties of the inner product under projection perturbation are as follows:

$E[u^T v - x^T y] = 0$

and

$Var[u^T v - x^T y] = \frac{1}{k}\Big(\sum_i x_i^2 \sum_i y_i^2 + \big(\sum_i x_i y_i\big)^2\Big)$

Since $x$ and $y$ are in practice normalized by columns rather than by rows, with large dimensionality $d$ and relatively small $k$ the variance is substantial. A similar conclusion can be extended to the distance relationship. Therefore, projection perturbation does not strictly guarantee the preservation of distance/inner product as rotation or geometric perturbation does, which may significantly downgrade the model accuracy.

7.2.4 Sketch-based Approach

The sketch-based approach is primarily proposed to perturb high-dimensional sparse data [2], such as the datasets in text mining and market basket mining. A sketch of the original record $x = (x_1, \ldots, x_d)$ is defined by an $r$-dimensional vector $s = (s_1, \ldots, s_r)$, $r \ll d$, where

$s_j = \sum_{i=1}^{d} x_i r_{ij}$

The random variable $r_{ij}$ is drawn from $\{-1, +1\}$ with a mean of 0 and is generated from a pseudo-random number generator [5], which produces 4-wise independent values for the variable $r_{ij}$.
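To make the sketch construction concrete, here is a minimal NumPy illustration (not from [2] or [5]): it replaces the 4-wise independent pseudo-random generator with ordinary pseudo-random ±1 draws, fixes the number of components $r$, and shares the same random signs across the two records so that the dot-product estimate is unbiased. As discussed next, the actual scheme relaxes several of these simplifications.

```python
import numpy as np

rng = np.random.default_rng(1)
d, r = 10_000, 200                       # high-dimensional sparse data, small sketch

# Two sparse records x and y (mostly zeros, as in text or market-basket data).
x = np.zeros(d); x[rng.choice(d, 30, replace=False)] = 1.0
y = np.zeros(d); y[rng.choice(d, 30, replace=False)] = 1.0

# r random {-1, +1} vectors; component j of a sketch is sum_i x_i * r_ij.
# (The real scheme derives r_ij from a 4-wise independent generator and may
# vary the number of components per record; both details are simplified here.)
Rsign = rng.choice([-1.0, 1.0], size=(r, d))
s = Rsign @ x                            # sketch of x
t = Rsign @ y                            # sketch of y

est = s @ t / r                          # unbiased estimate of <x, y>
print(est, x @ y)                        # estimate vs. exact dot product
```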
Note that the sketch-based approach differs from projection perturbation in two ways. First, the number of components of each sketch, i.e., $r$, can vary across different records and is carefully controlled so as to provide a uniform measure of privacy guarantee across different records. Second, the $r_{ij}$ differ from record to record; there is no fixed projection matrix across records.

The sketch-based approach has a few statistical properties that enable approximate calculation of the dot product of the original data records from their sketches. Let $s$ and $t$, with the same number of components $r$, be the sketches of the original records $x$ and $y$, respectively. The dot product of $x$ and $y$ can be estimated from the sketches as $\langle s, t\rangle/r$, which is unbiased:

$E[\langle s, t\rangle / r] = \langle x, y\rangle$

The variance of this estimate is determined by the few non-zero entries in the sparse original vectors:

$Var(\langle s, t\rangle / r) = \Big(\sum_{i=1}^{d}\sum_{l=1}^{d} x_i^2 y_l^2 - \big(\sum_{i=1}^{d} x_i y_i\big)^2\Big)/r \qquad (7.1)$

On the other hand, the original value $x_k$ in the vector $x$ can also be estimated by privacy attackers, with a precision determined by its variance $\big(\sum_{i=1}^{d} x_i^2 - x_k^2\big)/r$, $k = 1, \ldots, d$. The larger this variance is, the better the original value is protected. Therefore, by decreasing $r$ the level of privacy guarantee is possibly increased; however, the precision of the dot-product estimate (Eq. 7.1) is decreased. This typical tradeoff has to be carefully controlled in practice [2].

7.2.5 Geometric Perturbation

Geometric perturbation is an enhancement to rotation perturbation that incorporates additional components, such as random translation perturbation and noise addition, into the basic form of multiplicative perturbation $Y = R \times X$. By adding random translation perturbation and noise addition, geometric perturbation exhibits more robustness in countering attacks than simple rotation-based perturbation [9].

Let $t_{d\times 1}$ represent a random vector. We define a translation matrix as follows.

Definition 1 $\Psi$ is a translation matrix if $\Psi = [t, t, \ldots, t]_{d\times N}$, i.e., $\Psi_{d\times N} = t_{d\times 1}\mathbf{1}^T_{N\times 1}$, where $\mathbf{1}_{N\times 1}$ is the vector of $N$ ones.

Let $\Delta_{d\times N}$ be a random noise matrix, where each element is an independently and identically distributed (i.i.d.) variable $\varepsilon_{ij}$, e.g., a Gaussian noise $N(0, \sigma^2)$. Geometric perturbation is then defined by the function $G(X)$,

$G(X) = RX + \Psi + \Delta$

Clearly, translation perturbation does not change distances, since for any pair of points $x$ and $y$, $\|(x+t) - (y+t)\| = \|x - y\|$. Compared with rotation perturbation alone, it protects the rotation center from attacks and adds additional difficulty to ICA-based attacks. However, translation perturbation does not preserve the inner product.

In [9], it is shown that by adding an appropriate level of noise $\Delta$, one can effectively prevent knowledgeable attackers from performing distance-based data reconstruction, since noise addition perturbs distances, which protects the perturbation from distance-inference attacks. For example, the experiments in [9] show that a Gaussian noise $N(0, \sigma^2)$ is effective in countering distance-inference attacks. Although noise addition prevents distances from being fully preserved, a low-intensity noise will not change the class boundary or cluster membership much. In addition, the noise component is optional: if the data owner makes sure that the original data records are secure and nobody except the data owner knows any record in the original dataset, the noise component can be removed from geometric perturbation.
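The complete geometric perturbation $G(X) = RX + \Psi + \Delta$ can be sketched in a few lines. This is an illustrative sketch only, not the implementation from [9]: the rotation is drawn via the same QR-based Haar approximation used earlier, and the noise level $\sigma$ and the translation distribution are arbitrary assumptions.

```python
import numpy as np

def geometric_perturbation(X, sigma=0.05, rng=None):
    """Return (Y, R, t) where Y = R X + Psi + Delta.

    X is d x N with data tuples as columns; Psi repeats a random translation
    vector t in every column; Delta is i.i.d. Gaussian noise.
    """
    rng = np.random.default_rng() if rng is None else rng
    d, N = X.shape
    A = rng.standard_normal((d, d))
    Q, Rq = np.linalg.qr(A)
    R = Q * np.sign(np.diag(Rq))            # random orthonormal (rotation) matrix
    t = rng.uniform(0.0, 1.0, size=(d, 1))  # random translation vector
    Psi = t @ np.ones((1, N))               # Psi = t * 1^T
    Delta = sigma * rng.standard_normal((d, N))
    return R @ X + Psi + Delta, R, t

rng = np.random.default_rng(2)
X = rng.standard_normal((4, 50))
Y, R, t = geometric_perturbation(X, sigma=0.05, rng=rng)

# Distances are preserved up to the (small) additive noise.
print(np.linalg.norm(X[:, 0] - X[:, 1]), np.linalg.norm(Y[:, 0] - Y[:, 1]))
```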
7.3 Transformation Invariant Data Mining Models

By using multiplicative perturbation algorithms, we can mine the perturbed data directly with a set of existing "transformation-invariant data mining models", instead of developing new data mining algorithms to mine the perturbed data [4]. In this section, we define the concept of transformation-invariant mining models using the example of "transformation-invariant classifiers", and then extend the discussion to transformation-invariant models in data classification and data clustering.

7.3.1 Definition of Transformation Invariant Models

Generally speaking, a transformation-invariant model, if trained or mined on the transformed data, performs as well as the model based on the original data. We take the classification problem as an example. A classification problem is also a function approximation problem: classifiers are the functions learned from the training data [16]. In the following discussion, we use functions to represent classifiers. Let $\hat{f}_X$ represent a classifier $\hat{f}$ trained with dataset $X$ and $\hat{f}_X(Y)$ be the classification result on the dataset $Y$. Let $T(X)$ be any transformation function, which transforms the dataset $X$ to another dataset $T(X)$. We use $Err(\hat{f}_X(Y))$ to denote the error rate of classifier $\hat{f}_X$ on testing data $Y$, and let $\varepsilon$ be some small real number, $|\varepsilon| < 1$.

Definition 2 A classifier $\hat{f}$ is invariant to a transformation $T$ if and only if $Err(\hat{f}_X(Y)) = Err(\hat{f}_{T(X)}(T(Y))) + \varepsilon$ for any training dataset $X$ and testing dataset $Y$.

With the stricter condition $\hat{f}_X(Y) \equiv \hat{f}_{T(X)}(T(Y))$, we get Proposition 2.

Proposition 2 In particular, if $\hat{f}_X(Y) \equiv \hat{f}_{T(X)}(T(Y))$ is satisfied for any training dataset $X$ and testing dataset $Y$, the classifier is invariant to the transformation $T(X)$.

For instance, if a classifier $\hat{f}$ is invariant to rotation transformation, we call it a rotation-invariant classifier. A similar definition applies to translation-invariant classifiers. In the subsequent sections, we list some examples of transformation-invariant models for classification and clustering. Some detailed proofs can be found in [7].

7.3.2 Transformation-Invariant Classification Models

kNN Classifiers and Kernel Methods

A k-Nearest-Neighbor (kNN) classifier determines the class label of a point by looking at the labels of its $k$ nearest neighbors in the training dataset and classifies the point to the class that most of its neighbors belong to. Since the distances between points are not changed by rotation and translation transformations, the $k$ nearest neighbors are not changed, and thus the classification result is not changed either.

Since the kNN classifier is a special case of kernel methods, we can also extend this conclusion to kernel methods. Here, we use "kernel methods" to refer to the traditional local methods [16]. In general, since the kernels depend on local points, whose locality is evaluated by distance, transformations that preserve distance make kernel methods invariant.

Support Vector Machines

The Support Vector Machine (SVM) classifier also utilizes kernel functions in training and classification. However, it has an explicit training procedure, which differentiates it from the traditional kernel methods we just discussed. We can use a two-step procedure to prove that an SVM classifier is invariant to a transformation:
1) training with the transformed dataset generates the same set of model parameters; 2) the classification function with those model parameters is also invariant to the transformation. The detailed proof involves the quadratic optimization procedure for SVM. We have demonstrated that SVM classifiers with typical kernels are invariant to rotation transformation [7]. It turns out that if a transformation leaves the kernel invariant, then the SVM classifier is also invariant to the transformation.

Three popular choices of kernels are discussed in the SVM literature [10, 16]:

$d$-th degree polynomial: $K(x, x') = (1 + \langle x, x'\rangle)^d$,

radial basis: $K(x, x') = \exp(-\|x - x'\|/c)$,

neural network: $K(x, x') = \tanh(\kappa_1\langle x, x'\rangle + \kappa_2)$

Apparently, all three are invariant to rotation transformation. Since translation does not preserve the inner product, it is not straightforward to prove that SVMs with polynomial and neural network kernels are invariant to translation perturbation. However, experiments [9] showed that these classifiers are also invariant to translation perturbation.

Linear Classifiers

Linear classification models are popular methods due to their simplicity. In linear classification models, the classification boundary is modeled as a hyperplane, which is clearly a geometric concept. It is easy to see that distance-preserving transformations, such as rotation and translation, keep the classes separated if they are originally separated. There is also a detailed proof showing that a typical linear classifier, the perceptron, is invariant to rotation transformation [7].

7.3.3 Transformation-Invariant Clustering Models

Most clustering models are based on Euclidean distance, such as the popular k-means algorithm [16]. Many focus on the density property, which is derived from Euclidean distance, such as DBSCAN [11], DENCLUE [17], and OPTICS [6]. All of these clustering models are invariant to Euclidean-distance-preserving transformations, such as rotation and translation. There are other clustering models that employ different distance metrics [19], such as linkage-based clustering and cosine-distance-based clustering. As long as we can find a transformation preserving the particular distance metric, the corresponding clustering model will be invariant to this transformation.

7.4 Privacy Evaluation for Multiplicative Perturbation

The goal of data perturbation is twofold: preserving the accuracy of specific data mining models (data utility) and preserving the privacy of the original data (data privacy). The discussion of transformation-invariant data mining models has shown that multiplicative perturbations can theoretically guarantee zero loss of accuracy for a number of data mining models. The challenge is to find one that maximizes the privacy guarantee with respect to potential attacks. We dedicate this section to discussing how good a multiplicative perturbation is at preserving privacy under a set of privacy attacks. We first define a multi-column (or multidimensional) privacy measure for evaluating the privacy quality of a multiplicative perturbation over a given dataset. Then, we introduce a framework for privacy evaluation that can incorporate different attack analyses into the evaluation of the privacy guarantee.
We show that using this framework, we can employ certain optimization methods (Section 7.5.5) to find a good perturbation among a set of randomly generated perturbations, one that is locally optimal for the given dataset.

7.4.1 A Conceptual Multidimensional Privacy Evaluation Model

In practice, different columns (or dimensions, or attributes) may have different privacy concerns. Therefore, we advocate that the general-purpose privacy metric $\Phi$ defined for an entire dataset should be based on column privacy metrics, rather than on point-based privacy metrics such as distance-based metrics. A conceptual privacy model is defined as $\Phi = \Phi(\mathbf{p}, \mathbf{w})$, where $\mathbf{p}$ denotes the column privacy metric vector $\mathbf{p} = [p_1, p_2, \ldots, p_d]$ of a given dataset $X$, and $\mathbf{w} = (w_1, w_2, \ldots, w_d)$ denotes the privacy weights associated with the $d$ columns, respectively. The column privacy $p_i$ itself is defined by a function, which we discuss later. In summary, the model suggests that the column-wise privacy metric should be calculated first and that $\Phi$ should then be used to generate a composite metric.

We first describe some basic designs for the components of the function $\Phi$; the concrete design of the function generating $\mathbf{p}$ is given in a separate subsection. The first design idea is to take column importance into account when unifying the privacy of different columns. Intuitively, the more important a column is, the higher the level of privacy guarantee required for the perturbed data column. Since $w_i$ denotes the importance of column $i$ in terms of preserving privacy, we use $p_i/w_i$ to represent the weighted column privacy of column $i$.

The second concept is the minimum privacy guarantee and the average privacy guarantee among all columns. Normally, when we measure the privacy guarantee of a multidimensional perturbation, we need to pay more attention to the column that has the lowest weighted column privacy, because such a column could become the weakest link of privacy protection. Hence, the first composition function is the minimum privacy guarantee:

$\Phi_1 = \min_{i=1}^{d}\{p_i/w_i\}$

Similarly, the average privacy guarantee of the multi-column perturbation is defined by

$\Phi_2 = \frac{1}{d}\sum_{i=1}^{d} p_i/w_i$

which could be another interesting measure. Note that these two functions assume that the $p_i$ are comparable across columns, which is one of the important requirements in the following discussion.

7.4.2 Variance of Difference as Column Privacy Metric

Having defined the conceptual privacy model, we move on to the design of the column-wise privacy metric. Intuitively, for a data perturbation approach, the quality of preserved privacy can be understood as the level of difficulty in estimating the original data from the perturbed data. Therefore, how statistically different the estimated data is from the original data is an intuitive measure. We use a variance-of-difference (VoD) based approach, which has a similar form to the naive variance-based evaluation [4], but with very different semantics.

Let the difference between the original column data and the estimated data be a random variable $D_i$. Without any knowledge about the original data, the mean and variance of the difference characterize the quality of the estimation; a perfect estimation has zero mean and zero variance.
Since the mean of the difference, i.e., the bias of the estimation, can easily be removed if the attacker knows the original distribution of the column, we use only the variance of the difference (VoD) as the primary metric to determine the level of difficulty in estimating the original data.

VoD is formally defined as follows. Let $X_i$ be a random variable representing column $i$, let $X_i'$ be the estimated result of $X_i$ (it would not be appropriate to use only the perturbed data for privacy estimation if we consider the potential attacks), and let $D_i = X_i' - X_i$. Let $E[D_i]$ and $Var(D_i)$ denote the mean and the variance of $D_i$, respectively. Then the VoD for column $i$ is $Var(D_i)$. Let an estimate of a certain value, say $x_i$, be $x_i'$, let $\sigma = \sqrt{Var(D_i)}$, and let $c$ denote a confidence parameter depending on both the distribution of $D_i$ and the desired confidence level. The corresponding original value $x_i$ in $X_i$ is located in the range

$[\,x_i' - E[D_i] - c\sigma,\; x_i' - E[D_i] + c\sigma\,]$

After removing the effect of $E[D_i]$, the width of the estimation range, $2c\sigma$, represents the quality of estimating the original value, which proportionally reflects the level of privacy guarantee: a smaller range means a better estimation, i.e., a lower level of privacy guarantee. For simplicity, we often use $\sigma$ to represent the privacy level.

VoD only defines the privacy guarantee for a single column. However, we usually need to evaluate the privacy level of all perturbed columns together when a multiplicative perturbation is applied. The single-column VoD does not work across different columns, since different column value ranges may result in very different VoDs. For example, the VoD of age may be much smaller than the VoD of salary. Therefore, the same amount of VoD is not equally effective for columns with different value ranges. One straightforward way to unify the different value ranges is normalization over the original dataset and the perturbed dataset. Normalization can be done in various ways, such as max/min normalization or standardized normalization [30]. After normalization, the level of privacy guarantee for each column should be approximately comparable. Note that normalization after the VoD calculation, such as the relative variance $VoD_i/Var(X_i)$, is not appropriate, since a small $Var(X_i)$ will inappropriately inflate the value.

7.4.3 Incorporating Attack Evaluation

Privacy evaluation has to consider resilience to attacks as well. The VoD evaluation has a unique advantage in incorporating attack analysis into privacy evaluation. In general, let $X$ be the normalized original dataset, $P$ be the perturbed dataset, and $O$ be the estimated/observed dataset obtained through "attack simulation". We can calculate $VoD(X_i, O_i)$ for column $i$ under different attacks. For example, attacks on rotation perturbation can be evaluated by the following steps (details are discussed shortly):

1. Naive Estimation: $O \equiv P$;

2. ICA-based Reconstruction: Independent Component Analysis (ICA) is used to estimate $R$. Let $\hat{R}$ be the estimate of $R$; the estimated data $\hat{R}^{-1}P$ is aligned with the known column statistics to obtain the dataset $O$;

3. Distance-based Inference: knowing a set of special points in $X$ that can be mapped to a certain set of points in $P$, the mapping helps to obtain an estimated rotation $\hat{R}$, and then $O = \hat{R}^{-1}P$.
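To show how the column-wise VoD and the weighted minimum/average guarantees $\Phi_1$ and $\Phi_2$ fit together with attack simulation, here is a small NumPy sketch. It is illustrative only: the simulated "attack" is just the naive estimation $O \equiv P$, and the max/min normalization, equal weights, and synthetic column scales are assumptions rather than anything prescribed by the chapter.

```python
import numpy as np

def normalize(X):
    """Max/min normalization of each attribute (each row of the d x N matrix) to [0, 1]."""
    lo, hi = X.min(axis=1, keepdims=True), X.max(axis=1, keepdims=True)
    return (X - lo) / (hi - lo)

def vod(X_norm, O_norm):
    """Variance of difference per column (attribute), computed on normalized data."""
    return np.var(O_norm - X_norm, axis=1)

def privacy_guarantees(X, O, w):
    """Minimum (Phi_1) and average (Phi_2) weighted column privacy,
    using sigma = sqrt(VoD_i) as the per-column privacy level p_i."""
    p = np.sqrt(vod(normalize(X), normalize(O)))
    return np.min(p / w), np.mean(p / w)

rng = np.random.default_rng(3)
d, N = 4, 500
# Attributes with very different value ranges, hence the need for normalization.
X = rng.standard_normal((d, N)) * np.array([[1.0], [10.0], [100.0], [5.0]])

# Rotation perturbation, then simulate the naive-estimation attack: O = P.
Q, Rq = np.linalg.qr(rng.standard_normal((d, d)))
P = (Q * np.sign(np.diag(Rq))) @ X
O = P                                    # attacker's estimate under naive estimation

w = np.ones(d)                           # equal privacy weights (assumed)
phi1, phi2 = privacy_guarantees(X, O, w)
print("minimum guarantee:", phi1, "average guarantee:", phi2)
```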
7.4.4 Other Metrics

Other metrics include the distance-based risk of privacy breach, which was used to evaluate the level of privacy breach when a few pairs of original data points and their maps in the perturbed data are known [27]. Assume $\hat{x}$ is the estimate of an original point $x$. An $\epsilon$-privacy breach occurs if

$\|\hat{x} - x\| \le \epsilon\|x\|$

This roughly means that if the estimate falls within a small local area around the original point, the risk of privacy breach is high. However, even when the estimated point is distant from the original point, the estimation can still be effective: a large distance may be determined by the difference in only a few columns, while the other columns may be very similar. That is why we should consider column-wise privacy metrics.

7.5 Attack Resilient Multiplicative Perturbations

Attack analysis is the essential component in the privacy evaluation of multiplicative perturbation. The previous section set up an evaluation model that can conveniently incorporate attack analysis through "attack simulation". Namely, privacy attacks on multiplicative perturbations are methods for estimating the original points (or the values of particular columns) from the perturbed data, using a certain level of additional knowledge about the original data. Since the perturbed data is public, the effectiveness of an attack is determined solely by the additional knowledge the attacker may have.

In the following sections, we describe some potential inference attacks on multiplicative perturbations, focusing primarily on rotation perturbation. These attacks are organized according to the different levels of knowledge that an attacker may have. We hope that from this section interested readers will gain a better understanding of the attacks on general multiplicative perturbations and be able to apply appropriate tools to counter them. Most of the content of this section can be found in [9]; we present only the basic ideas here.

7.5.1 Naive Estimation to Rotation Perturbation

When the attacker has no additional information, we call the attack naive estimation: the attacker simply estimates the original data from the perturbed data. In this case, an appropriate rotation perturbation is enough to achieve a high level of privacy guarantee. With the VoD metric over the normalized data, we can formally analyze the privacy guarantee provided by the rotation-perturbed data. Let $X$ be the normalized dataset, $X'$ be the rotation of $X$, and $I_d$ be the $d$-dimensional identity matrix. The VoD of column $i$ can be evaluated by

$Cov(X' - X)_{(i,i)} = Cov(RX - X)_{(i,i)} = \big((R - I_d)\,Cov(X)\,(R - I_d)^T\big)_{(i,i)} \qquad (7.2)$

Let $r_{ij}$ represent the element $(i,j)$ of the matrix $R$, and $c_{ij}$ be the element $(i,j)$ of the covariance matrix of $X$. The VoD for the $i$th column is computed as

$Cov(X' - X)_{(i,i)} = \sum_{j=1}^{d}\sum_{k=1}^{d} r_{ij} r_{ik} c_{kj} - 2\sum_{j=1}^{d} r_{ij} c_{ij} + c_{ii} \qquad (7.3)$

When the random rotation matrix is generated following the Haar distribution, a considerable number of the matrix entries approximately follow the independent normal distribution $N(0, 1/d)$ [20]. For simplicity and ease of understanding, we assume that all entries of the random rotation matrix approximately follow the independent normal distribution $N(0, 1/d)$. Therefore, random rotations will make $VoD_i$ vary around the mean value $c_{ii}$, as shown in the following equation.
$E[VoD_i] \sim \sum_{j=1}^{d}\sum_{k=1}^{d} E[r_{ij}]E[r_{ik}]c_{kj} - 2\sum_{j=1}^{d} E[r_{ij}]c_{ij} + c_{ii} = c_{ii}$

This means that the original column variance could substantially influence the result of a random rotation. However, the expectation of the VoDs is not the only factor determining the final privacy guarantee; we should also look at the variance of the VoDs. If the variance of the VoDs is considerably large, we still have a good chance of finding a rotation with high VoDs among a set of sampled random rotations, and the larger $Var(VoD_i)$ is, the more likely it is that randomly generated rotation matrices can provide a high privacy level. With the approximate-independence assumption, we have

$Var(VoD_i) \sim \sum_{j=1}^{d}\sum_{k=1}^{d} Var(r_{ij})Var(r_{ik})c_{kj}^2 + 4\sum_{j=1}^{d} Var(r_{ij})c_{ij}^2 \sim O\Big(\frac{1}{d^2}\sum_{j=1}^{d}\sum_{k=1}^{d} c_{kj}^2 + \frac{4}{d}\sum_{j=1}^{d} c_{ij}^2\Big)$

The above result shows that $Var(VoD_i)$ is approximately related to the average of the squared covariance entries, with more influence from row $i$ of the covariance matrix. Therefore, by inspecting the covariance matrix of the original dataset and estimating $Var(VoD_i)$, we can estimate the chance of finding a random rotation that gives a high privacy guarantee.

Rotation Center. The basic rotation perturbation uses the origin as the rotation center. Therefore, points around the origin will still be close to the origin after the perturbation, which leads to weaker privacy protection for these points. The attack on the rotation center can be regarded as another kind of naive estimation. This problem is addressed by random translation perturbation, which hides the rotation center. More sophisticated attacks on the combination of rotation and translation would have to utilize the ICA technique with sufficient additional knowledge, as described shortly.

7.5.2 ICA-Based Attacks

In this section, we introduce a higher-level attack based on data reconstruction. The basic method for reconstructing $X$ from the perturbed data $RX$ is the Independent Component Analysis (ICA) technique, developed in signal processing research [18]. The ICA technique can be applied to estimate the independent components (the row vectors in our definition) of the original dataset $X$ from the perturbed data if the following conditions are satisfied:

1. The source row vectors are independent;

2. All source row vectors are non-Gaussian, with the possible exception of one row;

3. The number of observed row vectors is at least as large as the number of independent source row vectors;

4. The transformation matrix $R$ is of full column rank.

For rotation matrices, the third and fourth conditions are always satisfied. However, the first two conditions, although practical for signal processing, are often not satisfied in data classification or clustering. Furthermore, there are a few more difficulties in applying a direct ICA-based attack. First of all, even if ICA can be performed successfully, the order of the original independent components cannot be preserved or determined through ICA alone [18]. Formally, any permutation matrix $P$ and its inverse $P^{-1}$ can be substituted into the model, giving the perturbed data $RX = (RP^{-1})(PX)$; ICA may thus return an estimate of some permuted source $PX$. Hence, we cannot identify a particular column without more knowledge about the original data.
Second, even if the ordering of the columns can be identified, ICA reconstruction does not guarantee preservation of the variance of the original signal: the estimated signal is often scaled, and we do not know by how much unless we know the original value range of the column. Therefore, without knowing the basic statistics of the original columns, an ICA attack is not effective.

However, such basic column statistics are not impossible to obtain in some cases. Assume now that attackers know the basic statistics, including the column max/min values and the probability density function (PDF), or empirical PDF, of each column. An enhanced ICA-based attack can then be described as follows:

1. Run the ICA algorithm to get a reconstructed dataset;

2. For each pair $(O_i, X_j)$, where $O_i$ is a reconstructed column and $X_j$ is an original column, scale $O_i$ with the max/min values of $X_j$;

3. Compare the PDFs of the scaled $O_i$ and $X_j$ to find the closest match among all possible combinations. Note that the PDFs should be aligned before comparison; [9] gives one method for this alignment.

The above procedure describes how to use ICA and additional knowledge about the original dataset to reconstruct the original dataset precisely. Note that if the four conditions for effective ICA are exactly satisfied and the basic statistics and PDFs are all known and distinct from each other, the basic rotation perturbation is totally broken by the enhanced ICA-based attack. In practice, we can test whether the first two conditions for effective ICA are satisfied to decide whether we can safely use rotation perturbation when the column distributional information is released. If ICA-based attacks can be carried out effectively, it is also trivial to reveal an additional translation perturbation, which is used to protect the rotation center.

If the first and second conditions are not satisfied, as for most datasets in data classification and clustering, precise ICA reconstruction cannot be achieved. Under this circumstance, different rotation perturbations may result in different levels of privacy guarantee, and the goal is to find one perturbation that is resilient to the enhanced ICA-based attacks.

For projection perturbation [28], the third condition for effective ICA is not satisfied either. Although overcomplete ICA is available for this particular case [25], it is generally ineffective to break projection perturbation with ICA-based attacks. The major concern for projection perturbation is to find one that preserves the utility of the perturbed data.

7.5.3 Distance-Inference Attacks

In the previous sections, we discussed naive estimation and ICA-based attacks. In the following discussion, we assume that, besides the information necessary to perform the attacks already discussed, the attacker manages to obtain more knowledge about the original dataset. We consider two scenarios: 1) the attacker knows at least $d+1$ linearly independent original data records, $X = \{x_1, x_2, \ldots, x_{d+1}\}$; or 2) the attacker can obtain only fewer than $d$ linearly independent points. The attacker then tries to find the mapping between these points and their images in the perturbed dataset, denoted by $O = \{o_1, o_2, \ldots, o_{d+1}\}$, to break the rotation perturbation and possibly also the translation perturbation. In both scenarios, it is possible to find the images of the known points in the perturbed data.
In particular, if a few original points are highly distinguishable, such as "outliers", their images in the perturbed data can be correctly identified with high probability for low-dimensional, small datasets (fewer than 4 dimensions). With considerable cost, simple exhaustive search is not impossible for higher-dimensional and larger datasets, although the probability of identifying the exact images is relatively low. For scenario 1), with the known mapping, the rotation $R$ and translation $t$ can be precisely calculated if the incomplete geometric perturbation $G(X) = RX + \Psi$ is applied. Therefore, the threat to any other data point in the original dataset is substantial.

Figure 7.1. Using known points and the distance relationship to infer the rotation matrix.

For scenario 2), if we assume the exact images of the known original points are identified, there is a comprehensive discussion of the potential privacy breach for rotation perturbation in [27]. For rotation perturbation, i.e., $O = RX$ between the known points $X$ and their images $O$, if $X$ consists of fewer than $d$ points, there are numerous estimates of $R$, denoted by $\hat{R}$, satisfying the relationship between $X$ and $O$. The weakest points, apart from the known points $X$, are those around $X$. The paper [27] gives an estimate of the risk of privacy breach for a point $x$ when a set of points $X$ and their images $O$ are known. The definition is based on the $\epsilon$-privacy breach (Section 7.4.4). The probability of an $\epsilon$-privacy breach, $\rho(x, \epsilon)$, for any $x$ in the original dataset can be estimated as follows. Let $d(x, X)$ be the distance between $x$ and $X$. Then

$\rho(x, \epsilon) = \frac{2}{\pi}\arcsin\Big(\frac{\epsilon\|x\|}{2\,d(x, X)}\Big)$ if $\epsilon\|x\| < 2\,d(x, X)$, and $\rho(x, \epsilon) = 1$ otherwise.

Note that the $\epsilon$-privacy breach is not sufficient for column-wise privacy evaluation; thus, the above definition may not be sufficient either.

In order to protect against distance-inference attacks in both scenarios, an additional noise component $\Delta$ is introduced to form the complete version of geometric perturbation, $G(X) = RX + \Psi + \Delta$, where $\Delta = [\delta_1, \delta_2, \ldots, \delta_N]$ and $\delta_i$ is a $d$-dimensional Gaussian random vector. The $\Delta$ component reduces the probability of finding exact images and the precision of the estimates of $R$ and $\Psi$, which significantly increases the resilience to distance-inference attacks.

Assume the attacker still knows enough pairs of independent (point, image). With the additional noise component, the most effective way to estimate the rotation/translation components is linear regression. The steps are: 1) filtering out the translation component; 2) applying linear regression to estimate $R$; 3) plugging the estimate $\hat{R}$ back in to estimate the translation component; 4) estimating the original data with $\hat{R}$ and $\hat{\Psi}$. A detailed procedure is given in [9]. We can simulate this procedure to estimate the resilience of a perturbation.

Note that the additional noise component also implies that we have to sacrifice some model accuracy to gain stronger privacy protection. An empirical study has been performed on a number of datasets to evaluate the relationship between noise intensity, resilience to attacks, and model accuracy [9]. In general, a low-intensity noise component is enough to reduce the risk of being attacked while still preserving model accuracy. However, the noise component is required only when the data owner is sure that a small part of the original data has been released.
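The regression-based estimation just outlined can be sketched as follows. This is a simplified illustration under strong assumptions, not the procedure from [9]: the attacker is assumed to know the exact images of $k > d$ original points, the translation is removed by centering both point sets, and ordinary least squares is used to estimate $R$.

```python
import numpy as np

rng = np.random.default_rng(4)
d, N, sigma = 5, 200, 0.05

X = rng.standard_normal((d, N))
Q, Rq = np.linalg.qr(rng.standard_normal((d, d)))
R = Q * np.sign(np.diag(Rq))
t = rng.uniform(0.0, 1.0, (d, 1))
Y = R @ X + t + sigma * rng.standard_normal((d, N))   # geometric perturbation

# Attacker knows k > d original points and their images (columns 0..k-1).
k = 20
Xk, Yk = X[:, :k], Y[:, :k]

# Step 1: filter out the translation by centering both known point sets.
Xc = Xk - Xk.mean(axis=1, keepdims=True)
Yc = Yk - Yk.mean(axis=1, keepdims=True)

# Step 2: least-squares estimate of R from Yc ~= R_hat @ Xc.
R_hat = Yc @ np.linalg.pinv(Xc)

# Steps 3-4: recover the translation and estimate the remaining records.
t_hat = (Yk - R_hat @ Xk).mean(axis=1, keepdims=True)
X_hat = np.linalg.inv(R_hat) @ (Y - t_hat)

# The added noise sigma limits how precise the reconstruction can be.
print("mean reconstruction error:", np.mean(np.abs(X_hat - X)))
```

Increasing `sigma` in this sketch increases the reconstruction error, which is the intuition behind appending the noise component to counter distance-inference attacks at the cost of some model accuracy.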
7.5.4 Attacks with More Prior Knowledge

There are also extreme cases, which may not happen in practice, in which the attacker is assumed to know a considerable number of original data points, and these points form a sample set from which higher-order statistical properties of the original dataset, such as the covariance matrix, can be approximately estimated. Using the sample statistics and the sample points, the attacker can mount more effective attacks. Note that, in general, if the attacker already knows this much about the original data, its privacy may already be breached; publishing more original data is not advisable, and further discussion of perturbation becomes less meaningful. However, the techniques developed in these attacks, such as the PCA-based attack [27] and the AK-ICA attack [15], might eventually be utilized in other ways to enhance multiplicative perturbations in the future. We do not give a detailed description of these attacks here due to space limitations; they are covered in another dedicated chapter.

7.5.5 Finding Attack-Resilient Perturbations

We have discussed the unified privacy metric for evaluating the quality of a random geometric perturbation. Some known inference attacks have been analyzed under the framework of multi-column privacy evaluation, which allows us to design an algorithm for choosing a good geometric perturbation with respect to these attacks (if the attacker knows a considerable amount of the original data, however, it is advisable not to release the perturbed dataset at all). A deterministic algorithm for optimizing the perturbation might also provide extra clues to privacy attackers; therefore, a certain level of randomization in the perturbation optimization is also desirable.

A randomized perturbation-optimization algorithm for geometric perturbation was proposed in [9]. We briefly describe it as follows. Algorithm 1 is a hill-climbing method, which runs for a given number of iterations to find a geometric perturbation that maximizes the minimum privacy guarantee as far as possible. Initially, a random translation is selected, which needs no optimization at all. In each iteration, the algorithm randomly generates a rotation matrix. Local maximization of VoD [9] is applied to find a better rotation matrix in terms of naive estimation, which is then tested by ICA reconstruction with the algorithm described in Section 7.5.2. The rotation matrix is accepted as the currently best perturbation if it provides a higher minimum privacy guarantee than the previous perturbations. After the iterations, if necessary, a noise component is appended to the perturbation so that the distance-inference attack cannot reduce the privacy guarantee below a safety level $\phi$, e.g., $\phi = 0.2$. Algorithm 1 outputs the rotation matrix $R_t$, the random translation matrix $\Psi$, the noise level $\sigma^2$, and the corresponding privacy guarantee (we use the minimum privacy guarantee in the following algorithm) with respect to the known attacks. If the final privacy guarantee is lower than the expected threshold, the data owner can choose not to release the data. This algorithm provides a framework in which any newly discovered attack can be simulated and evaluated.

7.6 Conclusion

We have reviewed the multiplicative perturbation method as an alternative approach to privacy-preserving data mining.
The design of this category of perturbation algorithms is based on an important principle: by developing perturbation algorithms that can always preserve the mining-task- and model-specific data utility, one can focus on finding a perturbation that provides a higher level of privacy guarantee. We described three representative multiplicative perturbation methods: rotation perturbation, projection perturbation, and geometric perturbation. All aim at preserving the distance relationships in the original data, thus achieving good data utility for a set of classification and clustering models. Another important advantage of using these multiplicative perturbation methods is that we are not required to re-design the existing data mining algorithms in order to perform data mining over the perturbed data.

Algorithm 1 Finding a resilient perturbation($X_{d\times N}$, $w$, $m$)
Input: $X_{d\times N}$: the original dataset; $w$: weights for attributes in privacy evaluation; $m$: the number of iterations.
Output: $R_t$: the selected rotation matrix; $\Psi$: the random translation; $\sigma^2$: the noise level; $p$: privacy quality.
    calculate the covariance matrix $C$ of $X$;
    $p = 0$, and randomly generate the translation $\Psi$;
    for each iteration do
        randomly generate a rotation matrix $R$;
        swap the rows of $R$ to get $R'$, which maximizes $\min_{1\le i\le d}\{\frac{1}{w_i} Cov(R'X - X)_{(i,i)}\}$;
        $p_0$ = the privacy guarantee of $R'$, $p_1 = 0$;
        if $p_0 > p$ then
            generate $\hat{X}$ with ICA;
            $\{(1),(2),\ldots,(d)\} = \mathrm{argmin}_{(1),(2),\ldots,(d)} \sum_{i=1}^{d} \Delta PDF(X_i, O_{(i)})$;
            $p_1 = \min_{1\le k\le d} \frac{1}{w_k} VoD(X_k, O_{(k)})$;
        end if
        if $p < p_1$ then $p = p_1$, $R_t = R'$; end if
    end for
    if $p < \phi$, calculate the noise level $\sigma^2$ so that the minimum privacy guarantee of the final perturbation is greater than $\phi$.

Privacy evaluation and attack analysis are the major challenging issues for multiplicative perturbations. We reviewed the multi-column variance-of-difference (VoD) based evaluation method and the distance-based method. Since column distribution information has a high probability of being released publicly, in principle it is necessary to evaluate the privacy guarantee on a per-column basis. Although this chapter does not intend to enumerate all possible attacks (attack analysis for multiplicative perturbation is still a very active area), we described several types of attacks and organized the discussion according to the level of knowledge that the attacker may have about the original data. We also outlined some techniques developed to date for addressing these attacks. Based on attack analysis and the VoD-based evaluation method, we showed how to find perturbations that locally optimize the level of privacy guarantee with respect to various attacks.

Acknowledgment

This work is partially supported by grants from the NSF CISE CyberTrust program, an IBM faculty award (2006), and an AFOSR grant.

References

[1] AGGARWAL, C. C., AND YU, P. S. A condensation approach to privacy preserving data mining. Proc. of Intl. Conf. on Extending Database Technology (EDBT) 2992 (2004), 183–199.
[2] AGGARWAL, C. C., AND YU, P. S. On privacy-preservation of text and sparse binary data with sketches. SIAM Data Mining Conference (2007).
[3] AGRAWAL, D., AND AGGARWAL, C. C. On the design and quantification of privacy preserving data mining algorithms. Proc. of ACM PODS Conference (2002).
[4] AGRAWAL, R., AND SRIKANT, R. Privacy-preserving data mining. Proc. of ACM SIGMOD Conference (2000).
[5] ALON, N., MATIAS, Y., AND SZEGEDY, M. The space complexity of approximating the frequency moments. Proc. of ACM PODS Conference (1996).
[6] ANKERST, M., BREUNIG, M. M., KRIEGEL, H.-P., AND SANDER, J.
OPTICS: Ordering points to identify the clustering structure. Proc. of ACM SIGMOD Conference (1999), 49–60. [7] CHEN,K.,AND LIU, L. A random geometric perturbation approach to privacy-preserving data classification. Technical Report, College of Computing, Georgia Tech (2005). [8] CHEN,K.,AND LIU, L. A random rotation perturbation approach to privacy preserving data classification. Proc. of Intl. Conf. on Data Mining (ICDM) (2005). [9] CHEN,K.,AND LIU, L. Towards attack-resilient geometric data pertur- bation. SIAM Data Mining Conference (2007). [10] CRISTIANINI,N.,AND SHAWE-TAYLOR,J.An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, 2000. [11] ESTER,M.,KRIEGEL,H.-P.,SANDER,J.,AND XU, X. A density- based algorithm for discovering clusters in large spatial databases with noise. Second International Conference on Knowledge Discovery and Data Mining (1996), 226–231. [12] EVFIMIEVSKI,A.,GEHRKE,J.,AND SRIKANT, R. Limiting privacy breaches in privacy preserving data mining. Proc. of ACM PODS Con- ference (2003). [13] EVFIMIEVSKI,A.,SRIKANT,R.,AGRAWAL,R.,AND GEHRKE, J. Pri- vacy preserving mining of association rules. Proc. of ACM SIGKDD Conference (2002). 180 Privacy-Preserving Data Mining: Models and Algorithms [14] FEIGENBAUM,J.,ISHAI,Y.,MALKIN,T.,NISSIM,K.,STRAUSS,M., AND WRIGHT, R. N. Secure multiparty computation of approxima- tions. In ICALP ’01: Proceedings of the 28th International Colloquium on Automata, Languages and Programming, (2001), Springer-Verlag, pp. 927–938. [15] GUO,S.,AND WU, X. Deriving private information from arbitrarily projected data. In Proceedings of the 11th European Conference on Prin- ciples and Practice of Knowledge Discovery in Databases (PKDD07) (Warsaw, Poland, Sept 2007). [16] HASTIE,T.,TIBSHIRANI,R.,AND FRIEDMANN,J. The Elements of Statistical Learning. Springer-Verlag, 2001. [17] HINNEBURG,A.,AND KEIM, D. A. An efficient approach to cluster- ing in large multimedia databases with noise. Proc. of ACM SIGKDD Conference (1998), 58–65. [18] HYVARINEN,A.,KARHUNEN,J.,AND OJA,E. Independent Compo- nent Analysis. Wiley-Interscience, 2001. [19] JAIN,A.K.,AND DUBES, R. C. Data clustering: A review. ACM Com- puting Surveys 31 (1999), 264–323. [20] JIANG, T. How many entries in a typical orthogonal matrix can be ap- proximated by independent normals. To appear in The Annals of Proba- bility (2005). [21] JOHNSON,W.B.,AND LINDENSTRAUSS, J. Extensions of lipshitz map- ping into hilbert space. Contemporary Mathematics 26 (1984). [22] KARGUPTA,H.,DATTA,S.,WANG,Q.,AND SIVAKUMAR,K.On the privacy preserving properties of random data perturbation techniques. Proc. of Intl. Conf. on Data Mining (ICDM) (2003). [23] KIM,J.J.,AND WINKLER, W. E. Multiplicative noise for masking continuous data. Tech. Rep. Statistics #2003-01, Statistical Research Di- vision, U.S. Bureau of the Census, Washington D.C., April 2003. [24] LEFEVRE,K.,DEWITT,D.J.,AND RAMAKRISHNAN, R. Mondrain multidimensional k-anonymity. Proc. of IEEE Intl. Conf. on Data Eng. (ICDE) (2006). [25] LEWICKI,M.S.,AND SEJNOWSKI, T. J. Learning overcomplet repre- sentations. Neural Computation 12, 2 (2000). [26] LINDELL,Y.,AND PINKAS, B. Privacy preserving data mining. Journal of Cryptology 15, 3 (2000), 177–206. [27] LIU,K.,GIANNELLA,C.,AND KARGUPTA, H. An attacker’s view of distance preserving maps for privacy preserving data mining. 
In Pro- ceedings of the 10th European Conference on Principles and Practice of Multiplicative Perturbations for Privacy 181 Knowledge Discovery in Databases (PKDD’06) (Berlin, Germany, Sep- tember 2006). [28] LIU,K.,KARGUPTA,H.,AND RYAN, J. Random projection-based mul- tiplicative data perturbation for privacy preserving distributed data min- ing. IEEE Transactions on Knowledge and Data Engineering (TKDE) 18, 1 (January 2006), 92–106. [29] MACHANAVAJJHALA,A.,GEHRKE,J.,KIFER,D.,AND VENKITA- SUBRAMANIAM, M. l-diversity: Privacy beyond k-anonymity. Proc. of IEEE Intl. Conf. on Data Eng. (ICDE) (2006). [30] NETER,J.,KUTNER,M.H.,NACHTSHEIM,C.J.,AND WASSERMAN, W. Applied Linear Statistical Methods. WCB/McGraw-Hill, 1996. [31] OLIVEIRA,S.R.M.,AND ZA¨IANE, O. R. Privacy preservation when sharing data for clustering. In Proceedings of the International Workshop on Secure Data Management in a Connected World (Toronto, Canada, August 2004), pp. 67–82. [32] SADUN,L.Applied Linear Algebra: the Decoupling Principle. Prentice Hall, 2001. [33] STEWART, G. The efficient generation of random orthogonal matrices with an application to condition estimation. SIAM Journal on Numerical Analysis 17 (1980). [34] SWEENEY, L. k-anonymity: a model for protecting privacy. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems 10,5 (2002). [35] VAIDYA,J.,AND CLIFTON, C. Privacy preserving k-means cluster- ing over vertically partitioned data. Proc. of ACM SIGKDD Conference (2003). Chapter 8 A Survey of Quantification of Privacy Preserving Data Mining Algorithms Elisa Bertino Department of Computer Science Purdue University bertino@cs.purdue.edu Dan Lin Department of Computer Science Purdue University lindan@cs.purdue.edu Wei Jiang Department of Computer Science Purdue University wjiang@cs.purdue.edu Abstract The aim of privacy preserving data mining (PPDM) algorithms is to extract rel- evant knowledge from large amounts of data while protecting at the same time sensitive information. An important aspect in the design of such algorithms is the identification of suitable evaluation criteria and the development of related benchmarks. Recent research in the area has devoted much effort to determine a trade-off between the right to privacy and the need of knowledge discovery. It is often the case that no privacy preserving algorithm exists that outperforms all the others on all possible criteria. Therefore, it is crucial to provide a compre- hensive view on a set of metrics related to existing privacy preserving algorithms so that we can gain insights on how to design more effective measurement and PPDM algorithms. In this chapter, we review and summarize existing criteria and metrics in evaluating privacy preserving techniques. Keywords: Privacy metric. 184 Privacy-Preserving Data Mining: Models and Algorithms 8.1 Introduction Privacy is one of the most important properties that an information system must satisfy. For this reason, several efforts have been devoted to incorporat- ing privacy preserving techniques with data mining algorithms in order to pre- vent the disclosure of sensitive information during the knowledge discovery. 
The existing privacy preserving data mining techniques can be classified ac- cording to the following five different dimensions [32]: (i) data distribution (centralized or distributed); (ii) the modification applied to the data (encryp- tion, perturbation, generalization, and so on) in order to sanitize them; (iii) the data mining algorithm which the privacy preservation technique is designed for; (iv) the data type (single data items or complex data correlations) that needs to be protected from disclosure; (v) the approach adopted for preserving privacy (heuristic or cryptography-based approaches). While heuristic-based techniques are mainly conceived for centralized datasets, cryptography-based algorithms are designed for protecting privacy in a distributed scenario by us- ing encryption techniques. Heuristic-based algorithms recently proposed aim at hiding sensitive raw data by applying perturbation techniques based on prob- ability distributions. Moreover, several heuristic-based approaches for hiding both raw and aggregated data through a hiding technique (k-anonymization, adding noises, data swapping, generalization and sampling) have been devel- oped, first, in the context of association rule mining and classification and, more recently, for clustering techniques. Given the number of different privacy preserving data mining (PPDM) tech- niques that have been developed in these years, there is an emerging need of moving toward standardization in this new research area, as discussed by Oliveira and Zaiane [23]. One step toward this essential process is to provide a quantification approach for PPDM algorithms to make it possible to evaluate and compare such algorithms. However, due to the variety of characteristics of PPDM algorithms, it is often the case that no privacy preserving algorithm ex- ists that outperforms all the others on all possible criteria. Rather, an algorithm may perform better than another one on specific criteria like privacy level, data quality. Therefore, it is important to provide users with a comprehensive set of privacy preserving related metrics which will enable them to select the most appropriate privacy preserving technique for the data at hand, with respect to some specific parameters they are interested in optimizing [6]. For a better understanding of PPDM related metrics, we next identify a proper set of criteria and the related benchmarks for evaluating PPDM algo- rithms. We then adopt these criteria to categorize the metrics. First, we need to be clear with respect to the concept of “privacy” and the general goals of a PPDM algorithm. In our society the privacy term is overloaded, and can, in general, assume a wide range of different meanings. For example, in the A Survey of Quantification of Privacy Preserving Data Mining Algorithms 185 context of the HIPAA1 Privacy Rule, privacy means the individual’s ability to control who has the access to personal health care information. From the orga- nizations point of view, privacy involves the definition of policies stating which information is collected, how it is used, and how customers are informed and involved in this process. Moreover, there are many other definitions of privacy that are generally related with the particular environment in which the privacy has to be guaranteed. What we need is a more generic definition, that can be in- stantiated to different environments and situations. 
From a philosophical point of view, Schoeman [26] and Walters [33] identify three possible definitions of privacy: Privacy as the right of a person to determine which personal information about himself/herself may be communicated to others. Privacy as the control over access to information about oneself. Privacy as limited access to a person and to all the features related to the person. In three definitions, what is interesting from our point of view is the concept of “Controlled Information Release”. From this idea, we argue that a definition of privacy that is more related with our target could be the following: “The right of an individual to be secure from unauthorized disclosure of information about oneself that is contained in an electronic repository”. Performing a final tuning of the definition, we consider privacy as “The right of an entity to be se- cure from unauthorized disclosure of sensible information that are contained in an electronic repository or that can be derived as aggregate and complex information from data stored in an electronic repository”. The last generaliza- tion is due to the fact that the concept of individual privacy does not even exist. As in [23] we consider two main scenarios. The first is the case of a Medical Database where there is the need to pro- vide information about diseases while preserving the patient identity. Another scenario is the classical “Market Basket” database, where the transactions re- lated to different client purchases are stored and from which it is possible to extract some information in form of association rules like “If a client buys a product X, he/she will purchase also Z with y% probability”. The first is an example where individual privacy has to be ensured by protecting from unau- thorized disclosure sensitive information in form of specific data items related to specific individuals. The second one, instead, emphasizes how not only the raw data contained into a database must be protected, but also, in some cases, the high level information that can be derived from non sensible raw data need 1Health Insurance Portability and Accountability Act 186 Privacy-Preserving Data Mining: Models and Algorithms to protected. Such a scenario justifies the final generalization of our privacy definition. In the light of these considerations, it is, now, easy to define which are the main goals a PPDM algorithm should enforce: 1 A PPDM algorithm should have to prevent the discovery of sensible in- formation. 2 It should be resistant to the various data mining techniques. 3 It should not compromise the access and the use of non sensitive data. 4 It should not have an exponential computational complexity. Correspondingly, we identify the following set of criteria based on which a PPDM algorithm can be evaluated. - Privacy level offered by a privacy preserving technique, which indicates how closely the sensitive information, that has been hidden, can still be estimated. - Hiding failure, that is, the portion of sensitive information that is not hidden by the application of a privacy preservation technique; - Data quality after the application of a privacy preserving technique, con- sidered both as the quality of data themselves and the quality of the data mining results after the hiding strategy is applied; - Complexity, that is, the ability of a privacy preserving algorithm to exe- cute with good performance in terms of all the resources implied by the algorithm. 
For the rest of the chapter, we first present details of each criteria through analyzing existing PPDM techniques. Then we discuss how to select proper metric under a specified condition. Finally, we summarize this chapter and outline future research directions. 8.2 Metrics for Quantifying Privacy Level Before presenting different metrics related to privacy level, we need to take into account two aspects: (i) sensitive or private information can be contained in the original dataset; and (ii) private information that can be discovered from data mining results. We refer to the first one as data privacy and the latter as result privacy. 8.2.1 Data Privacy In general, the quantification used to measure data privacy is the degree of uncertainty, according to which original private data can be inferred. The A Survey of Quantification of Privacy Preserving Data Mining Algorithms 187 higher the degree of uncertainty achieved by a PPDM algorithm, the better the data privacy is protected by this PPDM algorithm. For various types of PPDM algorithms, the degree of uncertainty is estimated in different ways. According to the adopted techniques, PPDM algorithms can be classified into two main categories: heuristic-based approaches and cryptography-based ap- proaches. Heuristic-based approaches mainly include four sub-categories: ad- ditive noise, multiplicative noise, k-anonymization, and statistical disclosure control based approaches. In what follows, we survey representative works of each category of PPDM algorithms and review the metrics used by them. Additive-Noise-based Perturbation Techniques. Thebasicideaofthe additive-noise-based perturbation technique is to add random noise to the ac- tual data. In [2], Agrawal and Srikant uses an additive-noise-based technique to perturb data. They then estimate the probability distribution of original numeric data values in order to build a decision tree classifier from perturbed training data. They introduce a quantitative measure to evaluate the amount of privacy offered by a method and evaluate the proposed method against this measure. The privacy is measured by evaluating how closely the original values of a modified attribute can be determined. In particular, if the perturbed value of an attribute can be estimated, with a confidence c, to belong to an interval [a, b], then the privacy is estimated by (b−a) with confidence c. However, this metric does not work well because it does not take into account the distribution of the original data along with the perturbed data. Therefore, a metric that considers all the informative content of data available to the user is needed. Agrawal and Aggarwal [1] address this problem by introducing a new privacy metric based on the concept of information entropy. More specifically, they propose an Ex- pectation Maximization (EM) based algorithm for distribution reconstruction, which converges to the maximum likelihood estimate of the original distrib- ution on the perturbed data. The measurement of privacy given by them con- siders the fact that both the perturbed individual record and the reconstructed distribution are available to the user as well as the perturbing distribution, as it is specified in [10]. This metric defines the average conditional privacy of an attribute A given other information, modeled with a random variable B, as 2h(A|B),whereh(A|B) is the conditional differential entropy of A given B representing a measure of uncertainty inherent in the value of A,giventhe value of B. 
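To make the entropy-based quantification concrete, the following is a minimal sketch (not the EM-based estimator of Agrawal and Aggarwal [1]) that computes the conditional entropy h(A|B) and the resulting privacy level 2^h(A|B) for a small discretized joint distribution. The joint probabilities and value names are purely hypothetical; for continuous attributes the paper uses differential entropy, which this discrete toy example only approximates.

# Illustrative sketch: Shannon entropy and the conditional-entropy-based
# privacy measure 2^h(A|B) over a small, made-up discretized joint
# distribution p(a, b). Not the EM reconstruction algorithm of [1].
from math import log2

# hypothetical joint probabilities p(a, b) for attribute A given side info B
joint = {
    ('low', 'low'): 0.30, ('low', 'high'): 0.10,
    ('high', 'low'): 0.15, ('high', 'high'): 0.45,
}

def entropy(dist):
    """Shannon entropy -sum p log2 p of a probability dictionary."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

def conditional_entropy(joint):
    """h(A|B) = h(A,B) - h(B) for a discrete joint distribution."""
    p_b = {}
    for (a, b), p in joint.items():
        p_b[b] = p_b.get(b, 0.0) + p
    return entropy(joint) - entropy(p_b)

h_ab = conditional_entropy(joint)
print("h(A|B) =", round(h_ab, 4))
print("privacy level 2^h(A|B) =", round(2 ** h_ab, 4))

The higher 2^h(A|B) is, the less the side information B tells an adversary about the perturbed attribute A.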
Another additive-noise-based perturbation technique is by Rivzi and Haritsa [24]. They propose a distortion method to pre-process the data before execut- ing the mining process. Their privacy measure deals with the probability with which the user’s distorted entries can be reconstructed. Their goal is to en- sure privacy at the level of individual entries in each customer tuple. In other words, the authors estimate the probability that a given 1 or 0 in the true matrix 188 Privacy-Preserving Data Mining: Models and Algorithms representing the transactional database can be reconstructed, even if for many applications the 1’s and 0’s values do not need the same level of privacy. Evfimievski et al. [11] propose a framework for mining association rules from transactions consisting of categorical items, where the data has been ran- domized to preserve privacy of individual transactions, while ensuring at the same time that only true associations are mined. They also provide a formal definition of privacy breaches and a class of randomization operators that are much more effective in limiting breaches than uniform randomization. Accord- ing to Definition 4 from [11], an itemset A results in a privacy breach of level ρ if the probability that an item in A belongs to a non randomized transaction, given that A is included in a randomized transaction, is greater than or equal to ρ. In some scenarios, being confident that an item not present in the original transaction may also be considered a privacy breach. In order to evaluate the privacy breaches, the approach taken by Evfimievski et al. is to count the oc- currences of an itemset in a randomized transaction and in its sub-items in the corresponding non randomized transaction. Out of all sub-items of an itemset, the item causing the worst privacy breach is chosen. Then, for each combina- tion of transaction size and itemset size, the worst and the average value of this breach level are computed over all frequent itemsets. The itemset size giving the worst value for each of these two values is selected. Finally, we introduce a universal measure of data privacy level, proposed by Bertino et al. in [6]. The measure is developed based on [1]. The basic concept used by this measure is information entropy, which is defined by Shannon [27]: let X be a random variable which takes on a finite set of values according to a probability distribution p(x). Then, the entropy of this probability distribution is defined as follows: h(X)=− p(x)log2(p(x)) (8.1) or, in the continuous case: h(X)=− f(x)log2(f(x))dx (8.2) where f(x) denotes the density function of the continuous random variable x. Information entropy is a measure of how much “choice” is involved in the selection of an event or how uncertain we are of its outcome. It can be used for quantifying the amount of information associated with a set of data. The concept of “information associated with data” can be useful in the evaluation of the privacy achieved by a PPDM algorithm. Because the entropy represents the information content of a datum, the entropy after data sanitization should be higher than the entropy before the sanitization. Moreover the entropy can be assumed as the evaluation of the uncertain forecast level of an event which in our context is evaluation of the right value of a datum. 
Consequently, the level of privacy inherent in an attribute X, given some information modeled by Y, is defined as follows:

Π(X|Y) = 2^h(X|Y) = 2^( −∫∫ f_{X,Y}(x,y) log₂ f_{X|Y=y}(x) dx dy )   (8.3)

The privacy level defined in equation 8.3 is very general. In order to use it in the different PPDM contexts, it needs to be refined with respect to characteristics such as the type of transactions, the type of aggregation, and the PPDM method. In [6], an example of instantiating the entropy concept to evaluate the privacy level in the context of association rules is presented. However, it is worth noting that the value of the privacy level depends not only on the PPDM algorithm used, but also on the knowledge that an attacker has about the data before the use of data mining techniques, and on the relevance of this knowledge to the data reconstruction operation. This problem is underlined, for example, in [29, 30]. In [6], this aspect is not considered, but it is possible to introduce assumptions on attacker knowledge by properly modeling Y.

Multiplicative-Noise-based Perturbation Techniques. According to [16], additive random noise can be filtered out using certain signal processing techniques with very high accuracy. To avoid this problem, random-projection-based multiplicative perturbation techniques have been proposed in [19]. Instead of adding random values to the actual data, random matrices are used to project the set of original data points onto a randomly chosen lower-dimensional space. The transformed data nevertheless preserve many statistical aggregates of the original dataset, so that certain data mining tasks (e.g., computing the inner product matrix, linear classification, K-means clustering, and computing Euclidean distances) can be performed on the transformed data in a distributed environment (with data either vertically or horizontally partitioned) with small errors. In addition, this approach provides a high degree of privacy for the original data. As analyzed in the paper, even if the random matrix (i.e., the multiplicative noise) is disclosed, it is impossible to recover the exact values of the original dataset, although an approximation of the original data can be found. The variance of the approximated data is used as the privacy measure.

Oliveira and Zaiane [22] also adopt a multiplicative-noise-based perturbation technique to perform clustering analysis while ensuring privacy preservation. They introduce a family of geometric data transformation methods in which a noise vector is applied to distort confidential numerical attributes. The privacy ensured by such techniques is measured as the variance of the difference between the actual and the perturbed values. This measure is given by Var(X − Y), where X represents a single original attribute and Y the distorted attribute. It can be made scale invariant with respect to the variance of X by expressing security as Sec = Var(X − Y)/Var(X).

k-Anonymization Techniques. The concept of k-anonymization is introduced by Samarati and Sweeney in [25, 28]. A database is k-anonymous with respect to quasi-identifier attributes (a set of attributes that can be used with certain external information to identify a specific individual) if there exist at least k transactions in the database having the same values on the quasi-identifier attributes.
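A minimal sketch of checking the k-anonymity property just defined is given below: the level of anonymity of a table is the size of its smallest equivalence class on the quasi-identifier. The table contents and quasi-identifier are hypothetical, already-generalized values.

# Minimal sketch: determine the k-anonymity level of a table as the size of
# the smallest equivalence class on the quasi-identifier. The rows and the
# quasi-identifier below are hypothetical.
from collections import Counter

table = [
    {'age': '[20-30]', 'zip': '537**', 'income': '40k'},
    {'age': '[20-30]', 'zip': '537**', 'income': '50k'},
    {'age': '[31-40]', 'zip': '538**', 'income': '80k'},
    {'age': '[31-40]', 'zip': '538**', 'income': '50k'},
]
quasi_identifier = ('age', 'zip')

def anonymity_level(table, qi):
    """Size of the smallest equivalence class on the quasi-identifier."""
    groups = Counter(tuple(row[a] for a in qi) for row in table)
    return min(groups.values())

k = anonymity_level(table, quasi_identifier)
print(f"table is {k}-anonymous: each quasi-identifier value matches at least {k} records")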
In practice, in order to protect sensitive dataset T, before releasing T to the public, T is converted into a new dataset T ∗ that guarantees the k-anonymity property for a sensible attribute by performing some value generalizations on quasi-identifier attributes. Therefore, the degree of uncertainty of the sensitive attribute is at least 1/k. Statistical-Disclosure-Control-based Techniques. In the context of sta- tistical disclosure control, a large number of methods have been developed to preserve individual privacy when releasing aggregated statistics on data. To anonymize the released statistics from those data items such as person, house- hold and business, which can be used to identify an individual, not only fea- tures described by the statistics but also related information publicly available need to be considered [35]. In [7] a description of the most relevant perturba- tion methods proposed so far is presented. Among these methods specifically designed for continuous data, the following masking techniques are described: additive noise, data distortion by probability distribution, resampling, microag- gregation, rank swapping, etc. For categorical data both perturbative and non- perturbative methods are presented. The top-coding and bottom-coding tech- niques are both applied to ordinal categorical variables; they recode, respec- tively, the first/last p values of a variable into a new category. The global- recoding technique, instead, recodes the p lowest frequency categories into a single one. The privacy level of such method is assessed by using the disclosure risk, that is, the risk that a piece of information be linked to a specific individual. There are several approaches to measure the disclosure risk. One approach is based on the computation of the distance-based record linkage. An intruder is assumed to try to link the masked dataset with the external dataset using the key variables. The distance between records in the original and the masked datasets is computed. A record in the masked dataset is labelled as “linked” or “linked to 2nd nearest” if the nearest or 2nd nearest record in the original dataset turns out to be the corresponding original record. Then the disclosure risk is computed as the percentage of “linked” and “linked to 2nd nearest”. The second approach is based on the computation of the probabilistic record link- age. The linear sum assignment model is used to ‘pair’ records in the original A Survey of Quantification of Privacy Preserving Data Mining Algorithms 191 file and the masked file. The percentage of correctly paired records is a measure of disclosure risk. Another approach computes rank intervals for the records in the masked dataset. The proportion of original values that fall into the interval centered around their corresponding masked value is a measure of disclosure risk. Cryptography-based Techniques. The cryptography-based technique usu- ally guarantees very high level of data privacy. In [14], Kantarcioglu and Clifton address the problem of secure mining of association rules over hori- zontally partitioned data, using cryptographic techniques to minimize the in- formation shared. Their solution is based on the assumption that each party first encrypts its own itemsets using commutative encryption, then the already en- crypted itemsets of every other party. Later on, an initiating party transmits its frequency count, plus a random value, to its neighbor, which adds its frequency count and passes it on to other parties. 
Finally, a secure comparison takes place between the final and initiating parties to determine if the final result is greater than the threshold plus the random value. Another cryptography-based approach is described in [31]. Such approach addresses the problem of association rule mining in vertically partitioned data. In other words, its aim is to determine the item frequency when transactions are split across different sites, without revealing the contents of individual transac- tions. The security of the protocol for computing the scalar product is analyzed. Though cryptography-based techniques can well protect data privacy, they may not be considered good with respect to other metrics like efficiency that will be discussed in later sections. 8.2.2 Result Privacy So far, we have seen privacy metrics related to the data mining process. Many data mining tasks produce aggregate results, such as Bayesian classifiers. Although it is possible to protect sensitive data when a classifier is constructed, can this classifier be used to infer sensitive data values? In other words, do data mining results violate privacy? This issue has been analyzed and a framework is proposed in [15] to test if a classifier C creates an inference channel that could be adopted to infer sensitive data values. The framework considers three types of data: public data (P), accessible to every one including the adversary; private/sensitive data (S), must be protected and unknown to the adversary; unknown data (U), not known to the adversary, but the release of this data might cause privacy violation. The framework as- sumes that S depends only on P and U, and the adversary has at most t data samples of the form (pi,si). The approach to determine whether an inference channel exists is comprised of two steps. First, a classifier C1 is built on the t data samples. To evaluate the impact of C, another classifier C2 is built based 192 Privacy-Preserving Data Mining: Models and Algorithms on the same t data samples plus the classifier C. If the accuracy of C2 is sig- nificantly better than C1, we can say that C provides an inference channel for S. Classifier accuracy is measured based on Bayesian classification error. Sup- pose we have a dataset {x1,...,xn}, and we want to classify xi into m classes labelled as {1,...,m}. Given a classifier C: C: xi → C(xi) ∈{1,...,m},i=1,...,n The classifier accuracy for C is defined as: m j=1 Pr(C(xi) = j|z = j)Pr(z = j) where z is the actual class label of xi. Since cryptography-based PPDM tech- niques usually produce the same results as those mined from the original dataset, analyzing privacy implications from the mining results is particular important to this class of techniques. 8.3 Metrics for Quantifying Hiding Failure The percentage of sensitive information that is still discovered, after the data has been sanitized, gives an estimate of the hiding failure parameter. Most of the developed privacy preserving algorithms are designed with the goal of obtaining zero hiding failure. Thus, they hide all the patterns considered sen- sitive. However, it is well known that the more sensitive information we hide, the more non-sensitive information we miss. Thus, some PPDM algorithms have been recently developed which allow one to choose the amount of sen- sitive data that should be hidden in order to find a balance between privacy and knowledge discovery. 
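Before quantifying hiding failure, the classifier-accuracy measure used by the inference-channel test of Section 8.2.2 can be illustrated with a toy computation. The prediction lists below are hypothetical stand-ins for the outputs of classifiers C1 (built from the (pᵢ, sᵢ) samples alone) and C2 (built with access to the published classifier C); in the framework of [15], a markedly higher accuracy for C2 signals an inference channel.

# Toy sketch of the accuracy measure sum_j Pr(C(x)=j | z=j) Pr(z=j) from
# Section 8.2.2, applied to two hypothetical classifiers C1 and C2.
from collections import Counter

def classifier_accuracy(predictions, true_labels):
    n = len(true_labels)
    class_freq = Counter(true_labels)                       # n * Pr(z = j)
    correct_per_class = Counter(
        z for c, z in zip(predictions, true_labels) if c == z
    )                                                       # counts for Pr(C(x)=j | z=j)
    return sum(
        (correct_per_class[j] / class_freq[j]) * (class_freq[j] / n)
        for j in class_freq
    )

true_s  = ['hi', 'hi', 'lo', 'lo', 'lo', 'hi']
pred_c1 = ['hi', 'lo', 'lo', 'hi', 'lo', 'lo']   # classifier built without C
pred_c2 = ['hi', 'hi', 'lo', 'lo', 'lo', 'lo']   # classifier that also uses C's output
print("accuracy of C1:", round(classifier_accuracy(pred_c1, true_s), 3))
print("accuracy of C2:", round(classifier_accuracy(pred_c2, true_s), 3))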
For example, in [21], Oliveira and Zaiane define the hiding failure (HF) as the percentage of restrictive patterns that are still discovered from the sanitized database. It is measured as follows:

HF = #R_P(D′) / #R_P(D)   (8.4)

where #R_P(D) and #R_P(D′) denote the number of restrictive patterns discovered from the original database D and the sanitized database D′ respectively. Ideally, HF should be 0. In their framework, they allow the specification of a disclosure threshold φ, representing the percentage of sensitive transactions that are not sanitized, which allows one to find a balance between the hiding failure and the number of misses. Note that φ does not control the hiding failure directly, but indirectly, by controlling the proportion of sensitive transactions to be sanitized for each restrictive pattern.

Moreover, as pointed out in [32], it is important not to forget that intruders and data terrorists will try to compromise information by using various data mining algorithms. Therefore, a PPDM algorithm developed against a particular data mining technique, and assuring privacy of information with respect to it, may not attain similar protection against all possible data mining algorithms. In order to provide a complete evaluation of a PPDM algorithm, we need to measure its hiding failure against data mining techniques different from the technique the PPDM algorithm has been designed for. Such an evaluation requires the consideration of a class of data mining algorithms that is significant for our test. Alternatively, a formal framework can be developed such that, by testing a PPDM algorithm against pre-selected datasets, privacy assurance can be transitively proved for the whole class of PPDM algorithms.

8.4 Metrics for Quantifying Data Quality

The main feature of most PPDM algorithms is that they usually modify the database through the insertion of false information or through the blocking of data values in order to hide sensitive information. Such perturbation techniques cause a decrease in data quality. It is obvious that the more changes are made to the database, the less the database reflects the domain of interest. Therefore, data quality metrics are very important in the evaluation of PPDM techniques. Since data is often sold for profit, or shared with others in the hope of leading to innovation, data quality should remain at an acceptable level with respect to the intended data usage. If data quality is too degraded, the released database is useless for the purpose of knowledge extraction. In existing works, several data quality metrics have been proposed that are either generic or data-use-specific. However, currently, there is no metric that is widely accepted by the research community. Here we try to identify a set of possible measures that can be used to evaluate different aspects of data quality.

In evaluating the data quality after the privacy preserving process, it can be useful to assess both the quality of the data resulting from the PPDM process and the quality of the data mining results. The quality of the data themselves can be considered as a general measure evaluating the state of the individual items contained in the database after the enforcement of a privacy preserving technique. The quality of the data mining results evaluates the alteration in the information that is extracted from the database after the privacy preservation process, on the basis of the intended data use.
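Looking back at Section 8.3 for a moment, the hiding-failure measure of equation 8.4 reduces to a small computation over the sets of restrictive patterns mined before and after sanitization. The pattern sets below are hypothetical stand-ins for the output of a frequent-pattern mining algorithm.

# Minimal sketch of hiding failure (equation 8.4): the fraction of restrictive
# patterns that remain discoverable in the sanitized database. All sets are
# hypothetical placeholders for real mining output.
restrictive_patterns = {('beer', 'diapers'), ('drugA', 'diseaseX'), ('salary', 'ageband')}

patterns_in_original  = {('beer', 'diapers'), ('drugA', 'diseaseX'),
                         ('salary', 'ageband'), ('milk', 'bread')}
patterns_in_sanitized = {('beer', 'diapers'), ('milk', 'bread')}

def hiding_failure(restrictive, mined_original, mined_sanitized):
    rp_original  = restrictive & mined_original
    rp_sanitized = restrictive & mined_sanitized
    return len(rp_sanitized) / len(rp_original) if rp_original else 0.0

print("HF =", round(hiding_failure(restrictive_patterns,
                                   patterns_in_original,
                                   patterns_in_sanitized), 3))
# HF = 1/3 for this toy data; an ideal sanitization would give HF = 0.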
8.4.1 Quality of the Data Resulting from the PPDM Process The main problem with data quality is that its evaluation is relative [18], in that it usually depends on the context in which data are used. In particular, there 194 Privacy-Preserving Data Mining: Models and Algorithms are some aspects related to data quality evaluation that are heavily related not only with the PPDM algorithm, but also with the structure of the database, and with the meaning and relevance of the information stored in the database with respect to a well defined context. In the scientific literature data quality is gen- erally considered a multi-dimensional concept that in certain contexts involves both objective and subjective parameters [3, 34]. Among the various possible parameters, the following ones are usually considered the most relevant: - Accuracy: it measures the proximity of a sanitized value to the original value. - Completeness: it evaluates the degree of missed data in the sanitized database. - Consistency: it is related to the internal constraints, that is, the relation- ships that must hold among different fields of a data item or among data items in a database. Accuracy. The accuracy is closely related to the information loss result- ing from the hiding strategy: the less is the information loss, the better is the data quality. This measure largely depends on the specific class of PPDM al- gorithms. In what follows, we discuss how different approaches measure the accuracy. As for heuristic-based techniques, we distinguish the following cases based on the modification technique that is performed for the hiding process. If the algorithm adopts a perturbation or a blocking technique to hide both raw and aggregated data, the information loss can be measured in terms of the dissimi- larity between the original dataset D and the sanitized one D. In [21], Oliveira and Zaiane propose three different methods to measure the dissimilarity be- tween the original and sanitized databases. The first method is based on the difference between the frequency histograms of the original and the sanitized databases. The second method is based on computing the difference between the sizes of the sanitized database and the original one. The third method is based on a comparison between the contents of two databases. A more de- tailed analysis on the definition of dissimilarity is presented by Bertino et al. in [6]. They suggest to use the following formula in the case of transactional dataset perturbation: Diss(D, D)= n i=1 |fD(i) − fD (i)|n i=1 fD(i)(8.5) where i is a data item in the original database D and fD(i) is its frequency within the database, whereas i’ is the given data item after the application of A Survey of Quantification of Privacy Preserving Data Mining Algorithms 195 a privacy preservation technique and fD (i) is its new frequency within the transformed database D. As we can see, the information loss is defined as the ratio between the sum of the absolute errors made in computing the frequen- cies of the items from a sanitized database and the sum of all the frequencies of items in the original database. The formula 8.5 can also be used for the PPDM algorithms which adopt a blocking technique for inserting into the dataset un- certainty about some sensitive data items or their correlations. 
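The dissimilarity measure of equation 8.5 can be computed directly from the item-frequency histograms of the two databases, as in the minimal sketch below; the item counts used here are hypothetical.

# Sketch of the dissimilarity measure in equation 8.5: the total absolute
# change in item frequencies, normalized by the total frequency in the
# original database. The counts are hypothetical.
freq_original  = {'milk': 40, 'bread': 35, 'beer': 20, 'diapers': 18}
freq_sanitized = {'milk': 40, 'bread': 34, 'beer': 12, 'diapers': 10}

def dissimilarity(f_orig, f_san):
    numerator = sum(abs(f_orig[i] - f_san.get(i, 0)) for i in f_orig)
    denominator = sum(f_orig.values())
    return numerator / denominator

print("Diss(D, D') =", round(dissimilarity(freq_original, freq_sanitized), 4))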
The frequency of the item i belonging to the sanitized dataset D is then given by the mean value between the minimum frequency of the data item i, computed by consid- ering all the blocking values associated with it equal to zero, and the maximum frequency, obtained by considering all the question marks equal to one. In case of data swapping, the information loss caused by an heuristic-based algorithm can be evaluated by a parameter measuring the data confusion in- troduced by the value swappings. If there is no correlation among the different database records, the data confusion can be estimated by the percentage of value replacements executed in order to hide specific information. For the multiplicative-noise-based approaches [19], the quality of the per- turbed data depends on the size of the random projection matrix. In general, the error bound of the inner product matrix produce by this perturbation technique is 0 on average and the variance is bounded by the inverse of the dimensionality of the reduced space. In other words, when the dimensionality of the random projection matrix is close to that of the original data, the result of computing the inner product matrix based on the transformed or projected data is also close to the actual value. Since inner product is closely related to many distance-based metrics (e.g., Euclidean distance, cosine angle of two vectors, correlation coef- ficient of two vectors, etc), the analysis on error bound has direct impact on the mining results if these data mining tasks adopt certain distance-based metrics. If the data modification consists of aggregating some data values, the infor- mation loss is given by the loss of detail in the data. Intuitively, in this case, in order to perform the hiding operation, the PPDM algorithms use some type of “Generalization or Aggregation Scheme” that can be ideally modeled as a tree scheme. Each cell modification applied during the sanitization phase using the Generalization tree introduces a data perturbation that reduces the general ac- curacy of the database. As in the case of the k-anonymity algorithm presented in [28], we can use the following formula. Given a database T with NA fields and N transactions, if we identify as generalization scheme a domain general- ization hierarchy GT with a depth h, it is possible to measure the information loss (IL) of a sanitized database T ∗ as: IL(T ∗)= i=NA i=1 i=N j=1 h |GTAi| |T|∗|NA| (8.6) 196 Privacy-Preserving Data Mining: Models and Algorithms where h |GTAi| represent the detail loss for each cell sanitized. For hiding tech- niques based on sampling approach, the quality is obviously related to the size of the considered sample and, more generally, on its features. There are some other precision metrics specifically designed for k- anonymization approaches. One of the earliest data quality metrics is based on the height of generalization hierarchies [25]. The height is the number of times the original data value has been generalized. This metric assumes that a generalization on the data represents an information loss on the original data value. Therefore, data should be generalized as fewer steps as possible to pre- serve maximum utility. However, this metric does not take into account that not every generalization steps are equal in the sense of information loss. Later, Iyengar [13] proposes a general loss metric (LM). Suppose T isadata table with n attributes. 
The LM metric is defined as the average information loss over all data cells of a given dataset:

LM(T∗) = [ Σ_{i=1}^{n} Σ_{j=1}^{|T|} ( f(T∗[i][j]) − 1 ) / ( g(A_i) − 1 ) ] / ( |T| · n )   (8.7)

In equation 8.7, T∗ is the anonymized table of T; f is a function that, given a data cell value T∗[i][j], returns the number of distinct values that can be generalized to T∗[i][j]; and g is a function that, given an attribute A_i, returns the number of distinct values of A_i.

The next metric, the classification metric (CM), is introduced by Iyengar [13] to optimize a k-anonymous dataset for training a classifier. It is defined as the sum of the individual penalties for each row in the table, normalized by the total number of rows N:

CM(T∗) = Σ_{all rows} penalty(row r) / N   (8.8)

The penalty value of row r is 1, i.e., row r is penalized, if it is suppressed or if its class label is not the majority class label of its group. Otherwise, the penalty value of row r is 0. This metric is particularly useful when we want to build a classifier over anonymous data.

Another interesting metric is the discernibility metric (DM) proposed by Bayardo and Agrawal [4]. The discernibility metric assigns a penalty to each tuple based on how many tuples in the transformed dataset are indistinguishable from it. Let t be a tuple from the original table T, and let G_{T∗}(t) be the set of tuples in an anonymized table T∗ indistinguishable from t, that is, the set of tuples in T∗ equivalent to the anonymized value of t. Then DM is defined as follows:

DM(T∗) = Σ_{t∈T} |G_{T∗}(t)|   (8.9)

Note that if a tuple t has been suppressed, the size of G_{T∗}(t) is the same as the size of T∗. In many situations, suppressions are considered the most expensive operations in the sense of information loss. Thus, to maximize data utility, tuple suppression should be avoided whenever possible.

For any given metric M, if M(T) > M(T′), we say that T has a higher information loss, or is less precise, than T′; in other words, the data quality of T is worse than that of T′. Is this true for all metrics? What is a good metric? It is not easy to answer these kinds of questions. As shown in [20], CM works better than LM in classification applications, while LM is better for association rule mining. It is apparent that, to judge how good a particular metric is, we need to associate our judgement with specific applications (e.g., classification, mining association rules). The CM metric and the information gain/privacy loss ratio [5, 28] are more interesting measures of utility because they consider the possible application of the data. Nevertheless, it is unclear what to do if we want to build classifiers on various attributes. In addition, these two metrics only work well if the data are intended to be used for building classifiers. Is there a utility metric that works well for various applications? With this in mind, Kifer [17] proposes a utility measure related to the Kullback-Leibler divergence. In theory, using this measure, better anonymous datasets (for different applications) can be produced. Researchers have measured the utility of the resulting anonymous datasets, and preliminary results show that this metric works well in practical applications.
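The classification metric (8.8) and the discernibility metric (8.9) described above can be computed with a short routine over an anonymized table. The rows below are hypothetical: 'qi' holds the generalized quasi-identifier, 'label' the class, and 'suppressed' marks a fully suppressed tuple.

# Sketch of the classification metric CM (equation 8.8) and the discernibility
# metric DM (equation 8.9) over a hypothetical anonymized table.
from collections import Counter, defaultdict

rows = [
    {'qi': ('[20-30]', 'Bachelor'), 'label': 'Y', 'suppressed': False},
    {'qi': ('[20-30]', 'Bachelor'), 'label': 'N', 'suppressed': False},
    {'qi': ('[20-30]', 'Bachelor'), 'label': 'Y', 'suppressed': False},
    {'qi': ('[31-40]', 'Master'),   'label': 'N', 'suppressed': False},
    {'qi': ('[31-40]', 'Master'),   'label': 'N', 'suppressed': False},
    {'qi': None,                    'label': 'Y', 'suppressed': True},
]

groups = defaultdict(list)
for r in rows:
    groups[r['qi']].append(r)

def discernibility_metric(rows, groups):
    # a suppressed tuple is indistinguishable from every tuple in the table
    return sum(len(rows) if r['suppressed'] else len(groups[r['qi']]) for r in rows)

def classification_metric(rows, groups):
    penalty = 0
    for r in rows:
        if r['suppressed']:
            penalty += 1
            continue
        majority = Counter(g['label'] for g in groups[r['qi']]).most_common(1)[0][0]
        if r['label'] != majority:
            penalty += 1
    return penalty / len(rows)

print("DM =", discernibility_metric(rows, groups))
print("CM =", round(classification_metric(rows, groups), 3))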
For the statistical-based perturbation techniques which aim to hide the values of a confidential attribute, the information loss is basically the lack of precision in estimating the original distribution function of the given attribute. As defined in [1], the information loss incurred in estimating the density function f_X(x) of the attribute X is measured by computing the following value:

I(f_X, f̂_X) = (1/2) E[ ∫_{Ω_X} | f_X(x) − f̂_X(x) | dx ]   (8.10)

that is, half of the expected value of the L1-norm between f_X(x) and f̂_X(x), which are the density distributions before and after the application of the privacy preserving technique, respectively.

When considering the cryptography-based techniques, which are typically employed in distributed environments, we can observe that they do not use any kind of perturbation technique for the purpose of privacy preservation. Instead, they use cryptographic techniques to assure data privacy at each site by limiting the information shared by all the sites. Therefore, the quality of the data stored at each site is not compromised at all.

Completeness and Consistency. While accuracy is a relatively general parameter, in that it can be measured without strong assumptions on the dataset analyzed, completeness is not so general. For example, in some PPDM strategies, e.g. blocking, the completeness evaluation is not significant. Consistency, on the other hand, requires determining all the relationships that are relevant for a given dataset.

In [5], Bertino et al. propose a set of evaluation parameters including completeness and consistency evaluation. Unlike other techniques, their approach takes into account two more important aspects: the relevance of the data and the structure of the database. They provide a formal description that can be used to specify the aggregate information of interest for a target database, together with the relevance of the data quality properties of each piece of aggregate information and of each attribute involved in it. Specifically, the completeness lack (denoted as CML) is measured as follows:

CML = Σ_{i=0}^{n} ( DMG.N_i.CV × DMG.N_i.CW )   (8.11)

In equation 8.11, DMG is an oriented graph where each node N_i is an attribute class, CV is the completeness value, and CW is the associated weight. The consistency lack (denoted as CSL) is given by the number of constraint violations occurring in the sanitized transactions, each multiplied by the weight associated with the violated constraint:

CSL = Σ_{i=0}^{n} ( DMG.SC_i.csv × DMG.SC_i.cw ) + Σ_{j=0}^{m} ( DMG.CC_j.csv × DMG.CC_j.cw )   (8.12)

In equation 8.12, csv indicates the number of violations, cw is the weight of the constraint, SC_i describes a simple constraint class, and CC_j describes a complex constraint class.

8.4.2 Quality of the Data Mining Results

In some situations, it can be useful, and also more relevant, to evaluate the quality of the data mining results after the sanitization process. This kind of metric is strictly related to the use the data are intended for. Data can be analyzed in order to mine information in terms of associations among single data items, or to classify existing data with the goal of finding an accurate classification of new data items, and so on. Based on the intended data use, the information loss is measured with a specific metric, depending each time on the particular type of knowledge model one aims to extract. If the intended data usage is data clustering, the information loss can be measured by the percentage of legitimate data points that are not well-classified after the sanitization process.
As in [22], a misclassification error ME is defined to measure the information loss:

ME = (1/N) Σ_{i=1}^{k} ( |Cluster_i(D)| − |Cluster_i(D′)| )   (8.13)

where N represents the number of points in the original dataset, k is the number of clusters under analysis, and |Cluster_i(D)| and |Cluster_i(D′)| represent the number of legitimate data points of the i-th cluster in the original dataset D and the sanitized dataset D′ respectively. Since a privacy preserving technique usually modifies the data for the sanitization purpose, the parameters involved in the clustering analysis are almost inevitably affected. In order to achieve high clustering quality, it is very important to keep the clustering results as consistent as possible before and after the application of a data hiding technique.

When quantifying information loss in the context of the other data usages, it is useful to distinguish between: lost information, representing the percentage of non-sensitive patterns (i.e., association or classification rules) which are hidden as a side effect of the hiding process; and artifactual information, representing the percentage of artifactual patterns created by the adopted privacy preserving technique. For example, in [21], Oliveira and Zaiane define two metrics, misses cost and artifactual pattern, which correspond to lost information and artifactual information respectively. In particular, misses cost measures the percentage of non-restrictive patterns that are hidden after the sanitization process. This happens when some non-restrictive patterns lose support in the database due to the sanitization process. The misses cost (MC) is computed as follows:

MC = ( #∼R_P(D) − #∼R_P(D′) ) / #∼R_P(D)   (8.14)

where #∼R_P(D) and #∼R_P(D′) denote the number of non-restrictive patterns discovered from the original database D and the sanitized database D′ respectively. In the best case, MC should be 0%. Notice that there is a compromise between the misses cost and the hiding failure in their approach: the more restrictive patterns they hide, the more legitimate patterns they miss. The other metric, artifactual pattern (AP), is measured in terms of the percentage of the discovered patterns that are artifacts:

AP = ( |P′| − |P ∩ P′| ) / |P′|   (8.15)

where P and P′ are the sets of patterns discovered from the original and the sanitized database respectively, and |X| denotes the cardinality of X. According to their experiments, their approach does not produce any artifactual patterns, i.e., AP is always 0.

In the case of association rules, the lost information can be modeled as the set of non-sensitive rules that are accidentally hidden by the privacy preservation technique, referred to as lost rules; the artifactual information, instead, represents the set of new rules, also known as ghost rules, that can be extracted from the database after the application of a sanitization technique. Similarly, if the aim of the mining task is data classification, e.g. by means of decision tree induction, both the lost and the artifactual information can be quantified by means of the corresponding lost and ghost association rules derived from the classification tree. These measures allow one to evaluate the high-level information that is extracted from a database, in the form of the widely used inference rules, before and after the application of a PPDM algorithm. It is worth noting that for most cryptography-based PPDM algorithms, the data mining results are the same as those produced from unsanitized data.
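The result-quality measures of equations 8.13-8.15 can all be computed from simple set and count comparisons, as in the sketch below. The cluster sizes and pattern sets used here are hypothetical placeholders for real mining output.

# Sketch of the result-quality measures in equations 8.13-8.15, applied to
# hypothetical cluster sizes and pattern sets.

def misclassification_error(sizes_original, sizes_sanitized, n_points):
    """ME (8.13): fraction of legitimate points lost per cluster."""
    return sum(o - s for o, s in zip(sizes_original, sizes_sanitized)) / n_points

def misses_cost(nonrestrictive_original, nonrestrictive_sanitized):
    """MC (8.14): fraction of non-restrictive patterns lost by sanitization."""
    return (len(nonrestrictive_original) - len(nonrestrictive_sanitized)) \
        / len(nonrestrictive_original)

def artifactual_patterns(patterns_original, patterns_sanitized):
    """AP (8.15): fraction of discovered patterns that are artifacts (ghosts)."""
    ghost = patterns_sanitized - patterns_original
    return len(ghost) / len(patterns_sanitized)

print("ME =", misclassification_error([50, 30, 20], [47, 28, 20], 100))
print("MC =", misses_cost({'a', 'b', 'c', 'd'}, {'a', 'b', 'c'}))
print("AP =", artifactual_patterns({'a', 'b', 'c'}, {'a', 'b', 'x'}))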
8.5 Complexity Metrics The complexity metric measures the efficiency and scalability of a PPDM algorithm. Efficiency indicates whether the algorithm can be executed with good performance, which is generally assessed in terms of space and time. Space requirements are assessed according to the amount of memory that must be allocated in order to implement the given algorithm. For the evaluation of time requirements, there are several approaches. The first approach is to evaluate the CPU time. For example, in [21], they first keep constant both the size of the database and the set of restrictive patterns, and then increase the size of the input data to measure the CPU time taken by their algorithm. An alternative approach would be to evaluate the time requirements in terms of the computational cost. In this case, it is obvious that an algorithm having a polynomial complexity is more efficient than another one with expo- nential complexity. Sometimes, the time requirements can even be evaluated by counting the average number of operations executed by a PPDM algorithm. As in [14], the performance is measured in terms of the number of encryption and decryption operations required by the specific algorithm. The last two mea- sures, i.e. the computational cost and the average number of operations, do not provide an absolute measure, but they can be considered in order to perform a fast comparison among different algorithms. In case of distributed algorithms, especially the cryptography-based algo- rithms (e.g. [14, 31]), the time requirements can be evaluated in terms of com- munication cost during the exchange of information among secure processing. Specifically, in [14], the communication cost is expressed as the number of messages exchanged among the sites, that are required by the protocol for se- curely counting the frequency of each rule. A Survey of Quantification of Privacy Preserving Data Mining Algorithms 201 Scalability is another important aspect to assess the performance of a PPDM algorithm. In particular, scalability describes the efficiency trends when data sizes increase. Such parameter concerns the increase of both performance and storage requirements as well as the costs of the communications required by a distributed technique with the increase of data sizes. Due to the continuous advances in hardware technology, large amounts of data can now be easily stored. Databases along with data warehouses today store and manage amounts of data which are increasingly large. For this rea- son, a PPDM algorithm has to be designed and implemented with the capability of handling huge datasets that may still keep growing. The less fast is the de- crease in the efficiency of a PPDM algorithm for increasing data dimensions, the better is its scalability. Therefore, the scalability measure is very important in determining practical PPDM techniques. 8.6 How to Select a Proper Metric In previous section, we have discussed various types of metrics. An im- portant question here is “which one among the presented metrics is the most relevant for a given privacy preserving technique?”. Dwork and Nissim [9] make some interesting observations about this ques- tion. In particular, according to them in the case of statistical databases privacy is paramount, whereas in the case of distributed databases for which the privacy is ensured by using a secure multiparty computation technique functionality is of primary importance. 
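A minimal harness for the efficiency and scalability measurements discussed in Section 8.5 is sketched below: it records CPU time for a placeholder sanitization routine as the input size grows. The sanitize() function is a hypothetical stand-in for the PPDM algorithm under test.

# Minimal timing harness for Section 8.5: CPU time of a hypothetical
# placeholder sanitization routine as data size increases.
import random
import time

def sanitize(records):
    # placeholder sanitization: suppress one field per record
    return [dict(r, income='*') for r in records]

for n in (10_000, 100_000, 1_000_000):
    data = [{'age': random.randint(18, 90), 'income': random.randint(20, 200)}
            for _ in range(n)]
    start = time.process_time()
    sanitize(data)
    print(f"n = {n:>9,d}   cpu seconds = {time.process_time() - start:.3f}")

Plotting these measurements against n gives the scalability trend described above: the slower the efficiency degrades as the data grows, the better the algorithm scales.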
Since a real database usually contains a large number of records, the performance guaranteed by a PPDM algorithm, in terms of time and communication requirements, is a not negligible factor, as well as its trend when increasing database size. The data quality guaranteed by a PPDM algo- rithm is, on the other hand, very important when ensuring privacy protection without damaging the data usability from the authorized users. From the above observations, we can see that a trade-off metric may help us to state a unique value measuring the effectiveness of a PPDM algorithm. In [7], the score of a masking method provides a measure of the trade-off be- tween disclosure risk and information loss. It is defined as an average between the ranks of disclosure risk and information loss measures, giving the same importance to both metrics. In [8], a R-U confidentiality map is described that traces the impact on disclosure risk R and data utility U of changes in the parameters of a disclosure limitation method which adopts an additive noise technique. We believe that an index assigning the same importance to both the data quality and the degree of privacy ensured by a PPDM algorithm is quite restrictive, because in some contexts one of these parameters can be more rel- evant than the other. Moreover, in our opinion the other parameters, even less relevant ones, should be also taken into account. The efficiency and scalability 202 Privacy-Preserving Data Mining: Models and Algorithms measures, for instance, could be discriminating factors in choosing among a set of PPDM algorithms that ensure similar degrees of privacy and data utility. A weighted mean could be, thus, a good measure for evaluating by means of a unique value the quality of a PPDM algorithm. 8.7 Conclusion and Research Directions In this chapter, we have surveyed different approaches used in evaluating the effectiveness of privacy preserving data mining algorithms. A set of criteria is identified, which are privacy level, hiding failure, data quality and complexity. As none of the existing PPDM algorithms can outperform all the others with respect to all the criteria, we discussed the importance of certain metrics for each specific type of PPDM algorithms, and also pointed out the goal of a good metric. There are several future research directions along the way of quantifying a PPDM algorithm and its underneath application or data mining task. One is to develop a comprehensive framework according to which various PPDM al- gorithms can be evaluated and compared. It is also important to design good metrics that can better reflect the properties of a PPDM algorithm, and to de- velop benchmark databases for testing all types of PPDM algorithms. References [1] Agrawal, D., Aggarwal, C.C.: On the design and quantification of pri- vacy preserving data mining algorithms. In: Proceedings of the 20th ACM SIGACT-SIGMOD-SIGART Symposium on Principle of Data- base System, pp. 247–255. ACM (2001) [2] Agrawal, R., Srikant, R.: Privacy preserving data mining. In: Proceeed- ings of the ACM SIGMOD Conference of Management of Data, pp. 439–450. ACM (2000) [3] Ballou, D., Pazer, H.: Modelling data and process quality in multi input, multi output information systems. Management science 31(2), 150–162 (1985) [4] Bayardo, R., Agrawal, R.: Data privacy through optimal k- anonymization. In: Proc. of the 21st Int’l Conf. on Data Engineering (2005) [5] Bertino, E., Fovino, I.N.: Information driven evaluation of data hiding algorithms. 
In: 7th Internationa Conference on Data Warehousing and Knowledge Discovery, pp. 418–427 (2005) [6] Bertino, E., Fovino, I.N., Provenza, L.P.: A framework for evaluating pri- vacy preserving data mining algorithms. Data Mining and Knowledge Discovery 11(2), 121–154 (2005) A Survey of Quantification of Privacy Preserving Data Mining Algorithms 203 [7] Domingo-Ferrer, J., Torra, V.: A quantitative comparison of disclosure control methods for microdata. In: L. Zayatz, P. Doyle, J. Theeuwes, J. Lane (eds.) Confidentiality, Disclosure and Data Access: Theory and Practical Applications for Statistical Agencies, pp. 113–134. North- Holland (2002) [8] Duncan, G.T., Keller-McNulty, S.A., Stokes, S.L.: Disclosure risks vs. data utility: The R-U confidentiality map. Tech. Rep. 121, National Insti- tute of Statistical Sciences (2001) [9] Dwork, C., Nissim, K.: Privacy preserving data mining in vertically par- titioned database. In: CRYPTO 2004, vol. 3152, pp. 528–544 (2004) [10] Evfimievski, A.: Randomization in privacy preserving data mining. SIGKDD Explor. Newsl. 4(2), 43–48 (2002) [11] Evfimievski, A., Srikant, R., Agrawal, R., Gehrke, J.: Privacy preserving mining of association rules. In: 8th ACM SIGKDD International Con- ference on Knowledge Discovery and Data Mining, pp. 217–228. ACM- Press (2002) [12] Fung, B.C.M., Wang, K., Yu, P.S.: Top-down specialization for informa- tion and privacy preservation. In: Proceedings of the 21st IEEE Inter- national Conference on Data Engineering (ICDE 2005). Tokyo, Japan (2005) [13] Iyengar, V.: Transforming data to satisfy privacy constraints. In: Proc., the Eigth ACM SIGKDD Int’l Conf. on Knowledge Discovery and Data Mining, pp. 279–288 (2002) [14] Kantarcioglu, M., Clifton, C.: Privacy preserving distributed mining of association rules on horizontally partitioned data. In: ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pp. 24–31 (2002) [15] Kantarcıo˘glu, M., Jin, J., Clifton, C.: When do data mining results violate privacy? In: Proceedings of the 2004 ACM SIGKDD International Con- ference on Knowledge Discovery and Data Mining, pp. 599–604. Seattle, WA (2004). [16] Kargupta, H., Datta, S., Wang, Q., Sivakumar, K.: On the privacy preserv- ing properties of random data perturbation techniques. In: Proceedings of the Third IEEE International Conference on Data Mining (ICDM’03). Melbourne, Florida (2003) [17] Kifer, D., Gehrke, J.: Injecting utility into anonymized datasets. In: Pro- ceedings of the 2006 ACM SIGMOD International Conference on Man- agement of Data, pp. 217–228. ACM Press, Chicago, IL, USA (2006) [18] Kumar Tayi, G., Ballou, D.P.: Examining data quality. Communications of the ACM 41(2), 54–57 (1998) 204 Privacy-Preserving Data Mining: Models and Algorithms [19] Liu, K., Kargupta, H., Ryan, J.: Random projection-based multiplicative data perturbation for privacy preserving distributed data mining 18(1), 92–106 (2006) [20] Nergiz, M.E., Clifton, C.: Thoughts on k-anonymization. In: The Second International Workshop on Privacy Data Management held in conjunction with The 22nd International Conference on Data Engineering. Atlanta, Georgia (2006) [21] Oliveira, S.R.M., Zaiane, O.R.: Privacy preserving frequent itemset min- ing. In: IEEE icdm Workshop on Privacy, Security and Data Mining, vol. 14, pp. 43–54 (2002) [22] Oliveira, S.R.M., Zaiane, O.R.: Privacy preserving clustering by data transformation. In: 18th Brazilian Symposium on Databases (SBBD 2003), pp. 
304–318 (2003) [23] Oliveira, S.R.M., Zaiane, O.R.: Toward standardization in privacy pre- serving data mining. In: ACM SIGKDD 3rd Workshop on Data Mining Standards, pp. 7–17 (2004) [24] Rizvi, S., Haritsa, R.: Maintaining data privacy in association rule mining. In: 28th International Conference on Very Large Databases, pp. 682–693 (2002) [25] Samarati, P.: Protecting respondents’ identities in microdata release. IEEE Transactions on Knowledge and Data Engineering (TKDE) 13(6), 1010–1027 (2001). [26] Schoeman, F.D.: Philosophical Dimensions of Privacy: An Anthology. Cambridge University Press. (1984) [27] Shannon, C.E.: A mathematical theory of communication. Bell System Technical Journal 27, 379–423, 623–656 (1948) [28] Sweeney, L.: Achieving k-anonymity privacy protection using general- ization and suppression. International Journal of Uncertainty, Fuzziness and Knowledge Based Systems 10(5), 571–588 (2002) [29] Trottini, M.: A decision-theoretic approach to data disclosure problems. Research in Official Statistics 4, 7–22 (2001) [30] Trottini, M.: Decision models for data disclosure limitation. Ph.D. thesis, Carnegie Mellon University (2003). [31] Vaidya, J., Clifton, C.: Privacy preserving association rule mining in ver- tically partitioned data. In: 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 639–644. ACM Press (2002) [32] Verykios, V.S., Bertino, E., Nai Fovino, I., Parasiliti, L., Saygin, Y., Theodoridis, Y.: State-of-the-art in privacy preserving data mining. SIG- MOD Record 33(1), 50–57 (2004) A Survey of Quantification of Privacy Preserving Data Mining Algorithms 205 [33] Walters, G.J.: Human Rights in an Information Age: A Philosophical Analysis, chap. 5. University of Toronto Press. (2001) [34] Wang, R.Y., Strong, D.M.: Beyond accuracy: what data quality means to data consumers. Journal of Management Information Systems 12(4), 5–34 (1996) [35] Willenborg, L., De Waal, T.: Elements of statistical disclosure control, Lecture Notes in Statistics, vol. 155. Springer (2001) Chapter 9 A Survey of Utility-based Privacy-Preserving Data Transformation Methods Ming Hua Simon Fraser University School of Computing Science 8888 University Drive, Burnaby, BC, Canada V5A 1S6 mhua@cs.sfu.ca Jian Pei Simon Fraser University School of Computing Science 8888 University Drive, Burnaby, BC, Canada V5A 1S6 jpei@cs.sfu.ca Abstract As a serious concern in data publishing and analysis, privacy preserving data processing has received a lot of attention. Privacy preservation often leads to information loss. Consequently, we want to minimize utility loss as long as the privacy is preserved. In this chapter, we survey the utility-based privacy preser- vation methods systematically. We first briefly discuss the privacy models and utility measures, and then review four recently proposed methods for utility- based privacy preservation. We first introduce the utility-based anonymization method for maximiz- ing the quality of the anonymized data in query answering and discernability. Then we introduce the top-down specialization (TDS) method and the progres- sive disclosure algorithm (PDA) for privacy preservation in classification prob- lems. Last, we introduce the anonymized marginal method, which publishes the anonymized projection of a table to increase the utility and satisfy the privacy requirement. Keywords: Privacy preservation, data utility, utility-based privacy preservation, k-anonymity, sensitive inference, l-diversity. 
208 Privacy-Preserving Data Mining: Models and Algorithms 9.1 Introduction Advanced analysis on data sets containing information about individuals poses a serious threat to individual privacy. Various methods have been pro- posed to tackle the privacy preservation problem in data analysis, such as anonymization and perturbation. The major goal is to protect some sensitive individual information (privacy) from being identified by the published data. For example, in k-anonymization, certain individual information is generalized or suppressed so that any individual in a released data set is indistinguishable from other k − 1 individuals. A natural consequence of privacy preservation is the information loss. For example, after the k-anonymization, the information describing an individual should be the same as at least other k − 1 individuals. The loss of the spe- cific information about certain individuals may affect the data quality. In the extreme case, the data may become totally useless. Example 9.1 (Utility loss in privacy preservation) Table 9.1a is a data set used for customer analysis. Among the listed attributes, {Age, Ed- ucation, Zip Code} can be used to uniquely identify an individual. Such a set of attributes is called a quasi-identifier. Annual Income is a sensitive attribute. Target Customer is the class label of customers. In order to protect the annual income information for individuals, sup- pose 2-anonymity is required so that any individual is indistinguishable from another one on the quasi-identifier. Table 9.2b and 9.3c are both valid 2- anonymizations of 9.1a. The tuples sharing the same quasi-identifier have the same gId. However, Table 9.2b provides more accurate results than Table 9.3c in answering the following two queries. Q1: “How many customers under age 29 are there in the data set?” Q2:“Is an individual with age =25, Education = Bachelor, Zip Code = 53712 a target customer?” According to Table 9.2b, the answers of Q1 and Q2 are “2” and “Y”, re- spectively. But according to Table 9.3c, the answer to Q1 is an interval [0, 4], because 29 falls in the age range of tuple t1,t2,t4, and t6. The answer to Q2 is Y and N with 50% probability each. From this example, we make two observations. First, different anonymiza- tion may lead to different information loss. Table 9.2b and 9.3c are in the same anonymization level, but Table 9.2b provides more accurate answers to the queries. Therefore, it is crucial to minimize the information loss in privacy preservation. Second, the data utility depends on the applications using the data. In the above example, Q1 is an aggregate query, thus the data is more useful if the attribute values are more accurate. Q2 is a classification query, so the utility Utility-based Privacy-Preserving Data Transformation Methods 209 Table 9.1a. The original table tId Age Education Zip Code Annual Income Target Customer t1 24 Bachelor 53711 40k Y t2 25 Bachelor 53712 50k Y t3 30 Master 53713 50k N t4 30 Master 53714 80k N t5 32 Master 53715 50k N t6 32 Doctorate 53716 100k N Table 9.2b. A 2-anonymized table with better utility gId tId Age Education Zip Code Annual Income Target Customer g1 t1 [24-25] Bachelor [53711-53712] 40k Y g1 t2 [24-25] Bachelor [53711-53712] 50k Y g2 t3 30 Master [53713-53714] 50k N g2 t4 30 Master [53713-53714] 80k N g3 t5 32 GradSchool [53715-53716] 50k N g3 t6 32 GradSchool [53715-53716] 100k N Table 9.3c. 
A 2-anonymized table with poorer utility gId tId Age Education Zip Code Annual Income Target Customer g1 t1 [24-30] ANY [53711-53714] 40k Y g2 t2 [25-32] ANY [53712-53716] 50k Y g3 t3 [30-32] Master [53713-53715] 50k N g1 t4 [24-30] ANY [53711-53714] 80k N g3 t5 [30-32] Master [53713-53715] 50k N g2 t6 [25-32] ANY [53712-53716] 100k N of data depends on how much the classification model is preserved in the anonymized data. In a word, utility is the quality of data for the intended use. 9.1.1 What is Utility-based Privacy Preservation? The utility-based privacy preservation has two goals: protecting the private information and preserving the data utility as much as possible. Privacy preser- vation is a hard requirement, that is, it must be satisfied, and utility is the mea- sure to be optimized. While privacy preservation has been extensively studied, the research of utility-based privacy preservation has just started. The chal- lenges include: 210 Privacy-Preserving Data Mining: Models and Algorithms Utility measure. One key issue in the utility-based privacy preservation is how to model the data utility in different applications. A good utility measure should capture the intrinsic factors that affect the quality of data for the specific application. Balance between utility and privacy. In some situation, preserving utility and privacy are not conflicting. But more often than not, hiding the privacy information may have to sacrifice some utility. How do we trade off between the two goals? Efficiency and scalability. The traditional privacy preservation is already computational challenging. For example, even simple restriction of optimized k-anonymity is NP-hard [3]. How do we develop efficient algorithms if utility is involved? Moreover, real data sets often contains millions of high dimen- sional tuples, highly scalable algorithms are needed. Ability to deal with different types of attributes. Real life data often in- volve different types of attributes, such as numerical, categorical, binary or mixtures of these data types. The utility-based privacy preserving methods should be able to deal with attributes of different types. 9.2 Types of Utility-based Privacy Preservation Methods In this section, we introduce some common privacy models and recently proposed data utility measures. 9.2.1 Privacy Models Various privacy models have been proposed in literature. This section intro- duces some of the privacy models that are often used as well as the correspond- ing privacy preserving methods. K-Anonymity. K-anonymity is a privacy model developed for the linking attack [18]. Given a table T with attributes (A1,...,An),aquasi-identifier is a minimal set of attributes (Ai1 ,...,Ail )(1≤ i1 < ... < il ≤ n) in T that can be joined with external information to re-identify individual records. Note that there may be more than one quasi-identifer in a table. AtableT is said k-anonymous given a parameter k and the quasi-identifer QI =(Ai1 ,...,Ail ) if for each tuple t ∈ T, there exist at least another (k−1) tuples t1,...,tk−1 such that those k tuples have the same projection on the quasi-identifier. Tuple t and all other tuples indistinguishable from t on the quasi-identifier form an equivalence class. 
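The effect of generalization on query answering in Example 9.1 can be sketched with a few lines of code: a record whose generalized age interval lies entirely below the query threshold certainly matches, while one whose interval merely overlaps the threshold may match, so the aggregate answer becomes a [min, max] range. The intervals below are illustrative, in the spirit of Tables 9.2b and 9.3c, not the exact published tables.

# Sketch: answering an aggregate query such as Q1 ("how many customers are
# under age 29?") over generalized age intervals. Intervals are hypothetical.
generalized_ages = [(24, 25), (24, 25), (30, 30), (30, 30), (25, 32), (25, 32)]

def count_under(age_intervals, threshold):
    certain  = sum(1 for lo, hi in age_intervals if hi < threshold)
    possible = sum(1 for lo, hi in age_intervals if lo < threshold)
    return certain, possible

lo, hi = count_under(generalized_ages, 29)
print(f"answer to Q1 lies in [{lo}, {hi}]")

The tighter the returned range, the more utility the anonymized table retains for aggregate queries; a coarser anonymization such as Table 9.3c widens it.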
Utility-based Privacy-Preserving Data Transformation Methods 211 Given a table T with the quasi-identifier and a parameter k, the problem of k-anonymization is to compute a view T  that has the same attributes as T such that T  is k-anonymous and as close to T as possible according to some quality metric. Data suppression and value generalization are often used for anonymization. Suppression is masking the attribute value with a special value in the domain. Generalization is replacing a specific value with a more generalized one. For example, the actual age of an individual can be replaced by an interval, or the city of an individual can be replaced by the corresponding province. Cer- tain quality measures are often used in the anonymization, such as the average equivalence class size. Theoretical analysis shows that the problem of opti- mal anonymization under many quality models is NP-hard [1, 14, 3]. Various k-anonymization methods are proposed [19, 20, 29, 12, 11]. One of the most important advantages of k-anonymity is that no additional noise or artificial perturbation is added into the original data. All the tuples in an anonymized data remains trustful. l-Diversity. l-diversity [13] is based on the observation that if the sensi- tive values in one equivalence class lacks diversity, then no matter how large the equivalence class is, attacker may still guess the sensitive value of an indi- vidual with high probability. For example, Table 9.3c is a 2-anonymous table. Particularly, t3 and t5 are generalized into the same equivalence class. How- ever, since their annual income is the same, an attacker can easily conclude that the annual income of t3 is 50k although the 2-anonymity is preserved. Ta- ble 9.2b has better diversity in the sensitive attribute. t3 and t4 are in the same equivalence class and their annual income is different. Therefore, the attacker only have a 50% opportunity to know the real annual income of t3. l-diversity model addresses the above problem. By intuition, a table is l-diverse if each equivalence class contains at least l “well represented” sensitive values, that is, at least l most frequent values have very similar frequencies. Consider a table T =(A1,...,An,S) and constant c and l, where (A1,...,An) is a quasi-identifier and S is a sensitive attribute. Sup- pose an equivalence class EC contains value s1,...,sm with frequency f(s1),...,f(sm)(appearing in the frequency non-ascending order) on sen- sitive attribute S, EC satisfies (c, l)-diversity with respect to S if f(s1) 100k” should not be dis- closed. There are two inference rules {[20 − 30],Bachelor}→“ ≤ 50k”and {Doctorate, Lawyer}→“ > 100k” with high confidence. Table 9.10b is a suppressed table where the confidence of each inference rule is reduced to 50% or below. But the table remains useful for classification. That is, given a tuple t with the same values on attribute Age, Education,andJob as any tuple t in the original table, t receives the same class label as t with a high proba- bility according to Table 9.10b. This is because the class label in Table 9.9a is highly related to attribute Job. As long as the values on Job are disclosed, the classification accuracy is guaranteed. To give more details about the method, we first introduce how to define the privacy requirement using privacy templates, and then discuss the utility measure. Last, we use an example to illustrate the algorithm. Privacy Template. 
To make a table free from sensitive infer- ences, it is required that the confidence of each inference rule is low. Utility-based Privacy-Preserving Data Transformation Methods 225 Table 9.9a. The original table tId Age Education Job AnnualIncome TargetCustomer 1 [20-30] Bachelor Engineer ≤ 50k Y 2 [20-30] Bachelor Artist ≤ 50k N 3 [20-30] Bachelor Lawyer ≤ 50k Y 4 [20-30] Bachelor Artist [50k − 100k]N 5 [20-30] Master Artist [50k − 100k]N 6 [31-40] Master Engineer [50k − 100k]Y 7 [20-30] Doctorate Lawyer > 100k N 8 [31-40] Doctorate Lawyer > 100k Y 9 [31-40] Doctorate Lawyer [50k − 100k]Y 10 [20-30] Doctorate Engineer [50k − 100k]N Table 9.10b. The suppressed table tId Age Education Job AnnualIncome TargetCustomer 1 [20-30] ⊥Edu Engineer ≤ 50k Y 2 [20-30] ⊥Edu Artist ≤ 50k N 3 [20-30] ⊥Edu Lawyer ≤ 50k Y 4 [20-30] ⊥Edu Artist [50k − 100k]N 5 [20-30] Master Artist [50k − 100k]N 6 ⊥Age Master Engineer [50k − 100k]Y 7 [20-30] ⊥Edu Lawyer > 100k N 8 ⊥Age ⊥Edu Lawyer > 100k Y 9 ⊥Age ⊥Edu Lawyer [50k − 100k]Y 10 [20-30] ⊥Edu Engineer [50k − 100k]N Templates can be used to specify such a requirement. Consider table T = (M1,...,Mm,Π1,...,Πn, Θ),whereMj (1 ≤ j ≤ m)isanon-sensitive attribute,Πi (1 ≤ i ≤ n)isasensitive attribute,andΘ is a class label at- tribute. A template is defined as IC → πi,h,whereπi is a sensitive attribute value from sensitive attribute Πi, IC is a set of attributes not containing Πi and called inference channel,andh is a confidence threshold. An inference is an instance of IC → πi,h, which has the form ic → πi,whereic con- tains values from attributes in IC. The confidence of inference ic → πi, denoted by conf(ic → πi), is the percentage of tuples containing both ic and πi among the tuples containing ic.Thatis,conf(ic → πi)= |Ric,πi | |Ric| , where Rv denotes the tuples containing value v. The confidence of a template is defined as the maximum confidence of all the inferences of the template. That is, Conf(IC → πi)=maxconf(ic → πi).TableT satisfies template IC → πi,h if Conf(IC → πi) ≤ h.T satisfies a set of templates if T satisfies each template in the set. 226 Privacy-Preserving Data Mining: Models and Algorithms Progressive Disclosure and Utility Measure. As discussed in Section 9.2, suppression is an efficient method for eliminating sensitive inferences. Con- sider table T =(M1,...,Mm,Π1,...,ΠN, Θ).Mj (1 ≤ j ≤ m)isanon- sensitive attribute and Πi (1 ≤ i ≤ n)isasensitive attribute.Θ is a class label. The suppression of a value on attribute Mj is to replace all occurrences of this value by a special value ⊥j. For each template IC → πi,h not sat- isfied in T, some values in the inference channel IC should be suppressed so that Conf(IC → πi) is reduced to not greater than h. Disclosure is the opposite operation of suppression. Given a suppressed ta- ble T, Supj denotes all values suppressed on attribute Mj.Adisclosure of value v ∈ Supj replaces the special value ⊥j with v in all the tuples that cur- rently contain ⊥j but originally contain v. A disclosure is valid if it does not lead to a template violation. Moreover, a disclosure on attribute Mj is benefi- cial, that is, it increases the information utility for classification, if more than one class is involved in the tuples containing ⊥j. The following utility score measures the benefit of a disclosure quantitatively. 
For each suppressed attribute value v, Score(v) is defined as

Score(v) = InfoGain(v) / (PrivLoss(v) + 1)

where InfoGain(v) is the information gain of disclosing value v and PrivLoss(v) is the privacy loss of disclosing value v, defined as follows.

InfoGain(v). Given a set of tuples S and the set of class labels cls involved in S, the entropy of S is defined as

H(S) = − Σ_{c ∈ cls} ( freq(S, c) / |S| ) × log2( freq(S, c) / |S| )

where freq(S, c) is the number of tuples in S carrying class label c. Given a value v on attribute Mj, the set of tuples containing v is denoted by Rv. Let R⊥j be the set of tuples carrying the suppressed value on Mj before v is disclosed. The information gain of disclosing v is

InfoGain(v) = H(R⊥j) − ( (|Rv| / |R⊥j|) × H(Rv) + (|R⊥j − Rv| / |R⊥j|) × H(R⊥j − Rv) )

PrivLoss(v). Given a value v on attribute Mj, the privacy loss PrivLoss(v) is defined as the average confidence increase over the templates whose inference channels contain Mj:

PrivLoss(v) = AVG_{Mj ∈ IC} { Conf′(IC → πi) − Conf(IC → πi) }

where Conf(IC → πi) and Conf′(IC → πi) are the confidences of the template before and after disclosing v, respectively.

Example 9.8 (Utility score) Consider Table 9.9a. Suppose the privacy templates are

{Age, Education} → “≤ 50k”, 50%
{Education, Job} → “> 100k”, 50%

Suppose that initially all the values on attribute Job are suppressed to ⊥Job. The score of disclosing the value Engineer on Job is calculated as follows.

H(R⊥Job) = −(5/10) × log2(5/10) − (5/10) × log2(5/10) = 1
H(R_Engineer) = −(2/3) × log2(2/3) − (1/3) × log2(1/3) = 0.9149
H(R⊥Job − R_Engineer) = −(3/7) × log2(3/7) − (4/7) × log2(4/7) = 0.9852
InfoGain(Engineer) = H(R⊥Job) − ( (3/10) × H(R_Engineer) + (7/10) × H(R⊥Job − R_Engineer) ) = 0.03589

Before disclosing Engineer:
Conf({Education, Job} → “> 100k”) = conf({⊥Education, ⊥Job} → “> 100k”) = 0.2

After disclosing Engineer:
conf({⊥Education, Engineer} → “> 100k”) = 0
conf({⊥Education, ⊥Job} → “> 100k”) = 0.286
Conf({Education, Job} → “> 100k”) = max{0, 0.286} = 0.286

PrivLoss(Engineer) = 0.286 − 0.2 = 0.086
Score(Engineer) = 0.03589 / (0.086 + 1) = 0.033

The Algorithm. The Progressive Disclosure Algorithm first suppresses all non-sensitive attribute values in the table and then iteratively discloses the attribute values that are helpful for classification without violating the privacy templates. In each iteration, the score of every suppressed value is calculated and the value with the maximum score is disclosed. The iterations terminate when no valid and beneficial disclosure remains. The algorithm is illustrated by the following example.

Example 9.9 (The Progressive Disclosure Algorithm) Consider the following templates on Table 9.9a.

(1) {Age, Education} → “≤ 50k”, 50%
(2) {Education, Job} → “> 100k”, 50%

Initially, the values on attributes Age, Education, and Job are suppressed to ⊥Age, ⊥Education, and ⊥Job, respectively. The candidate values for disclosure are [20−30], [31−40], Bachelor, Master, Doctorate, Engineer, Artist, and Lawyer. To find the most beneficial disclosure, the score of each candidate value is calculated. Since Artist has the maximum score, it is disclosed in this iteration. In the next iteration, the scores of the remaining candidates are updated, and the candidate with the maximum score, [20−30], is disclosed. All remaining valid and beneficial disclosures are executed in the same manner in subsequent iterations. The finally published table is shown in Table 9.10b. Note that in the published table, Bachelor and Doctorate remain suppressed because disclosing them would violate the privacy templates, while [31−40] remains suppressed because disclosing it is not beneficial.
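The score computation of Example 9.8 is straightforward to reproduce in code. The sketch below is an illustrative reimplementation, not the authors' program; the table encoding and helper names are my own, and Education and Job are assumed to be fully suppressed apart from the candidate value Engineer. Evaluating the entropies directly gives H(R_Engineer) ≈ 0.918 rather than the printed 0.9149, so the script reports a score of about 0.032; the difference from Example 9.8's 0.033 is only rounding of intermediate values.

    import math

    # Table 9.9a: (Age, Education, Job, AnnualIncome, TargetCustomer)
    ROWS = [
        ("[20-30]", "Bachelor",  "Engineer", "<=50k",    "Y"),
        ("[20-30]", "Bachelor",  "Artist",   "<=50k",    "N"),
        ("[20-30]", "Bachelor",  "Lawyer",   "<=50k",    "Y"),
        ("[20-30]", "Bachelor",  "Artist",   "50k-100k", "N"),
        ("[20-30]", "Master",    "Artist",   "50k-100k", "N"),
        ("[31-40]", "Master",    "Engineer", "50k-100k", "Y"),
        ("[20-30]", "Doctorate", "Lawyer",   ">100k",    "N"),
        ("[31-40]", "Doctorate", "Lawyer",   ">100k",    "Y"),
        ("[31-40]", "Doctorate", "Lawyer",   "50k-100k", "Y"),
        ("[20-30]", "Doctorate", "Engineer", "50k-100k", "N"),
    ]

    def entropy(rows):
        """H(S) computed over the class label (last column)."""
        counts = {}
        for r in rows:
            counts[r[-1]] = counts.get(r[-1], 0) + 1
        n = len(rows)
        return -sum(c / n * math.log2(c / n) for c in counts.values())

    def info_gain(rows, col, value):
        """InfoGain(value): split the currently suppressed tuples on whether
        they carry `value` in column `col`."""
        with_v = [r for r in rows if r[col] == value]
        without_v = [r for r in rows if r[col] != value]
        n = len(rows)
        return entropy(rows) - (len(with_v) / n * entropy(with_v)
                                + len(without_v) / n * entropy(without_v))

    def max_confidence(rows, disclosed_jobs):
        """Conf({Education, Job} -> ">100k") when Education is fully suppressed
        and only the jobs in `disclosed_jobs` are visible."""
        groups = {}
        for r in rows:
            key = r[2] if r[2] in disclosed_jobs else "BOTTOM_Job"
            hit, total = groups.get(key, (0, 0))
            groups[key] = (hit + (r[3] == ">100k"), total + 1)
        return max(hit / total for hit, total in groups.values())

    gain = info_gain(ROWS, col=2, value="Engineer")                        # ~0.035
    loss = max_confidence(ROWS, {"Engineer"}) - max_confidence(ROWS, set())  # ~0.086
    print(round(gain / (loss + 1), 3))                                       # ~0.032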
9.4.3 Summary and Discussion The top-down specialization (TDS) method and the progressive disclosure algorithm (PDA) are based on the observation that the goals of privacy preser- vation and classification modeling may not be always conflicting. Privacy preservation is to hide the sensitive individual (specific) information, while classification modeling draws the general structure of the data. TDS and PDA try to achieve the “win-win” goal that the specific information hidden for pri- vacy preservation is the information misleading or not useful for classification. Therefore, the quality of the classification model built on the table after using TDS or PDA may be even better than that built on the original table. 9.5 Anonymized Marginal: Injecting Utility into Anonymized Data Sets One drawback of the anonymization method is that after the generalization on quasi-identifiers, the distribution of the more specific data is lost. For ex- ample, consider Table 9.11a and the corresponding 2-anonymous Table 9.12b. After the anonymization, all the values on attribute Age are generalized to the full range in the domain without any specific distribution information. How- ever, if we publish Table 9.13a in addition to Table 9.12b, more information about Age is published and the 2-anonymity is still guaranteed. Table 9.13a is called a marginal on Age. On the other hand, not all marginals preserve privacy. For example, Ta- ble 9.14b satisfies 2-anonymity itself, but if an attacker knows an individual living in 53715 with Doctorate degree is in the original table, he/she may link the information from Table 9.12b and 9.14b together and conclude that the annual income of the individual is 80k. Based on the above observation, [8] models the utility of anonymized tables as how much they preserve the distribution of the original table. It then pro- poses to publish more than one anonymized tables to better approximate the original distribution. Utility-based Privacy-Preserving Data Transformation Methods 229 Table 9.11a. The original table tId Age Education Zip Code AnnualIncome 1 27 Bachelor 53711 40k 2 28 Bachelor 53713 50k 3 27 Master 53715 50k 4 28 Doctorate 53716 80k 5 30 Master 53714 50k 6 30 Doctorate 53712 100k Table 9.12b. The anonymized table gId tId Age Education Zip Code AnnualIncome 1 1 [27-30] Bachelor [53711-53713] 40k 1 2 [27-30] Bachelor [53711-53713] 50k 2 3 [27-30] GradSchool [53715-53716] 40k 2 4 [27-30] GradSchool [53715-53716] 80k 3 5 [27-30] GradSchool [53712-53714] 50k 3 6 [27-30] GradSchool [53712-53714] 100k Table 9.13a. Age Marginal Age Count 27 2 28 2 30 2 Table 9.14b. (Education, AnnualIncome) Marginal Education AnnualIncome Count Bachelor 40k 1 Bachelor 50k 1 Master 50k 2 Doctorate 80k 1 Doctorate 100k 1 Now the problem becomes, which additional anonymized tables should be published and how to check the privacy if more than one anonymized table are released. First of all, we introduce the concept of anonymized marginal and the utility measure to evaluate the quality of a set of anonymized marginals. 9.5.1 Anonymized Marginal Consider a table T =(A1,...,An).{Ai1 ,...,Aim }(1 ≤ i1 < ... < im ≤ n) is a set of attributes in T. A marginal table TAi1 ,...,Aim can be created by the following SQL statement. (Attribute Count is the number of tuples in TAi1 ,...,Aim sharing the same values on Ai1 ,...,Aim ). 
230 Privacy-Preserving Data Mining: Models and Algorithms CREATE TABLE TAi1 ,...,Aim AS (SELECT Ai1 ,...,Aim , COUNT(∗)AS Count FROM T GROUP BY Ai1 ,...,Aim ) The marginal table indicates the distribution of the tuples from T in domain D(Ai1 )×...×D(Aim ),whereD(Ai) is the domain of attribute Ai.Amarginal is anonymized if some of its attribute values are generalized. 9.5.2 Utility Measure Distribution is an intrinsic characteristic of a data set. Many data analysis discover the patterns from data distribution, such as classification which dis- covers the class distribution in a data set. Therefore, whether the distribution of a data set is preserved after anonymization is crucial for the utility of data. In this spirit, a utility measure is defined as the difference between the distribution of the original data and that of the anonymized data. Empirical distribution of the original table. Consider a table T = (A1,...,Am). In the probabilistic view, the tuples in T can be considered as an i.i.d. (identically and independently distributed) sample generated from an underlying distribution F. Reversely, F can be estimated using the empirical distribution FT of T. Given any instance x =(x1,...,xm) in the domain of T, the empirical probability pT(x) is the posteriori probability of x in table T. In other words, pT(x) is the proportion of tuples in T having the same attribute values as x,thatis,pT(x)=|{t|t∈T,t.Ai=xi, 1≤i≤m}| |T| . Maximum entropy probability distribution of anonymized marginals. Similarly, the anonymized marginals of T can be viewed as a set of constraints on the underlying distribution. For example, Age Marginal in Table 9.13a in- dicates that 33.3% of the tuples in Table 9.11a have age 27, 28 and 30, respec- tively. Given a set of constraints, the maximum entropy probability distribution is the distribution that maximizes the entropy among all the probability distri- butions satisfying the constraints. It is often used to estimate the underlying distribution given some constraints. The intuition is that, by maximizing the entropy, the prior knowledge about the distribution is minimized. Consider a table T =(A1,...,Am) and a set of marginals M = {M1,...,Mn}, each marginal Mi =(Ai1 ,...,Aik ,Count)(1≤ i1 < ... < ik ≤ m) contains the projection of T on attribute {Ai1 ,...,Aik } and the count of tuples. A distribution F satisfies Mi if for any instance t in Mi,the Utility-based Privacy-Preserving Data Transformation Methods 231 probability in F satisfies ΠAi1 ,...,Aik x=t p(x)=t.Count |T| where x is an instance from the domain of T and ΠAi1 ,...,Aik x = t means that the projection of x on Ai1 ,...,Aik is the same as t. The above equation means that the projection of distribution F on Ai1 ,...,Aik isthesameasthe empirical distribution of Mi.F satisfies a set of marginals M if F satisfies each marginal Mi in M. The maximum entropy probability distribution FM is the distribution with the maximum entropy in all the distributions satisfying M. Kullback-Leibler divergence (KL-divergence). Suppose the empirical distribution of a table T is F1 and the maximum entropy probability distrib- ution of the anonymized marginals M is F2,theKullback-Leibler divergence (KL-divergence) [9] is used to measure the difference between the two distrib- utions. (Note that KL-divergence is not a metric.) DKL( F1, F2)= i p(1) i log p(1) i p(2) i = H( F1, F2) − H( F1) where p(1) i and p(2) i are the probabilities of an instance from distribution F1 and F2, respectively. 
H( F1) is the entropy of F1, which measures how much effort it needs to identify an instance from distribution F1.H( F1, F2) is the cross-entropy of F1 and F2, which measures the effort needed to identify an instance from distribution F1 and F2. A smaller KL-divergence indicates that the two distributions are more similar. KL-divergence is non-negative and it is minimized when F1 = F2. Given a table T, the entropy H( F1) is constant. Therefore, minimizing DKL( F1, F2) is mathematically equivalent to minimiz- ing H( F1, F2). Therefore, the utility of a set of anonymized marginals M = {M1,...,Mn} can be measured by the KL-divergence between FM and FT. A smaller KL- divergence value indicates better utility of M. 9.5.3 Injecting Utility Using Anonymized Marginals Based on the above utility measure, ideally, we want to search all the pos- sible sets of anonymized marginals and find the one with the minimum KL- divergence. There are two challenges. Calculating the KL-divergence is computational challenging. First, gen- erating all the possible sets of marginals needs exhaustive search. Second, 232 Privacy-Preserving Data Mining: Models and Algorithms finding the optimal k-anonymization for a single marginal is already NP- hard [3]. Third, given a set of constraints, calculating the maximum entropy probability distribution requires iterative algorithms [4, 16], which may be time-consuming. Since there is a close-form algorithm [10] to compute the maximum en- tropy probability distribution on decomposable tables, the anonymized mar- ginal method restricts the search to only including decomposable marginals. The concept of decomposable marginals is derived from the decomposable graphical model [10]. If a set of marginals are decomposable, then they are conditionally independent. Instead of giving the formal definition, we use the following example to illustrate the decomposable marginals and how to calcu- late the maximum entropy probability on decomposable marginals. A B C D Figure 9.3. Interactive graph B C DA B C Figure 9.4. A decomposition Example 9.10 (Decomposable Marginal) Consider a set of mar- ginals M1 =(A, B, C, Count) and M2 =(B,C,D,Count).Wecreatean interactive graph (Figure 9.3) for them by generating a vertex for each at- tribute. An edge between two vertices is created if the corresponding attributes are in the same marginal. M1 and M2 are decomposable because they satisfies the following two conditions: (1) in the corresponding interactive graph, clique BC separates A and D(the two components after the decomposition are shown in Figure 9.4); (2) each maximal clique in the interactive graph is covered by a marginal. An example of non-decomposable marginals is M1 =(A, B, C, Count), M2 =(B,D,Count) and M3 =(C, D, Count). They have the same interac- tive graph as shown in Figure 9.3, but the maximal clique BCD is not covered by any marginal. Therefore, they are not decomposable marginals. A set of decomposable marginals can be viewed as a set of condition- ally independent relations. For example, attributes A and D in marginals M1 =(A, B, C, Count) and M2 =(B,C,D,Count) are independent given attributes BC. The calculation of the maximum entropy probability distribu- tion for decomposable marginals is illustrated in the following example. Example 9.11 (Maximum entropy probability) Consider mar- ginals M1 =(A, B, Count), and M2 =(B,C,Count) of table Utility-based Privacy-Preserving Data Transformation Methods 233 T =(A, B, C).M1 and M2 are decomposable and B separates A and C. 
Therefore, attribute A and C are independent given B. If M1 and M2 are ordinary marginals: The attribute values in ordinary marginals are not generalized. For any instance x =(a, b, c) in the domain of T, the maximum entropy probability of x is p(x)=p(a, b, c) = p(a, c|b)· p(b) = p(a|b)· p(c|b)· p(b) = p(a,b)·p(b,c) p(b) where p(a, b) is the proportion of tuples in M1 having value a and b on at- tribute A and B, respectively. If M1 and M2 are anonymized marginals: Some attribute values in anonymized marginals are generalized. For any instance x =(a, b, c) in the domain of T, suppose a,b,c are the corresponding generalized values in M1 and M2. The maximum entropy probability of x is: p(x)=p(a, b, c)=p(a,b)·p(b,c) p(b)· 1 |Ra |·|Rb |·|Rc| where p(a,b) is the fraction of tuples having value a and b on attribute A and B in M1, respectively. Ra is the set of tuples having value a on A in M1. Since finding all the possible decomposable marginals requires exhaustive search, a search algorithm like genetic algorithm or random walk is needed. Guarantee the privacy. Another challenge is that given a set of marginals {M1,...,Mn}, how to check whether the information obtained from combin- ing {M1,...,Mn} satisfies k-anonymity and l-diversity? The theoretical results in [8] show that in order to check k-anonymity of a set of decomposable marginals {M1,...,Mn}, we only need to check whether each marginal Mi satisfies k-anonymity. But checking whether {M1,...,Mn} satisfies l-diversity is more difficult. We have to join all the marginals together and test whether the joined table satisfies l-diversity. Several propositions help reduce the computation. First, if there is one marginal that violates l-diversity, then the whole set of marginals violate l- diversity. Second, only the marginals containing sensitive attributes need to be joined together to check for l-diversity. Third, if a subset of marginals do not satisfy l-diversity, then the whole set of marginals do not satisfy l-diversity. 9.5.4 Summary and Discussion Anonymized marginal is very effective in improving the utility of the anonymized data. However, searching all the possible decomposable marginals for the optimal solution requires a lot of computation. A simpler yet effective 234 Privacy-Preserving Data Mining: Models and Algorithms method is, given table T, first compute an traditional k-anonymous table T , and then create a set of anonymous marginals M containing single attribute from T. Experimental results [8] show that publishing T  together with M still dramatically decreases the KL-divergence. 9.6 Summary Utility-based privacy preserving methods are attracting more and more at- tention. However, the concept of utility is not new in privacy preservation prob- lems. Utility is often used as one of the criteria for the privacy preserving meth- ods [21] and measures the information loss after using the privacy preservation technique on data sets. Then, what makes the utility-based privacy preservation methods special? Traditional privacy preserving methods often do not make explicit assumptions about the applications where the data are used. Therefore, the utility measure is often very general and thus not so effective. For example, traditionally, in the sensitive inference privacy model, the utility is often considered maximal if the number of suppressed entries is minimized. It is true only for certain applica- tions. 
As a comparison, the utility-based privacy preservation methods target at a class of applications based on the same data utility. Therefore, the devel- oped methods are effective in reducing the information loss for the intended applications while preserving privacy as well. In addition to the four methods discussed in this chapter, there are many applications which utilize some special functions of data. How to extend the utility-based privacy preserving methods to various applications is highly inter- esting. For example, in the data set where ranking queries are usually issued, the utility of data should be measured as how much the dominance relation- ship among tuples is preserved. None of the existing models can handle this problem. Moreover, the utility-based privacy preserving methods can also be extended to other types of data, such as stream data where the temporal char- acteristics are considered more important in analysis. Acknowledgements This work is supported in part by the NSERC Grants 312194-05 and 614067, and an IBM Faculty Award. All opinions, findings, conclusions and recommendations in this paper are those of the authors and do not necessarily reflect the views of the funding agencies. References [1] Charu C. Aggarwal. On k-anonymity and the curse of dimensionality. In Proceedings of the 31st International Conference on Very Large Data Bases, pages 901–909, August 2005. Utility-based Privacy-Preserving Data Transformation Methods 235 [2] Charu C. Aggarwal, Jian Pei, and Bo Zhang. On privacy preserva- tion against adversarial data mining. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 510 – 516. ACM Press, 2006. [3] Roberto J. Bayardo and Rakesh Agrawal. Data privacy through optimal k-anonymization. In Proceedings of the 21st International Conference on Data Engineering (ICDE’05), pages 217 – 228. IEEE Computer Society, 2005. [4] A.L. Berger, S.A. Della-Pietra, and V.J. Della-Pietra. A maximum en- tropy approach to natural language processing. Computational Linguis- tics, 22(1):39–71, 1996. [5] Benjamin C. M. Fung, Ke Wang, and Philip S. Yu. Top-down specializa- tion for information and privacy preservation. In Proceedings of the 21st International Conference on Data Engineering (ICDE’05), volume 00, pages 205 – 216. IEEE Computer Society, 2005. [6] Benjamin C. M. Fung, Ke Wang, and Philip S. Yu. Anonymizing classi- fication data for privacy preservation. IEEE Transactions on Knowledge and Data Engineering, 19(5):711–725, May 2007. [7] Vijay S. Iyengar. Transforming data to satisfy privacy constraints. In Pro- ceedings of the eighth ACM SIGKDD international conference on Knowl- edge discovery and data mining, pages 279 – 288. ACM Press, 2002. [8] Daniel Kifer and Johannes Gehrke. Injecting utility into anonymized datasets. In Proceedings of the 2006 ACM SIGMOD international con- ference on Management of data, pages 217 – 228. ACM Press, 2006. [9] S. Kullback and R. Leibler. On information and sufficiency. Annals of Mathematical Statistics, 22:79–87, 1951. [10] Steffen L. Lauritzen. Graphical Models. Oxford Science Publicatins, 1996. [11] F. Giannotti M. Atzori, F. Bonchi and D. Pedreschi. Blocking anonymity threats raised by frequent itemset mining. In Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM’05), November 2005. [12] F. Giannotti M. Atzori, F. Bonchi and D. Pedreschi. k-anonymous pat- terns. 
In Proceedings of the Ninth European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD’05), volume 3721 of Lecture Notes in Computer Science, Springer, Porto, Portugal, October 2005. [13] Ashwin Machanavajjhala, Johannes Gehrke, Daniel Kifer, and Muthu- ramakrishnan Venkitasubramaniam. l-diversity: Privacy beyond k-anonymity. In Proceedings of the 22nd International Conference on Data Engineering (ICDE’06), page 24, 2006. 236 Privacy-Preserving Data Mining: Models and Algorithms [14] Adam Meyerson and Ryan Williams. On the complexity of optimal k- anonymity. In Proceedings of the Twenty-third ACM SIGACT-SIGMOD- SIGART Symposium on Principles of Database Systems, pages 223–228, June 2004. [15] Stanley R. M. Oliveira and Osmar R. Za¨ıane. Privacy preserving frequent itemset mining. In CRPITS’14: Proceedings of the IEEE international conference on Privacy, security and data mining, pages 43–54, Dar- linghurst, Australia, Australia, 2002. Australian Computer Society, Inc. [16] Adwait Ratnaparkhi. A maximum entropy part-of-speech tagger. In Pro- ceedings of the Conference on Empirical Methods in Natural Language Processing, pages 133–142, University of Pennsylvania, May 1996. ACL. [17] P. Samarati. Protecting respondents’ identities in microdata re- lease. IEEE Transactions on Knowledge and Data Engineering, 13(6): 1010 – 1027, November 2001. [18] Pierangela Samarati and Latanya Sweeney. Generalizing data to provide anonymity when disclosing information. Technical report, March 1998. [19] Latanya Sweeney. Achieving k-Anonymity Privacy Protection Using Generalization and Suppression. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 10(5):571–588, 2002. [20] Latanya Sweeney. k-anonymity: a model for protecting privacy. Int. J. Uncertain. Fuzziness Knowl.-Based Syst., 10(5):557–570, 2002. [21] Vassilios S. Verykios, Elisa Bertino, Igor Nai Fovino, Loredana Parasil- iti Provenza, Yucel Saygin, and Yannis Theodoridis. State-of-the-art in privacy preserving data mining. ACM SIGMOD Record, 33(1):50 – 57, 2004. [22] Vassilios S. Verykios, Ahmed K. Elmagarmid, Elisa Bertino, Yucel Say- gin, and Elena Dasseni. Association rule hiding. IEEE Transactions on Knowledge and Data Engineering, 16(4):434–447, 2004. [23] Ke Wang, Benjamin C. M. Fung, and Philip S. Yu. Template-based pri- vacy preservation in classification problems. In Proceedings of the Fifth IEEE International Conference on Data Mining, pages 466 – 473. IEEE Computer Society, 2005. [24] Ke Wang, Philip S. Yu, and Sourav Chakraborty. Bottom-up general- ization: A data mining solution to privacy protection. In Proceedings of the Fourth IEEE International Conference on Data Mining (ICDM’04), volume 00, pages 249 – 256. IEEE Computer Society, 2004. [25] Xiaokui Xiao and Yufei Tao. m-invariance: Towards privacy preserving re-publication of dynamic datasets. In To appear in ACM Conference on Management of Data (SIGMOD), 2007. Utility-based Privacy-Preserving Data Transformation Methods 237 [26] Xiaokui Xiao and Yufei Tao. Anatomy: simple and effective privacy preservation. In Proceedings of the 32nd international conference on Very large data bases, volume 32, pages 139 – 150. VLDB Endowment, 2006. [27] Jian Xu, Wei Wang, Jian Pei, Xiaoyuan Wang, Baile Shi, and Ada Wai-Chee Fu. Utility-based anonymization for privacy preservation with less information loss. ACM SIGKDD Explorations Newsletter, 8(2):21– 30, December 2006. 
[28] Jian Xu, Wei Wang, Jian Pei, Xiaoyuan Wang, Baile Shi, and Ada Wai- Chee Fu. Utility-based anonymization using local recoding. In Proceed- ings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 785 – 790. ACM Press, 2006. [29] Sheng Zhong, Zhiqiang Yang, and Rebecca N. Wright. Privacy- enhancing k-anonymization of customer data. In Proceedings of the twenty-fourth ACM SIGMOD-SIGACT-SIGART symposium on Princi- ples of database systems(PODS ’05), pages 139–147, New York, NY, USA, 2005. ACM Press. Chapter 10 Mining Association Rules under Privacy Constraints Jayant R. Haritsa Database Systems Lab Indian Institute of Science, Bangalore 560012, INDIA haritsa@dsl.serc.iisc.ernet.in Abstract Data mining services require accurate input data for their results to be mean- ingful, but privacy concerns may impel users to provide spurious information. In this chapter, we study whether users can be encouraged to provide correct information by ensuring that the mining process cannot, with any reasonable de- gree of certainty, violate their privacy. Our analysis is in the context of extracting association rules from large historical databases, a popular mining process that identifies interesting correlations between database attributes. We analyze the various schemes that have been proposed for this purpose with regard to a vari- ety of parameters including the degree of trust, privacy metric, model accuracy and mining efficiency. Keywords: Privacy, data Mining, association rules. 10.1 Introduction The knowledge models produced through data mining techniques are only as good as the accuracy of their input data. One source of data inaccuracy is when users deliberately provide wrong information. This is especially com- mon with regard to customers who are asked to provide personal information on Web forms to e-commerce service providers. The compulsion for doing so may be the (perhaps well-founded) worry that the requested information may be misused by the service provider to harass the customer. As a case in point, consider a pharmaceutical company that asks clients to disclose the diseases they have suffered from in order to investigate the correlations in their occur- rences – for example, “Adult females with malarial infections are also prone to contract tuberculosis”. While the company may be acquiring the data solely for genuine data mining purposes that would eventually reflect itself in better 240 Privacy-Preserving Data Mining: Models and Algorithms service to the client, at the same time the client might worry that if her med- ical records are either inadvertently or deliberately disclosed, it may adversely affect her future employment opportunities. In this chapter, we study whether customers can be encouraged to provide correct information by ensuring that the mining process cannot, with any rea- sonable degree of certainty, violate their privacy, but at the same time produce sufficiently accurate mining results. The difficulty in achieving these goals is that privacy and accuracy are typically contradictory in nature, with the con- sequence that improving one usually incurs a cost in the other [3]. A related issue is the degree of trust that needs to be placed by the users in third-party intermediaries. And finally, from a practical viability perspective, the time and resource overheads imposed on the data mining process due to supporting the privacy requirements. 
Our study is carried out in the context of extracting association rules from large historical databases [7], an extremely popular mining process that identi- fies interesting correlations between database attributes, such as the one de- scribed in the pharmaceutical example. By the end of the chapter, we will attempt to show that the state-of-the-art is such that it is indeed possible to simultaneously achieve all the desirable objectives (i.e. privacy, accuracy, and efficiency) for association rule mining. In the above discussion, and for the most part in this chapter, the focus is on maintaining the confidentiality of the input user data. However, it is also conceivable to think of the complementary aspect of maintaining output se- crecy, that is, the privacy of sensitive association rules that are an outcome of the mining process – a summary discussion on these techniques is included in our coverage of the literature. 10.2 Problem Framework In this section, we describe the framework of the privacy mining problem in the context of association rules. 10.2.1 Database Model We assume that the original (true) database U consists of N records, with each record having M categorical attributes. Note that boolean data is a spe- cial case of this class, and further, that continuous-valued attributes can be converted into categorical attributes by partitioning the domain of the attribute into fixed length intervals. The domain of attribute j is denoted by Sj U, resulting in the domain SU of a record in U being given by SU = M j=1 Sj U. We map the domain SU to the index set IU = {1,...,|SU |}, thereby modeling the database as a set of N Mining Association Rules under Privacy Constraints 241 values from IU. If we denote the ith record of U as Ui,thenU = {Ui}N i=1,Ui ∈ IU. To make this concrete, consider a database U with 3 categorical attributes Age, Sex and Education having the following category values: Age Child, Adult, Senior Sex Male, Female Education Elementary, Graduate For this schema, M =3,S1 U ={Child, Adult, Senior},S2 U ={Male, Female}, S3 U ={Elementary, Graduate},SU = S1 U × S2 U × S3 U, |SU | =12. The domain SU is indexed by the index set IU = {1, ..., 12}, and hence the set of records UU Child Male Elementary Child Male Graduate Child Female Graduate Senior Male Elementary maps to 1 2 4 9 10.2.2 Mining Objective The goal of the data-miner is to compute association rules on the above database. Denoting the set of attributes in the U database by C, an association rule is a (statistical) implication of the form Cx =⇒ Cy,whereCx,Cy ⊂ C and Cx ∩ Cy = φ. A rule Cx =⇒ Cy issaidtohaveasupport (or fre- quency) factor s iff at least s% of the transactions in U satisfy Cx ∪ Cy.A rule Cx =⇒ Cy is satisfied in U with a confidence factor c iff at least c% of the transactions in U that satisfy Cx also satisfy Cy. Both support and confi- dence are fractions in the interval [0,1]. The support is a measure of statistical significance, whereas confidence is a measure of the strength of the rule. A rule is said to be “interesting” if its support and confidence are greater than user-defined thresholds supmin and conmin, respectively, and the objective of the mining process is to find all such interesting rules. It has been shown in [7] that achieving this goal is effectively equivalent to generating all subsets of C that have support greater than supmin – these subsets are called frequent itemsets. 
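To make the support and confidence definitions concrete, here is a small Python sketch (illustrative only; the encoding of transactions as sets of item names is an assumption, not taken from the chapter) that evaluates both measures for a candidate rule.

    def support(db, itemset):
        """Fraction of transactions (sets of items) containing every item of itemset."""
        return sum(itemset <= t for t in db) / len(db)

    def confidence(db, lhs, rhs):
        """conf(lhs => rhs) = support(lhs union rhs) / support(lhs)."""
        return support(db, lhs | rhs) / support(db, lhs)

    # Toy database; items are just strings.
    db = [{"malaria", "tb", "adult"}, {"malaria", "adult"},
          {"malaria", "tb"}, {"flu", "adult"}]
    lhs, rhs = {"malaria"}, {"tb"}
    print(support(db, lhs | rhs))   # 0.5
    print(confidence(db, lhs, rhs)) # ~0.667

With, say, supmin = 0.4 and conmin = 0.6, this toy rule would be reported as interesting.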
Therefore, the mining objective is, in essence, to efficiently discover all frequent itemsets that are present in the database. 10.2.3 Privacy Mechanisms We now move on to considering the various mechanisms through which privacy of the user data could be provided. One approach to address this prob- lem is for the service providers to assure the users that the databases obtained from their information would be anonymized through the variety of techniques 242 Privacy-Preserving Data Mining: Models and Algorithms proposed in the statistical database literature [1, 38], before being supplied to the data miners. For example, the swapping of values between different customer records, as proposed in [17]. Depending on the service provider to guarantee privacy can be referred to as a “B2B (business-to-business)” privacy environment. However, in today’s world, most users are (perhaps justifiably) cynical about such assurances, and it is therefore imperative to demonstrably provide privacy at the point of data collection itself, that is, at the user site. This is referred to as the “B2C (business-to-customer)” privacy environment [47]. Note that in this environment, any technique that requires knowledge of other user records becomes infeasible, and therefore the B2B approaches cannot be applied here. The bulk of the work in privacy-preserving data mining of association rules has addressed the B2C environment (e.g. [2, 9, 19, 34]), where the user’s true data has to be anonymized at the source itself. Note that the anonymization process has to be implemented by a program which could be supplied either by the service provider or, more likely, by an independent trusted third-party vendor. Further, this program has to be verifiably secure – therefore, it must be simple in construction, eliminating the possibility of the true data being surreptitiously supplied to the service provider. In a nutshell, the goal of these techniques is to ensure the privacy of the raw local data at the source, but, at the same time, to support accurate reconstruction of the global data mining models at the destination. Within the above framework, the general approach has been to adopt a data perturbation strategy, wherein each individual user’s true data is altered in some manner before forwarding to the service provider. Here, there are two possibilities: statistical distortion, which has been the predominant tech- nique, and algebraic distortion, proposed in [47]. In the statistical approach, a common randomizing algorithm is employed at all user sites, and this al- gorithm is disclosed to the eventual data miner. For example, in the MASK technique [34], targeted towards “market-basket” type of sparse boolean data- bases, each bit in the true user transaction vector is independently flipped with a parametrized probability. While there is only one-way communication from users to the service provider in the statistical approach, the algebraic scheme, in marked contrast, requires two-way communication between the data miner and the user. Here, the data miner supplies a user-specific perturbation vector, and the user then returns the perturbed data after applying this vector on the true data, discretiz- ing the output and adding some noise. The vector is dependent on the current contents of the perturbed database available with the miner and, for large en- terprises, the data collection process itself could become a bottleneck in the efficient running of the system. 
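The statistical distortion just described is easy to visualize in a few lines of code. The sketch below is a simplified illustration of MASK-style flipping, not the MASK implementation itself: each bit of a boolean transaction vector is flipped independently with a parametrized probability p, and the same p would be disclosed to the miner for support reconstruction.

    import random

    def perturb_transaction(bits, p, rng=random):
        """Flip each 0/1 entry independently with probability p,
        retain it unchanged with probability 1 - p."""
        return [b ^ (rng.random() < p) for b in bits]

    true_txn = [1, 0, 0, 1, 0, 0, 0, 1]   # one sparse market-basket row
    print(perturb_transaction(true_txn, p=0.1))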
Mining Association Rules under Privacy Constraints 243 Within the statistical approach, there are two further possibilities: (a) A sim- ple independent attribute perturbation, wherein the value of each attribute in the user record is perturbed independently of the rest; or (b) A more gener- alized dependent attribute perturbation, where the perturbation of each at- tribute may be affected by the perturbations of the other attributes in the record. Most of the statistical perturbation techniques in the literature, in- cluding [18, 19, 34], fall into the independent attribute perturbation category. Notice, however, that this is in a sense antithetical to the original goal of as- sociation rule mining, which is to identify correlations across attributes.This limitation is addressed in [10], which employs a dependent attribute perturba- tion model, with each attribute in the user’s data vector being perturbed based on its own value as well as the perturbed values of the earlier attributes. Another model of privacy-preserving data mining is the k-anonymity model [35, 2], where each record value is replaced with a corresponding generalized value. Specifically, each perturbed record cannot be distinguished from at least k other records in the data. However, this falls into the B2C model since the intermediate database-forming-server can learn or recover precise records. 10.2.4 Privacy Metric Independent of the specific scheme used to achieve privacy, the end result is that the miner receives as input the perturbed database V and the perturbation technique T used to produce this database. From these inputs, the miner at- tempts to reconstruct the original distribution of the true database U, and mine this reconstructed database to obtain the association rules. Given this frame- work, the general notion of privacy in the association rule mining literature is the level of certainity with which the data miner can reconstruct the true data values of users. The certainity can be evaluated at various levels: Average Privacy. This metric measures the reconstruction probability of a random value in the database. Worst-case Privacy. This metric measures the maximum reconstruction probability across all the values in the database. Re-interrogated Privacy. A common system environment is where the miner does not have access to the perturbed database after the completion of the mining process. But it is also possible to have situations wherein the miner can use the mining output (i.e. the association rules) to subsequently re-interrogate the perturbed database, possibly resulting in reduced privacy. Amplification Privacy. A particularly strong notion of privacy, called “am- plification”, was presented in [18], which guarantees strict limits on privacy 244 Privacy-Preserving Data Mining: Models and Algorithms breaches of individual user information, independent of the distribution of the true data. Here, the property of a data record Ui is denoted by Q(Ui).Forex- ample, consider the following record from the example dataset U discussed earlier: Age Sex Education Child Male Elementary Sample properties of the record include Q1(Ui) ≡ “Age = Child and Sex = Male”,and Q2(Ui) ≡ “Age = Child or Adult”. In this context, the prior probability of a property of a customer’s private in- formation is the likelihood of the property in the absence of any knowledge about the customer’s private information. 
On the other hand, the posterior probability is the likelihood of the property given the perturbed information from the customer and the knowledge of the prior probabilities through recon- struction from the perturbed database. In order to preserve the privacy of some property of a customer’s private information, the posterior probability of that property should not be unduly different to that of the prior probability of the property for the customer. This notion of privacy is quantified in [18] through the following results, where ρ1 and ρ2 denote the prior and posterior probabil- ities, respectively: Privacy Breach: An upward ρ1-to-ρ2 privacy breach exists with respect to property Q if ∃v ∈ SV such that P[Q(Ui)] ≤ ρ1 and P[Q(Ui)|R(Ui)=v] ≥ ρ2. Conversely, a downward ρ2-to-ρ1 privacy breach exists with respect to property Q if ∃v ∈ SV such that P[Q(Ui)] ≥ ρ2 and P[Q(Ui)|R(Ui)=v] ≤ ρ1. Amplification: Let the perturbed database be V = {V1,...,VN}, with do- main SV, and corresponding index set IV. For example, given the sample database U discussed above, and assuming that each attribute is distorted to produce a value within its original domain, the distortion may result in VV 5 7 2 12 which maps to Adult Male Elementary Adult Female Elementary Child Male Graduate Senior Female Graduate Let the probability of an original customer record Ui = u, u ∈ IU being perturbed to a record Vi = v, v ∈ IV be p(u → v),andletA denote the matrix of these transition probabilities, with Avu = p(u → v). Mining Association Rules under Privacy Constraints 245 With the above notation,a randomization operator R(u) ∀u1,u2 ∈ SU: p[u1 → v] p[u2 → v] ≤ γ where γ ≥ 1 and ∃u : p[u → v] > 0. Operator R(u) is at most γ-amplifying if it is at most γ-amplifying for all qualifying v ∈ SV. Breach Prevention: Let R be a randomization operator, v ∈ SV be a random- ized value such that ∃u : p[u → v] > 0,andρ1,ρ2 (0 <ρ1 <ρ2 < 1) be two probabilities as per the above privacy breach definition. Then, if R is at most γ-amplifying for v, revealing “R(u)=v” will cause neither upward (ρ1-to-ρ2) nor downward (ρ2-to-ρ1) privacy breaches with respect to any property if the following condition is satisfied: ρ2(1 − ρ1) ρ1(1 − ρ2) >γ If this situation holds, R is said to support (ρ1,ρ2) privacy guarantees. 10.2.5 Accuracy Metric For association rule mining on a perturbed database, two kinds of errors can occur: Firstly, there may be support errors, where a correctly-identified frequent itemset may be associated with an incorrect support value. Secondly, there may be identity errors, wherein either a genuine frequent itemset is mis- takenly classified as rare, or the converse, where a rare itemset is claimed to be frequent. The Support Error (µ) metric reflects the average relative error (in per- cent) of the reconstructed support values for those itemsets that are correctly identified to be frequent. Denoting the number of frequent itemsets by |F|,the reconstructed support by sup and the actual support by sup, the support error is computed over all frequent itemsets as µ = 1 | F |Σf∈F | supf − supf | supf ∗ 100 The Identity Error (σ) metric, on the other hand, reflects the percentage er- ror in identifying frequent itemsets and has two components: σ+, indicating the percentage of false positives, and σ− indicating the percentage of false negatives. 
Denoting the reconstructed set of frequent itemsets with R and the correct set of frequent itemsets with F, these metrics are computed as σ+ = |R−F | |F | ∗ 100 σ− = |F −R| |F | * 100 Note that in some papers (e.g. [47]), the accuracy metrics are taken to be the worst-case, rather than average-case, versions of the above errors. 246 Privacy-Preserving Data Mining: Models and Algorithms 10.3 Evolution of the Literature From the database perspective, the field of privacy-preserving data mining was catalyzed by the pioneering work of [9]. In that work, developing privacy- preserving data classifiers by adding noise to the record values was proposed and analyzed. This approach was extended in [3] and [26] to address a variety of subtle privacy loopholes. Concurrently, the research community also began to look into extending privacy-preserving techniques to alternative mining patterns such as associ- ation rules, clustering, etc. For association rules, two streams of literature emerged, as mentioned earlier, one looking at providing input data privacy, and the other considering the protection of sensitive output rules. An important point to note here is that unlike the privacy-preserving classifier approaches that were based on adding a noise component to continuous-valued data, the privacy-preserving techniques in association-rule mining are based on proba- bilistic mapping from the domain space to the range space, over categorical atttributes. With regard to input data privacy, the early papers include [34, 19], which proposed the MASK algorithm and the Cut-and-Paste operators, respectively. MASK. In MASK [34], a simple probabilistic distortion of user data, em- ploying random numbers generated from a pre-defined distribution function, was proposed and evaluated in the context of sparse boolean databases, such as those found in “market-baskets”. The distortion technique was simply to flip each 0 or 1 bit with a parametrized probability p, or to retain as is with the com- plementary probability 1−p, and the privacy metric used was average privacy. Through a theoretical and empirical analysis, it was shown that the p parameter could be carefully tuned to simultaneously achieve acceptable average privacy and good accuracy. However, it was also found that mining the distorted database could be or- ders of magnitude more time-consuming as compared to mining the original database. This issue was addressed in a followup work [12] which showed that by generalizing the distortion process to perform symbol-specific dis- tortion (i.e. different flipping probabilities for different values), appropriately chooosing these distortion parameters, and applying a variety of set-theoretic optimizations in the reconstruction process, runtime efficiencies that are well within an order of magnitude of undistorted mining can be achieved. Cut-and-Paste Operator. The notion of a privacy breach was introduced in [19] as the following: The presence of an itemset I in the randomized trans- action causes a privacy breach of level ρ if it is possible to infer, for some Mining Association Rules under Privacy Constraints 247 transaction in the true database, that the probability of some item i occuring in it exceeds rho. With regard to this worst-case privacy metric, a set of randomizing privacy operators were presented and analyzed in [19]. 
The starting point was Uniform Randomization, where each existing item in the true transaction is, with proba- bility p, replaced with a new item not present in the original transaction. (Note that this means that the number of items in the randomized transaction is al- ways equal to the number in the original transaction, and is therefore different from MASK where the number of items in the randomized transaction is usu- ally significantly more than its source since the flipping is done on both the 1’s and the 0’s in the transaction bit vector.) It was then pointed out that a basic deficiency of the uniform randomization approach is that while it might, with a suitable choice of p, be capable of providing acceptable average privacy, its worst case privacy could be significantly weaker. To address this issue, an alternative select-a-size (SaS) randomization oper- ator was proposed, which is composed of the following steps, employed on a per-transaction basis: Step 1: For customer transaction ti of length m, a random integer j from [1,m] is first chosen with probability pm[j]. Step 2: Then, j items are uniformly and randomly selected from the true trans- action and inserted into the randomized transaction. Step 3: Finally, a uniformly and randomly chosen fraction ρm of the remain- ing items in the database that are not present in the true transaction (i.e. C− items in ti), are inserted into the randomized transaction. In short, the final randomized transaction is composed of a subset of true items from the original transaction and additional false items from the complemen- tary set of items in the database. A variant of the SaS operator studied in detail in [19] is the cut-and-paste (C&P) operator. Here, an additional parameter is a cutoff integer, Km, with the integer j being chosen from [1,Km], rather than from [1,m]. If it turns out that j>m,thenj is set to m (which means that the entire original transaction is copied to the randomized transaction). Apart from the cutoff threshold, an- other difference between C&P and SaS is that the subsequent ρm randomized insertion (Step 3 above) is carried out on (a) the items that are not present in the true transaction (as in SaS), and (b) additionally, on the remaining items in the true transaction that were not selected for inclusion in Step 2. An issue in the C&P operator is the optimal selection of the ρm and Km pa- rameters, and combinatorial formulae for determining their values are given in [19]. Through a detailed set of experiments on real-life datasets, it was shown that even with a challenging privacy requirement of not permitting any 248 Privacy-Preserving Data Mining: Models and Algorithms breaches with ρ>50%, mining a C&P-randomized database was able to cor- rectly identify around 80 to 90% of the “short” frequent itemsets, that is fre- quent itemsets of lengths upto 3. The issue of how to safely randomize and mine long transactions was left as an open problem, since directly using C&P in such environments could result in unacceptably poor accuracy. The above work was significantly extended in [18] through, as discussed in Section 10.2.4, the formulation of strict amplification-based privacy metrics and delineating a methodology for limiting the associated privacy breaches. Distributed Databases. Maintaining input data privacy was also consid- ered in [41, 25] in the context of databases that are distributed across a number of sites with each site only willing to share data mining results, but not the source data. 
While [41] considered data that is vertically partitioned (i.e., each site hosts a disjoint subset of the matrix columns), the complementary situa- tion where the data is horizontally partitioned (i.e., each site hosts a disjoint subset of the matrix rows) is addressed in [25]. The solution technique in [41] requires generating and computing a large set of independent linear equations – in fact, the number of equations and the number of terms in each equation is proportional to the cardinality of the database. It may therefore prove to be expensive for market-basket databases which typically contain millions of customer transactions. In [25], on the other hand, the problem is modeled as a secure multi-party computation [23] and an algorithm that minimizes the in- formation shared without incurring much overhead on the mining process is presented. Note that in these formulations, a pre-existing true database at each site is assumed, i.e. a B2B model. Algebraic Distortion. Then, in [47], an algebraic-distortion mechanism was presented that unlike the statistical approach of the prior literature, requires two-way communication between the miner and the users. If Vc is the current perturbed database, then Ek is computed by the miner, which corresponds to the eigenvectors corresponding to the largest k eigenvalues of VcTVc,where VcT is the transpose of Vc. The choice of k makes a tradeoff between privacy and accuracy – large values of k give more accuracy and less privacy, while small values provide higher privacy and less accuracy. Ek is supplied to the user, who then uses it on her true transaction vector, discretizes the output, and then adds a noise component. Their privacy metric is rather different, in that they evaluate the level of privacy by measuring the probability of an “unwanted” item to be included in the perturbed transaction. The definition of unwanted here is that it is an item that does not contribute to association rule mining in the sense that it does not appear in any frequent itemset. An implication is that privacy esti- mates can be conditional on the choices of association rule mining parameters Mining Association Rules under Privacy Constraints 249 (supmin,conmin). This may encourage the miner to experiment with a variety of values in order to maximize the breach of privacy. Output Rule Privacy. We now turn our attention to the issue of maintaining the privacy of output rules. That is, we would like to alter the original database in a manner such that only the association rules deemed to be sensitive by the owner of the data source cannot be identified through the mining process. The proposed solutions involve either falsifying some of the entries in the true database or replacing them with null values. Note that, by definition, these techniques require a completely materialized true database as the starting point, in contrast to the B2C techniques for input data privacy. In [13], the process of transforming the database to hide sensitive rules is termed as “sanitization”, and in practical terms, this requires reducing either the support or the confidence of the sensitive rules to below the supmin or conmin thresholds. Specifically, using R to refer to the set of all rules, and S to refer to the set of sensitive rules, the goal is to hide all the S rules by reducing the supports or confidences, and simultaneously minimize the number of rules in R − S that may also become hidden as a side-effect of the sanitization process. 
Output Rule Privacy. We now turn our attention to the issue of maintaining the privacy of output rules. That is, we would like to alter the original database in such a manner that only the association rules deemed sensitive by the owner of the data source cannot be identified through the mining process. The proposed solutions involve either falsifying some of the entries in the true database or replacing them with null values. Note that, by definition, these techniques require a completely materialized true database as the starting point, in contrast to the B2C techniques for input data privacy.

In [13], the process of transforming the database to hide sensitive rules is termed "sanitization", and in practical terms this requires reducing either the support or the confidence of the sensitive rules to below the sup_min or con_min thresholds. Specifically, using R to refer to the set of all rules and S to refer to the set of sensitive rules, the goal is to hide all the S rules by reducing their supports or confidences, while simultaneously minimizing the number of rules in R − S that become hidden as a side-effect of the sanitization process. (Note that the objective is only to maintain the visibility of the rules in R − S; the specific supports or confidences obtained by the miner for the R − S rules may be altered if required. That is, it would be perfectly acceptable for the database to be sanitized such that a rule with high support or confidence in R − S became a rule that is just above the threshold in the sanitized database.)

The sanitization can be achieved in different ways: 1) by changing the values of individual entries in the database; or 2) by removing entire transactions from the database. It was shown in the initial work of [13], which only considered the lowering of support values, that, irrespective of the sanitization approach, finding the optimal (w.r.t. minimizing the impact on R − S) sanitization is an NP-hard problem (through reduction from the Hitting Set problem [21]). A greedy heuristic was therefore suggested, in which the S set is ordered in decreasing order of support and each element of the ordered set is then hidden in an iterative fashion. The hiding is done by performing a greedy search through the ancestors of the itemset, selecting at each level the parent with the maximum support and setting the selected parent as the new itemset that needs to be hidden. At the end of this process, a frequent item has been selected. The algorithm then searches through the common list of transactions that support both the selected item and the initial frequent itemset to be hidden, in order to identify the transaction that affects the minimum number of 2-itemsets. After this transaction is identified, the selected frequent item is removed from it. The effects of this database alteration are propagated to the other itemset elements, and the process repeats until the itemset is hidden.

The above work was extended in [15] to achieve hiding by also using the confidence criterion. Unlike the purely support-based hiding approach, where only 1's are converted to 0's, hiding through the confidence criterion can be achieved by converting 0's into 1's. However, an associated danger is that there can now be false positives, that is, infrequent rules may be incorrectly promoted into the frequent category. A detailed treatment of this issue is presented in [44].

An alternative approach for output rule privacy, proposed in [37, 36], is to use the concept of "data blocking", wherein some values in the database are replaced with NULLs signifying unknowns. In this framework, the notions of itemset support and confidence are converted into intervals, with the actual support and confidence lying within these intervals. For example, the minimum support of an itemset C_x is the percentage of transactions that have 1's for this itemset, while the maximum possible support is the percentage of transactions that contain either 1 or NULL for this itemset. Greedy algorithms for implementing the hiding are presented, and a discussion of their effectiveness is provided in [36]. More recently, decision-theoretic approaches based on data blocking have been presented in [30, 22], which also utilize the "border theory" of frequent itemsets [40]; however, these approaches can be computationally demanding.
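The interval semantics of blocking can be made concrete with a small sketch, assuming transactions are encoded as dictionaries whose entries are 1, 0, or None (standing for NULL); the function name and this encoding are illustrative.

def support_interval(database, itemset):
    # database: list of rows, each a dict mapping item -> 1, 0, or None (NULL/unknown)
    # itemset: collection of items whose joint support interval is required
    n = len(database)
    # Minimum support counts only transactions with definite 1's for every item;
    # maximum support additionally treats NULLs as potential 1's.
    min_count = sum(all(row[i] == 1 for i in itemset) for row in database)
    max_count = sum(all(row[i] in (1, None) for i in itemset) for row in database)
    return min_count / n, max_count / n

Under this interval view, an itemset is reliably hidden only if even its maximum possible support falls below sup_min.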
The rule-hiding techniques have limitations in that (a) they crucially depend on the data miner processing the database only with the specified support and confidence levels, which may be hard to ensure in practice; (b) they may introduce significant false positives and false negatives in the non-sensitive set of rules; (c) they may introduce significant changes in the supports and confidences of the non-sensitive set of rules; and (d) in the case of data blocking, it may sometimes be possible to infer the hidden rules by assigning values to the null attributes.

Frameworks. A common trend in the input data privacy literature was to propose specific perturbation techniques, which were then analyzed for their privacy and accuracy properties. Recently, in [10], the problem was approached from a different perspective, wherein a generalized matrix-theoretic framework (called FRAPP) that facilitates a systematic approach to the design of random perturbation schemes for privacy-preserving mining was proposed. This framework supports amplification-based privacy, and its execution and memory overheads are comparable to those of classical mining on the true database. The distinguishing feature of FRAPP is its quantitative characterization of the sources of error in the random data perturbation and model reconstruction processes. In fact, although it uses dependent attribute perturbation, it is fully decomposable into the perturbation of individual attributes, and hence has the same run-time complexity as any independent perturbation method. Through the framework, many of the earlier techniques are cast as special instances of the FRAPP perturbation matrix. More importantly, it was shown that through appropriate choices of matrix elements, new perturbation techniques can be constructed that provide highly accurate mining results even under strict amplification-based [18] privacy guarantees. In fact, a perturbation matrix with provably minimal condition number (in the class of symmetric positive-definite matrices; refer Section 10.4.2.1) was identified, substantially improving the accuracy under the given constraints. Finally, an efficient integration of this optimal matrix with the association mining process was outlined.

10.4 The FRAPP Framework

In the remainder of this chapter, we present, as a representative example, the salient details of FRAPP and discuss how it simultaneously provides strong privacy, high accuracy and good efficiency in a B2C privacy-preserving environment for mining association rules.

As mentioned earlier, let the probability of an original customer record U_i = u, u ∈ S_U, being perturbed to a record V_i = v, v ∈ S_V, be p(u → v), and let A denote the matrix of these transition probabilities, with A_{vu} = p(u → v). This random process maps to a Markov process, and the perturbation matrix A should therefore satisfy the following properties [39]:

A_{vu} \ge 0 \;\;\text{and}\;\; \sum_{v \in S_V} A_{vu} = 1, \qquad \forall u \in S_U, \; v \in S_V        (10.1)

Due to the constraints imposed by Equation 10.1, the domain of A is a subset of R^{|S_V| \times |S_U|}. This domain is further restricted by the choice of perturbation method. For example, for the MASK technique [34], all the entries of matrix A are determined by the choice of a single parameter, namely, the flipping probability.
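As a concrete illustration of this perturbation model, the sketch below checks the column-stochasticity required by Equation 10.1 and perturbs a collection of records by sampling from the corresponding column of A; encoding records as integer indices into S_U is an assumption made for illustration.

import numpy as np

def perturb_records(records, A, rng=None):
    # records: array of original values, each an index u into the domain S_U
    # A: |S_V| x |S_U| matrix with A[v, u] = p(u -> v); each column must sum to 1
    rng = np.random.default_rng() if rng is None else rng
    assert np.all(A >= 0) and np.allclose(A.sum(axis=0), 1.0)  # Equation 10.1
    n_v = A.shape[0]
    # Each record u is independently mapped to a perturbed value v with probability A[v, u].
    return np.array([rng.choice(n_v, p=A[:, u]) for u in records])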
We now explore the preferred choices of A that simultaneously achieve privacy guarantees and high accuracy, without restricting ab initio to a particular perturbation method.

From the previously-mentioned results of [18], the following condition on the perturbation matrix A, in order to support (ρ_1, ρ_2) privacy, can be derived:

\frac{A_{vu_1}}{A_{vu_2}} \le \gamma < \frac{\rho_2 (1 - \rho_1)}{\rho_1 (1 - \rho_2)}, \qquad \forall u_1, u_2 \in S_U, \; \forall v \in S_V        (10.2)

That is, the choice of perturbation matrix A should follow the restriction that the ratio of any two matrix entries in a row should not be more than γ.

10.4.1 Reconstruction Model

We now analyze how the distribution of the original database is reconstructed from the perturbed database. As per the perturbation model, a client C_i with data record U_i = u, u ∈ S_U, generates record V_i = v, v ∈ S_V, with probability p(u → v). This event of generation of v can be viewed as a Bernoulli trial with success probability p(u → v). If the outcome of the ith Bernoulli trial is denoted by the random variable Y_v^i, the total number of successes Y_v in N trials is given by the sum of the N Bernoulli random variables:

Y_v = \sum_{i=1}^{N} Y_v^i        (10.3)

That is, the total number of records with value v in the perturbed database is given by Y_v. Note that Y_v is the sum of N independent but non-identical Bernoulli trials. The trials are non-identical because the probability of success varies from trial i to trial j, depending on the values of U_i and U_j, respectively. The distribution of such a random variable Y_v is known as the Poisson-Binomial distribution [45].

From Equation 10.3, the expectation of Y_v is given by

E(Y_v) = \sum_{i=1}^{N} E(Y_v^i) = \sum_{i=1}^{N} P(Y_v^i = 1)        (10.4)

Using X_u to denote the number of records with value u in the original database, and noting that P(Y_v^i = 1) = p(u → v) = A_{vu} for U_i = u, results in

E(Y_v) = \sum_{u \in S_U} A_{vu} X_u        (10.5)

Let X = [X_1 \, X_2 \cdots X_{|S_U|}]^T and Y = [Y_1 \, Y_2 \cdots Y_{|S_V|}]^T. Then the following expression is obtained from Equation 10.5:

E(Y) = A X        (10.6)

At first glance, it may appear that X, the distribution of records in the original database (and the objective of the reconstruction exercise), can be obtained directly from the above equation. However, an immediate difficulty is that the data miner does not possess E(Y), but only a specific instance of Y, with which she has to approximate E(Y). (If multiple distorted versions are provided, then E(Y) is approximated by the observed average of these versions.) Therefore, the following approximation to Equation 10.6 is resorted to:

Y = A \hat{X}        (10.7)

where \hat{X} is the estimate of X. This is a system of |S_V| equations in |S_U| unknowns, and for the system to be uniquely solvable, a necessary condition is that the space of the perturbed database is a superset of that of the original database (i.e., |S_V| \ge |S_U|). Further, if the inverse of matrix A exists, the solution of this system of equations is given by

\hat{X} = A^{-1} Y        (10.8)

providing the desired estimate of the distribution of records in the original database. Note that this estimation is unbiased because E(\hat{X}) = A^{-1} E(Y) = X.
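A minimal sketch of this reconstruction step, assuming a square and invertible A together with a vector Y of observed perturbed counts, is given below; a linear solve is used instead of explicitly forming A^{-1}.

import numpy as np

def reconstruct_distribution(A, Y):
    # A: |S_V| x |S_U| perturbation matrix (square and invertible assumed here)
    # Y: observed counts of each perturbed value v in the randomized database
    # Solves A @ X_hat = Y, i.e. Equation 10.8, without computing the inverse explicitly.
    return np.linalg.solve(A, Y)

# Example with a two-value domain and a symmetric perturbation matrix (illustrative numbers):
A = np.array([[0.7, 0.3],
              [0.3, 0.7]])
Y = np.array([580.0, 420.0])             # perturbed counts observed by the miner
X_hat = reconstruct_distribution(A, Y)   # estimated original counts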
10.4.2 Estimation Error

To analyze the error in the above estimation process, the following well-known theorem from linear algebra applies [39]:

Theorem 10.1 Given an equation of the form Ax = b, where the measurement of b is inexact, the relative error in the solution x = A^{-1} b satisfies

\frac{\| \delta x \|}{\| x \|} \le c \, \frac{\| \delta b \|}{\| b \|}

where c is the condition number of matrix A.

For a positive-definite matrix, c = \lambda_{max} / \lambda_{min}, where \lambda_{max} and \lambda_{min} are the maximum and minimum eigenvalues of matrix A, respectively. Informally, the condition number is a measure of the sensitivity of a matrix to numerical operations. Matrices with condition numbers near one are said to be well-conditioned, i.e., stable, whereas those with condition numbers much greater than one (e.g., about 10^5 for a 5 × 5 Hilbert matrix [39]) are said to be ill-conditioned, i.e., highly sensitive.

Equations 10.6 and 10.8, coupled with Theorem 10.1, result in

\frac{\| \hat{X} - X \|}{\| X \|} \le c \, \frac{\| Y - E(Y) \|}{\| E(Y) \|}        (10.9)

which means that the error in estimation arises from two sources: first, the sensitivity of the problem, indicated by the condition number of matrix A; and second, the deviation of Y from its mean, i.e., the deviation of the perturbed database counts from their expected values, indicated by the variance of Y. In the remainder of this sub-section, we determine how to reduce this error by (a) appropriate choice of the perturbation matrix to minimize the condition number, and (b) identifying the minimum size of the database required to (probabilistically) bound the deviation within a desired threshold.

10.4.2.1 Minimizing the Condition Number. The perturbation techniques proposed in the literature primarily differ in their choices for the perturbation matrix A. For example:

MASK [34] uses a matrix A with

A_{vu} = p^{k} (1 - p)^{M_b - k}        (10.10)

where M_b is the number of boolean attributes when each categorical attribute j is converted into |S_U^j| boolean attributes, (1 − p) is the bit flipping probability for each boolean attribute, and k is the number of attributes with matching bits between the perturbed value v and the original value u.

The cut-and-paste (C&P) randomization operator [19] employs a matrix A with

A_{vu} = \sum_{z=0}^{M} p_M[z] \cdot \sum_{q = \max\{0,\, z + l_u - M,\, l_u + l_v - M_b\}}^{\min\{z,\, l_u,\, l_v\}} \frac{{}^{l_u}C_{q} \; {}^{M - l_u}C_{z - q}}{{}^{M}C_{z}} \cdot {}^{M_b - l_u}C_{l_v - q} \; \rho^{(l_v - q)} (1 - \rho)^{(M_b - l_u - l_v + q)}        (10.11)

where

p_M[z] = \sum_{w=0}^{\min\{K, z\}} {}^{M - w}C_{z - w} \; \rho^{(z - w)} (1 - \rho)^{(M - z)} \cdot \Big\{ \; 1 - M/(K + 1) \ \text{if } w = M \ \& \ w