TY - JOUR
T1 - Real-world K-Anonymity applications
T2 - The KGEN approach and its evaluation in fraudulent transactions
AU - De Pascale, Daniel
AU - Cascavilla, Giuseppe
AU - Tamburri, Damian A.
AU - Van Den Heuvel, Willem-Jan
PY - 2023/5
Y1 - 2023/5
N2 - K-anonymity is a property for the measurement, management, and governance of data anonymization. Many implementations of K-anonymity have been described in the state of the art, but most of them are not practically usable over a large number of attributes in a “Big” dataset, i.e., a dataset drawn from Big Data. To address this significant shortcoming, we introduce and evaluate KGEN, an approach to K-anonymity that uses meta-heuristics, specifically Genetic Algorithms, to compute a permutation of the dataset that is both K-anonymized and still usable for further processing, e.g., for private-by-design analytics. KGEN adopts a meta-heuristic approach because it can find a pseudo-optimal solution in a reasonable time over a considerable input load. KGEN allows the data manager to guarantee a high anonymity level while preserving usability and preventing loss of information entropy in the data. Unlike other approaches, which provide globally optimal solutions suited to smaller datasets, KGEN also works on Big datasets while still providing a good-enough K-anonymized dataset that remains processable. Evaluation results show that our approach works efficiently on a real-world dataset, provided by the Dutch Tax Authority, with 47 attributes (i.e., the columns of the dataset to be anonymized) and more than 1.5K observations (i.e., the rows of that dataset), as well as on a dataset with 97 attributes and more than 3942 observations.
AB - K-anonymity is a property for the measurement, management, and governance of data anonymization. Many implementations of K-anonymity have been described in the state of the art, but most of them are not practically usable over a large number of attributes in a “Big” dataset, i.e., a dataset drawn from Big Data. To address this significant shortcoming, we introduce and evaluate KGEN, an approach to K-anonymity that uses meta-heuristics, specifically Genetic Algorithms, to compute a permutation of the dataset that is both K-anonymized and still usable for further processing, e.g., for private-by-design analytics. KGEN adopts a meta-heuristic approach because it can find a pseudo-optimal solution in a reasonable time over a considerable input load. KGEN allows the data manager to guarantee a high anonymity level while preserving usability and preventing loss of information entropy in the data. Unlike other approaches, which provide globally optimal solutions suited to smaller datasets, KGEN also works on Big datasets while still providing a good-enough K-anonymized dataset that remains processable. Evaluation results show that our approach works efficiently on a real-world dataset, provided by the Dutch Tax Authority, with 47 attributes (i.e., the columns of the dataset to be anonymized) and more than 1.5K observations (i.e., the rows of that dataset), as well as on a dataset with 97 attributes and more than 3942 observations.
KW - Big data
KW - Data-intensive applications design & operations
KW - K-Anonymity
KW - Privacy-by-design
KW - Scalability
UR - http://www.scopus.com/inward/record.url?scp=85148330364&partnerID=8YFLogxK
U2 - 10.1016/j.is.2023.102193
DO - 10.1016/j.is.2023.102193
M3 - Article
AN - SCOPUS:85148330364
SN - 0306-4379
VL - 115
JO - Information Systems
JF - Information Systems
M1 - 102193
ER -