TY - JOUR
T1 - Addressing Data Scarcity with Synthetic Data
T2 - A Secure and GDPR-compliant Cloud-Based Platform
AU - Borovits, Nemania
AU - Bardelloni, Gianluigi
AU - Hashemi, Hossein
AU - Tulsiani, Masoom
AU - Tamburri, Damian Andrew
AU - van den Heuvel, Willem-Jan
PY - 2025/4/29
Y1 - 2025/4/29
N2 - This study presents a cloud-based platform for synthetic data generation, validation and evaluation, developed to address data scarcity in the telecommunications sector while ensuring compliance with the General Data Protection Regulation (GDPR). In collaboration with a Dutch telecommunications provider facing data scarcity due to low user-consent rates, we developed a platform that allows synthetic data vendors to securely generate synthetic data based on schema input without accessing sensitive information. Vendors uploaded containerized executables for synthetic data generation and the platform automated infrastructure provisioning, ensuring no access to personal data. A validation mechanism minimized the risk of re-identification by ensuring that the synthetic data did not inadvertently replicate real data points. We mutually agreed with the vendors on five evaluation metrics and the platform logged and calculated performance for each, allowing them to refine their algorithms. To validate the platform’s performance, we conducted an offline study with the TV viewership team, using each vendor’s synthetic data to generate viewership categories. The vendor with the best evaluation metrics also produced categories most similar to the real data, confirming the platform’s effectiveness. This study, involving two vendors and a telecommunications company, demonstrated the platform’s applicability in addressing business challenges while ensuring privacy compliance.
AB - This study presents a cloud-based platform for synthetic data generation, validation and evaluation, developed to address data scarcity in the telecommunications sector while ensuring compliance with the General Data Protection Regulation (GDPR). In collaboration with a Dutch telecommunications provider facing data scarcity due to low user-consent rates, we developed a platform that allows synthetic data vendors to securely generate synthetic data based on schema input without accessing sensitive information. Vendors uploaded containerized executables for synthetic data generation and the platform automated infrastructure provisioning, ensuring no access to personal data. A validation mechanism minimized the risk of re-identification by ensuring that the synthetic data did not inadvertently replicate real data points. We mutually agreed with the vendors on five evaluation metrics and the platform logged and calculated performance for each, allowing them to refine their algorithms. To validate the platform’s performance, we conducted an offline study with the TV viewership team, using each vendor’s synthetic data to generate viewership categories. The vendor with the best evaluation metrics also produced categories most similar to the real data, confirming the platform’s effectiveness. This study, involving two vendors and a telecommunications company, demonstrated the platform’s applicability in addressing business challenges while ensuring privacy compliance.
U2 - 10.1145/3732937
DO - 10.1145/3732937
M3 - Article
SN - 1049-331X
VL - XX
JO - ACM Transactions on Software Engineering and Methodology
JF - ACM Transactions on Software Engineering and Methodology
IS - X
ER -