Addressing Data Scarcity with Synthetic Data: A Secure and GDPR-compliant Cloud-Based Platform

Nemania Borovits, Gianluigi Bardelloni, Hossein Hashemi, Masoom Tulsiani, Damian Andrew Tamburri, Willem-Jan van den Heuvel

Research output: Contribution to journal, Article, Academic, peer-review

Abstract

This study presents a cloud-based platform for synthetic data generation, validation and evaluation, developed to address data scarcity in the telecommunications sector while ensuring compliance with the General Data Protection Regulation (GDPR). In collaboration with a Dutch telecommunications provider facing data scarcity due to low user-consent rates, we developed a platform that allows synthetic data vendors to securely generate synthetic data based on schema input without accessing sensitive information. Vendors uploaded containerized executables for synthetic data generation and the platform automated infrastructure provisioning, ensuring no access to personal data. A validation mechanism minimized the risk of re-identification by ensuring that the synthetic data did not inadvertently replicate real data points. We mutually agreed with the vendors on five evaluation metrics and the platform logged and calculated performance for each, allowing them to refine their algorithms. To validate the platform’s performance, we conducted an offline study with the TV viewership team, using each vendor’s synthetic data to generate viewership categories. The vendor with the best evaluation metrics also produced categories most similar to the real data, confirming the platform’s effectiveness. This study, involving two vendors and a telecommunications company, demonstrated the platform’s applicability in addressing business challenges while ensuring privacy compliance.
Original language: English
Journal: ACM Transactions on Software Engineering and Methodology
Volume: XX
Issue number: X
Early online date: 29 Apr 2025
Publication status: E-pub ahead of print - 29 Apr 2025
