Measuring the Utility of Synthetic Data : An Empirical Evaluation of Population Fidelity Measures as Indicators of Synthetic Data Utility in Classification Tasks

University essay from Karlstads universitet/Institutionen för matematik och datavetenskap (from 2013)

Abstract: In the era of data-driven decision-making and innovation, synthetic data serves as a promising tool that bridges the need for vast datasets in machine learning (ML) and the imperative necessity of data privacy. By simulating real-world data while preserving privacy, synthetic data generators have become more prevalent instruments in AI and ML development. A key challenge with synthetic data lies in accurately estimating its utility. For such purpose, Population Fidelity (PF) measures have shown to be good candidates, a category of metrics that evaluates how well the synthetic data mimics the general distribution of the original data. With this setting, we aim to answer: "How well are different population fidelity measures able to indicate the utility of synthetic data for machine learning based classification models?" We designed a reusable six-step experiment framework to examine the correlation between nine PF measures and the performance of four ML for training classification models over five datasets. The six-step approach includes data preparation, training, testing on original and synthetic datasets, and PF measures computation. The study reveals non-linear relationships between the PF measures and synthetic data utility. The general analysis, meaning the monotonic relationship between the PF measure and performance over all models, yielded at most moderate correlations, where the Cluster measure showed the strongest correlation. In the more granular model-specific analysis, Random Forest showed strong correlations with three PF measures. The findings show that no PF measure shows a consistently high correlation over all models to be considered a universal estimator for model performance.This highlights the importance of context-aware application of PF measures and sets the stage for future research to expand the scope, including support for a wider range of types of data and integrating privacy evaluations in synthetic data assessment. Ultimately, this study contributes to the effective and reliable use of synthetic data, particularly in sensitive fields where data quality is vital.

  AT THIS PAGE YOU CAN DOWNLOAD THE WHOLE ESSAY. (follow the link to the next page)