Yue Qi, Lorrie Herbault, Hadrien Lautraite, Michael, Katleen Blanchet, Christian Vincelette, Louis Mullie, Guillaume Dumas, Jean-François Rajotte, Kamran Afzali, Sébastien Gambs, Michaël Chassé
Artificial Intelligence in the Life Sciences
Publication year: 2026

Abstract

Background

Synthetic data enables open and efficient medical research by enhancing real-world data. We examined the utility-privacy trade-off of synthetic data generated with and without differential privacy (DP) using CLOVER, a novel open-source Python library that we have developed.

Methods

We generated synthetic datasets based on data from MIMIC-III (24 variables, n = 15,118) and eICU (23 variables, n = 3,726). The generative approaches used were SMOTE, DataSynthesizer, Synthpop, MST, CTGAN, TVAE, CTABGAN+, and FinDiff, with and without DP. We evaluated the utility and privacy of the generated datasets based on univariate, bivariate, and population fidelity; analysis-specific and distance-based metrics; and membership inference attacks (MIAs). We benchmarked the synthetic datasets using rank-derived scores for utility and privacy. We examined the impact of DP on machine learning (ML) performance and MIAs and analyzed the achievable utility-privacy trade-off by generating synthetic data across a range of privacy regimes. We compared computational resource usage across generators.
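Three of the metrics named above, Hellinger distance (univariate fidelity) and DCR and NNDR (distance-based privacy), can be illustrated with a minimal NumPy sketch. It assumes numeric, pre-scaled data, histogram binning, and Euclidean distance. The function names `hellinger_distance` and `dcr_and_nndr` are hypothetical and are not taken from CLOVER, whose implementation may differ in binning, distance metric, and preprocessing.

```python
import numpy as np


def hellinger_distance(real, synth, bins=20):
    """Hellinger distance between histograms of one numeric variable.

    Returns a value in [0, 1]; 0 means identical binned distributions.
    """
    real = np.asarray(real, dtype=float)
    synth = np.asarray(synth, dtype=float)
    # Shared bin edges so that both histograms are directly comparable.
    edges = np.histogram_bin_edges(np.concatenate([real, synth]), bins=bins)
    p, _ = np.histogram(real, bins=edges)
    q, _ = np.histogram(synth, bins=edges)
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))


def dcr_and_nndr(real, synth):
    """Per synthetic row: distance to the closest real record (DCR) and
    nearest-neighbour distance ratio (NNDR = nearest / second nearest).

    Assumes at least two real rows. Builds the full pairwise distance
    matrix, so it is only suitable for small illustrative inputs.
    """
    real = np.asarray(real, dtype=float)
    synth = np.asarray(synth, dtype=float)
    # Pairwise Euclidean distances, shape (n_synth, n_real).
    diffs = synth[:, None, :] - real[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    two_nearest = np.sort(dists, axis=1)[:, :2]
    dcr = two_nearest[:, 0]
    second = two_nearest[:, 1]
    # When the second-nearest distance is 0 the ratio is undefined;
    # report 0 there (assumption: treat it as the riskiest case).
    nndr = np.divide(dcr, second, out=np.zeros_like(dcr), where=second > 0)
    return dcr, nndr


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    real_rows = rng.normal(size=(200, 3))   # stand-in for real records
    synth_rows = rng.normal(size=(50, 3))   # stand-in for synthetic records
    print("Hellinger (column 0):",
          hellinger_distance(real_rows[:, 0], synth_rows[:, 0]))
    dcr, nndr = dcr_and_nndr(real_rows, synth_rows)
    print("median DCR:", np.median(dcr), "median NNDR:", np.median(nndr))
```

Smaller Hellinger distances indicate closer univariate fidelity, whereas larger DCR and NNDR values indicate that synthetic records sit further from individual real records.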

Findings

When DP constraints were fully relaxed, MST (ε = 10⁵ and δ = 0·9999) ranked as the most private generator on MIMIC-III and the second most private on eICU (DCR and NNDR well above baseline for both datasets; top 1% precision for MIAs below 0·53 and 0·62 for MIMIC-III and eICU, respectively) but placed seventh of eight in utility for both datasets. Conversely, Synthpop ranked first in utility for both datasets. It achieved Hellinger distances of 0·88 × 10⁻² and 1·41 × 10⁻²; pairwise correlation differences of 0·31 and 0·68; distinguishability of 0·02 × 10⁻¹ and 0·02 × 10⁻¹; AUC differences of 0·20 × 10⁻¹ and 0·10 × 10⁻¹ on the classification task; and RMSE differences of 2·53 × 10⁻² and 13·64 × 10⁻² on the regression tasks for MIMIC-III and eICU, respectively. However, it ranked seventh in privacy for both datasets. SMOTE and TVAE were each outperformed by at least one other generator in both utility and privacy, based on the rank-derived scores, for both datasets. Once DP was introduced, utility decreased across all algorithms, and no method consistently outperformed the others across all privacy regimes.
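For readers unfamiliar with the ε and δ parameters reported above, the standard definition of (ε, δ)-differential privacy is restated here. This is the textbook formulation, given as an assumption about the notion the abstract refers to; the abstract itself does not state it. A randomized mechanism M satisfies it if, for all datasets D and D′ differing in one record and all sets of outputs S:

```latex
\Pr[\,M(D) \in S\,] \;\le\; e^{\varepsilon}\,\Pr[\,M(D') \in S\,] + \delta
```

Because every probability is at most 1, a δ of 0·9999 makes the right-hand side at least 0·9999 and the inequality close to vacuous, which is consistent with describing that setting as fully relaxed.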

Interpretation

There is a trade-off between utility and privacy in the non-DP setting. DP reduced utility but ensured a consistent level of privacy, allowing a fair comparison of utility across generators. Selecting an appropriate generator depends on privacy needs, the intended use case, and the user’s available resources.

Keywords

Synthetic data, Healthcare data, Differential privacy.
