
70. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V.

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V. (GMDS)
07.-11.09.2025
Jena


Meeting Abstract

Improving the detection of privacy risk in synthetic EHRs

Xenia F. Gerloff 1
Michael Größler 1
Layla Tabea Riemann 1
1Institute for Applied Medical Informatics (IAM), Center for Experimental Medicine, University Hospital Hamburg-Eppendorf (UKE), Hamburg, Germany

Text

Introduction: Hospitals routinely store electronic health records (EHRs), but privacy protection laws allow these data to be shared only under severe constraints. Recent research aims to solve this problem by fitting a generative model to a cohort in a secure environment and generating synthetic EHRs that share the statistical properties of the original cohort while preserving the real patients’ privacy. Easy access to synthetic EHRs from different hospitals would accelerate research on rare diseases and facilitate the training of AI models that assist medical professionals, e.g., in deciding on treatments. However, before synthetic EHRs are shared, their privacy risk must be measured by running so-called privacy tests. Here, we propose an interpretable definition of the privacy risk of synthetic data and a novel privacy test that is unique in its ability to make statistically precise statements.

State of the art: Privacy tests usually assess how successfully an attacker can use the synthetic data set to gain information on individual patients and compare this success to a baseline. Most recent publications on synthetic EHRs [1], [2], [3], [4], [5] tested an Attribute Inference Attack (AIA), in which the attacker knows some attributes of a real patient and infers the missing attributes from the synthetic records that match this partial knowledge. So far, there is no consensus on how to choose the known attributes. The baseline is the average success of the same attack when it is based on an independent real data set instead of the synthetic data.
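
The attack and its conventional baseline can be illustrated with a minimal sketch. It assumes categorical records stored as Python dictionaries and a majority-vote inference rule; the function names, the toy records, and the voting rule are illustrative assumptions and are not taken from the cited publications.

```python
from collections import Counter

def infer_attribute(target, known_attrs, secret_attr, reference):
    """Guess target[secret_attr] by majority vote over matching reference records."""
    matches = [r[secret_attr] for r in reference
               if all(r[a] == target[a] for a in known_attrs)]
    if not matches:
        return None  # no matching record: the attacker makes no guess
    return Counter(matches).most_common(1)[0][0]

def attack_success(targets, known_attrs, secret_attr, reference):
    """Fraction of targets whose secret attribute is inferred correctly."""
    hits = sum(
        infer_attribute(t, known_attrs, secret_attr, reference) == t[secret_attr]
        for t in targets
    )
    return hits / len(targets)

# Hypothetical toy data: two real patients, a synthetic set, an independent real set.
targets = [
    {"age": "60-69", "sex": "f", "dx": "I10"},
    {"age": "60-69", "sex": "m", "dx": "E11"},
]
synthetic = [
    {"age": "60-69", "sex": "f", "dx": "I10"},
    {"age": "60-69", "sex": "m", "dx": "E11"},
    {"age": "70-79", "sex": "f", "dx": "I10"},
]
holdout = [
    {"age": "60-69", "sex": "f", "dx": "I10"},
    {"age": "60-69", "sex": "m", "dx": "I10"},
]

known = ["age", "sex"]
success_synthetic = attack_success(targets, known, "dx", synthetic)
success_baseline = attack_success(targets, known, "dx", holdout)
```

In this toy case the attack succeeds more often with the synthetic data than with the independent real data, which is the kind of gap a privacy test is meant to flag.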

Concept: The ideal synthetic data set consists of independently drawn samples from the true distribution of the real data. This ideal poses no privacy risk to any patient in the original cohort. Hence, we define privacy risk as the difference between an attack’s success based on synthetic data and its success based on the true distribution.
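
Written as a formula, the definition reads as follows. The notation is an assumption introduced here for illustration: A denotes an attack, D_syn the synthetic data set, P the true distribution of the real data, and S the attack's success.

```latex
R(A, D_{\mathrm{syn}}) \;=\; S(A \mid D_{\mathrm{syn}}) \;-\; S(A \mid P)
```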

We apply this definition to test AIAs by setting the baseline to a lower bound on the success based on the true distribution. The bound is determined by confidence intervals computed from the real data set. Additionally, we automate the selection of known attributes by randomly adding known attributes until no patient has a matching synthetic record, and we repeat this process to increase the trustworthiness of the test. Our privacy test bounds the synthetic data’s deviation from the minimum privacy risk with respect to AIAs at an explicit confidence level and systematizes the selection of known attributes.
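
The two ingredients can be sketched as follows. This is not the implementation in our package: the one-sided Hoeffding bound stands in for whichever confidence interval is actually used, it assumes independent per-patient attack outcomes, and all function names and toy values are hypothetical. Because the baseline is a lower bound on the success under the true distribution, subtracting it from the success on synthetic data gives an upper bound on the privacy risk at the chosen confidence level.

```python
import math
import random

def hoeffding_lower_bound(hits, n, alpha):
    """One-sided (1 - alpha) lower confidence bound on a success probability.

    Assumption: n independent attack outcomes, of which `hits` were successful.
    """
    return max(0.0, hits / n - math.sqrt(math.log(1.0 / alpha) / (2.0 * n)))

def has_match(patient, known_attrs, synthetic):
    """True if some synthetic record agrees with the patient on all known attributes."""
    return any(all(s[a] == patient[a] for a in known_attrs) for s in synthetic)

def grow_known_attributes(patients, synthetic, candidate_attrs, rng):
    """Add attributes in random order until no patient has a synthetic match.

    Returns the nested attribute lists for which at least one match still existed.
    """
    order = list(candidate_attrs)
    rng.shuffle(order)
    known, steps = [], []
    for attr in order:
        known.append(attr)
        if not any(has_match(p, known, synthetic) for p in patients):
            break  # no patient is matched any more: stop adding attributes
        steps.append(list(known))
    return steps

# Hypothetical numbers: 80 of 100 attacks succeed when based on real data.
baseline = hoeffding_lower_bound(hits=80, n=100, alpha=0.05)
risk_upper_bound = 0.85 - baseline  # 0.85: assumed success based on synthetic data

# Repeat the random attribute selection with different seeds.
patients = [{"age": "60-69", "sex": "f", "ward": "A"}]
synthetic = [{"age": "60-69", "sex": "f", "ward": "B"}]
runs = [
    grow_known_attributes(patients, synthetic, ["age", "sex", "ward"], random.Random(seed))
    for seed in range(5)
]
```

Each run yields a different random sequence of known-attribute sets on which the attack can then be evaluated.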

Implementation: We implemented the test in a self-developed Python package (https://github.com/xeniagerloff/StatAIT) that offers automated privacy tests with minimal hyperparameter choices and tools for detailed analysis of the results.

Lessons learned: To the best of our knowledge, we are the first to formulate a privacy test that bounds the privacy risk posed by a synthetic data set at an explicit confidence level. Our Python package enables rigorous and interpretable privacy assessment, benefiting the medical community and patients by increasing trust in synthetic EHRs. In addition to the significant benefits of facilitated data sharing, privacy-preserving synthetic data will further improve patient support for data-driven research.

The authors declare that they have no competing interests.

The authors declare that an ethics committee vote is not required.


References

[1] Yuan H, Zhou S, Yu S. EHRDiff: Exploring Realistic EHR Synthesis with Diffusion Models [Preprint]. arXiv. 2023. DOI: 10.48550/arxiv.2303.05656
[2] Sun H, Lin H, Yan R. Collaborative synthesis of patient records through multi-visit health state inference. Proceedings of the AAAI Conference on Artificial Intelligence. 2024;38(17):19044-52. DOI: 10.1609/aaai.v38i17.29871
[3] Yoon J, Mizrahi M, Ghalaty NF, Jarvinen T, Ravi AS, Brune P, Kong F, Anderson D, Lee G, Meir A, Bandukwala F. EHR-Safe: generating high-fidelity and privacy-preserving synthetic electronic health records. NPJ Digit Med. 2023;6(1):141. DOI: 10.1038/s41746-023-00888-7
[4] Theodorou B, Xiao C, Sun J. Synthesize high-dimensional longitudinal electronic health records via hierarchical autoregressive language model. Nat Commun. 2023;14(1):5305. DOI: 10.1038/s41467-023-41093-0
[5] Das T, Wang Z, Sun J. TWIN: Personalized clinical trial digital twin generation. In: Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. Long Beach, CA, USA: ACM; 2023. p. 402-13. DOI: 10.1145/3580305.3599534