Improving the detection of privacy risk in synthetic EHRs

Meeting Abstract
DOI: 10.3205/25gmds135
URN: urn:nbn:de:0183-25gmds1359

Xenia F. Gerloff

Institute for Applied Medical Informatics (IAM), Center for Experimental Medicine, University Hospital Hamburg-Eppendorf (UKE), Hamburg, Germany

Michael Größler

Institute for Applied Medical Informatics (IAM), Center for Experimental Medicine, University Hospital Hamburg-Eppendorf (UKE), Hamburg, Germany

Layla Tabea Riemann

Institute for Applied Medical Informatics (IAM), Center for Experimental Medicine, University Hospital Hamburg-Eppendorf (UKE), Hamburg, Germany

Publisher: German Medical Science GMS Publishing House, Düsseldorf

Keywords: synthetic data, privacy, electronic health records, attribute inference attack

Published: 2025-11-03. Language: English.

License: This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License.

Meeting: 70. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS) (70th Annual Meeting of the German Association for Medical Informatics, Biometry and Epidemiology), Jena, 2025-09-07 to 2025-09-11. Session PS 6: Synthetic data, privacy & consent. Abstr. 213.

Introduction: Hospitals routinely store electronic health records (EHRs), but privacy protection laws allow this data to be distributed only under severe constraints. Recent research aims to solve this problem by fitting a generative model to a cohort in a secure environment and creating synthetic EHRs that share the statistical properties of the original cohort while preserving the real patients’ privacy. Easy access to synthetic EHRs from different hospitals would accelerate research on rare diseases and facilitate the training of AI models that assist medical professionals, for example in deciding on treatments. However, before synthetic EHRs are shared, their privacy risk must be measured by running so-called privacy tests. Here, we propose an interpretable definition of the data’s privacy risk and a novel privacy test that is unique in its ability to make statistically precise statements.

State of the art: Privacy tests usually assess how successfully an attacker can use the synthetic data set to gain information on individual patients and compare this success to a baseline. Most recent publications on synthetic EHRs [1], [2], [3], [4], [5] tested an Attribute Inference Attack (AIA), in which the attacker has partial knowledge of some attributes of a real patient and infers the missing attributes from the synthetic records that match this partial knowledge.
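A minimal Python sketch may make the attack setup concrete. It assumes categorical attributes, exact matching on the known attributes, and a majority vote over the matching synthetic records. The abstract does not specify the inference rule, and the function names and toy records below are hypothetical and not taken from any of the cited works.

```python
from collections import Counter


def infer_attribute(synthetic, known, target):
    """Guess `target` from the synthetic records that match `known`.

    Assumption: the attacker picks the most frequent target value among
    the matching synthetic records. Returns None if nothing matches.
    """
    matches = [
        row[target]
        for row in synthetic
        if all(row.get(attr) == value for attr, value in known.items())
    ]
    if not matches:
        return None
    return Counter(matches).most_common(1)[0][0]


def attack_success_rate(real, synthetic, known_attrs, target):
    """Fraction of real patients whose target attribute is guessed correctly."""
    hits = 0
    for patient in real:
        known = {attr: patient[attr] for attr in known_attrs}
        guess = infer_attribute(synthetic, known, target)
        hits += guess == patient[target]  # a missing guess counts as a miss
    return hits / len(real)


# Toy records, invented purely for illustration.
synthetic = [
    {"age_group": "60-69", "sex": "f", "diagnosis": "diabetes"},
    {"age_group": "60-69", "sex": "f", "diagnosis": "diabetes"},
    {"age_group": "60-69", "sex": "f", "diagnosis": "asthma"},
    {"age_group": "30-39", "sex": "m", "diagnosis": "asthma"},
]
real = [
    {"age_group": "60-69", "sex": "f", "diagnosis": "diabetes"},
    {"age_group": "30-39", "sex": "m", "diagnosis": "migraine"},
    {"age_group": "80-89", "sex": "f", "diagnosis": "asthma"},
    {"age_group": "30-39", "sex": "m", "diagnosis": "asthma"},
]

rate = attack_success_rate(real, synthetic, ["age_group", "sex"], "diagnosis")
print(rate)  # 0.5
```

In this toy setting the attacker knows the age group and sex of each real patient and guesses the diagnosis. A privacy test would compare the resulting success rate to a baseline.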
So far, there is no consensus on how to choose the known attributes. The baseline is given by the average success of the same attack when it is based on an independent real data set instead of the synthetic data.

Concept: The ideal synthetic data set consists of samples drawn independently from the true distribution of the real data. This ideal poses no privacy risk to any patient in the original cohort. Hence, we define privacy risk as the difference between an attack’s success based on the synthetic data and its success based on the true distribution. We apply this definition to test AIAs by setting the baseline to a lower bound on the success based on the true distribution. The bound is determined by confidence intervals computed from the real data set. Additionally, we automate the selection of known attributes by randomly adding known attributes until no synthetic match exists for any patient, and we repeat this process to increase the trustworthiness of the test. Our privacy test bounds the synthetic data’s deviation from the minimum privacy risk with respect to AIAs at an explicit confidence level and systematizes the selection of known attributes.

Implementation: We implemented the test in a Python package (https://github.com/xeniagerloff/StatAIT) that offers automated privacy tests with minimal hyperparameter choices and tools for the detailed analysis of the results.

Lessons learned: To the best of our knowledge, we are the first to formulate a privacy test that bounds the privacy risk posed by a synthetic data set at an explicit confidence level. Our Python package enables rigorous and interpretable privacy assessment, which benefits the medical community and patients by increasing trust in synthetic EHRs.
In addition to the significant benefits of facilitated data sharing, privacy-preserving synthetic data will further improve patient support for data-driven research.

The authors declare that they have no competing interests.

The authors declare that an ethics committee vote is not required.

References:

[1] Yuan H, Zhou S, Yu S. EHRDiff: Exploring Realistic EHR Synthesis with Diffusion Models [Preprint]. arXiv. 2023. DOI: 10.48550/arxiv.2303.05656

[2] Sun H, Lin H, Yan R. Collaborative synthesis of patient records through multi-visit health state inference. Proceedings of the AAAI Conference on Artificial Intelligence. 2024;38(17):19044-52. DOI: 10.1609/aaai.v38i17.29871

[3] Yoon J, Mizrahi M, Ghalaty NF, Jarvinen T, Ravi AS, Brune P, Kong F, Anderson D, Lee G, Meir A, Bandukwala F. EHR-Safe: generating high-fidelity and privacy-preserving synthetic electronic health records. NPJ digital medicine. 2023;6(1):141. DOI: 10.1038/s41746-023-00888-7

[4] Theodorou B, Xiao C, Sun J. Synthesize high-dimensional longitudinal electronic health records via hierarchical autoregressive language model. Nat Commun. 2023;14(1):5305. DOI: 10.1038/s41467-023-41093-0

[5] Das T, Wang Z, Sun J. TWIN: Personalized clinical trial digital twin generation. In: Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. Long Beach CA USA: ACM; 2023. p. 402-13. DOI: 10.1145/3580305.3599534
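
As an illustration of the Concept section, the following Python sketch shows one way the two ideas could be combined: bounding the privacy risk by subtracting a lower confidence bound on the baseline success from the attack success on the synthetic data, and growing the set of known attributes in random order until no patient has a synthetic match. It is not the StatAIT implementation. The abstract does not state how the confidence intervals are computed, so the one-sided Hoeffding bound is an assumption, as are all function names, parameters, and example numbers. The sketch also treats the observed success rate on the synthetic data as fixed and the baseline outcomes as independent.

```python
import math
import random


def hoeffding_lower_bound(successes, n, alpha=0.05):
    """One-sided lower confidence bound for a success probability.

    Assumption: n independent 0/1 outcomes. By Hoeffding's inequality the
    true rate is at least this value with probability >= 1 - alpha.
    """
    margin = math.sqrt(math.log(1.0 / alpha) / (2.0 * n))
    return max(0.0, successes / n - margin)


def risk_upper_bound(synthetic_rate, baseline_successes, n_baseline, alpha=0.05):
    """Attack success on synthetic data minus a lower bound on the baseline."""
    lower = hoeffding_lower_bound(baseline_successes, n_baseline, alpha)
    return synthetic_rate - lower


def has_match(record, synthetic, attrs):
    """True if any synthetic record agrees with `record` on all of `attrs`."""
    return any(all(s[a] == record[a] for a in attrs) for s in synthetic)


def growing_known_attribute_sets(real, synthetic, candidates, rng):
    """Yield known-attribute sets of increasing size, added in random order.

    Stops as soon as no real record has a synthetic match on the chosen set.
    """
    order = list(candidates)
    rng.shuffle(order)
    known = []
    for attr in order:
        known.append(attr)
        if not any(has_match(r, synthetic, known) for r in real):
            break
        yield tuple(known)


# Hypothetical numbers: the attack succeeds for 62% of patients when it uses
# the synthetic data, and for 110 of 200 patients in the baseline setting.
bound = risk_upper_bound(0.62, baseline_successes=110, n_baseline=200, alpha=0.05)
print(f"privacy risk is at most {bound:.3f} at confidence level 0.95")

# Repeating the random selection yields different attribute sets to test.
real = [{"a": 1, "b": 2, "c": 3}]
synthetic = [{"a": 1, "b": 2, "c": 9}]
for seed in range(3):
    sets = list(growing_known_attribute_sets(real, synthetic, "abc", random.Random(seed)))
    print(seed, sets)
```

Each yielded attribute set would be passed to an attack such as an AIA. Repeating the random selection with different seeds covers several orderings of the known attributes.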