70. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V.
Creating synthetic data for the presentation of a post-COVID algorithm
2Hannover Medical School, Hannover, Germany
3University of Göttingen, Campus Institute Data Science, Section of Medical Data Science, Göttingen, Germany
Text
Introduction: Developing and testing new diagnostic algorithms to improve public health often depends on access to restricted medical data, such as electronic health records (EHR). The sensitive nature of patient data raises consent, privacy, and confidentiality issues. Synthetic data could bypass these obstacles in research and development. It enables researchers to present and demonstrate example applications of such algorithms and supports knowledge dissemination.
State of the art: Azhir et al. [1] developed an algorithm to detect post-acute sequelae of COVID-19 (PASC) by excluding sequelae that prior conditions can explain. The algorithm requires ICD-10-encoded longitudinal EHR data as input [1].
To generate synthetic data, the original data is statistically modeled, and the resulting models are then used to generate new records that match the statistical properties of the original data [2]. Several tools are available for data creation. Synthea™ is open-source software that simulates the lifespan of synthetic patients by modeling realistic EHR data using a generic module framework. It includes modules for various diseases, including COVID-19, and exports patient data and conditions as SNOMED codes [3].
Because SNOMED and ICD-10 are both established and widely used medical terminologies, a tool for mapping SNOMED codes to ICD-10 is available [4].
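The idea of translating SNOMED-coded conditions into ICD-10 can be illustrated with a minimal lookup sketch. This is not the mapper tool cited in [4]. The three-entry table, the record layout (`patient`, `code`), and the function name `map_conditions` are assumptions made for illustration only. A real mapping is considerably more involved and can depend on context such as age or sex.

```python
# Illustrative excerpt only (assumption): three well-known code pairs.
# A real mapping file is far larger and is obtained separately.
SNOMED_TO_ICD10 = {
    "840539006": "U07.1",  # COVID-19
    "44054006": "E11.9",   # type 2 diabetes mellitus
    "38341003": "I10",     # hypertensive disorder
}


def map_conditions(records, mapping):
    """Attach an ICD-10 code to each SNOMED-coded condition record.

    Returns the mapped records and a sorted list of SNOMED codes
    that have no entry in the mapping, so gaps can be reviewed.
    """
    mapped = []
    unmapped = set()
    for rec in records:
        icd10 = mapping.get(rec["code"])
        if icd10 is None:
            unmapped.add(rec["code"])
            continue
        # Copy the record so the input list is left unchanged.
        mapped.append({**rec, "icd10": icd10})
    return mapped, sorted(unmapped)
```

Returning the unmapped codes separately, rather than dropping them silently, makes it easier to see how much of the synthetic record is lost in translation.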
Concept: To present the PASC phenotyping algorithm [1], each synthetic patient record requires at least one COVID-19 diagnosis, three years of baseline data before that diagnosis, and at least one year of follow-up data. Therefore, synthetic data was created using the Synthea™ tool and filtered for patients with a COVID-19 diagnosis. The SNOMED codes were mapped to ICD-10 to match the input format of the PASC algorithm. The required Charlson index was computed from the patient data before the required time periods were extracted. Because post-COVID is not included in the Synthea™ COVID-19 module, a subset of PASC-associated symptoms was randomly added to patients' EHRs, matching the distribution in the original study. Subsequently, the created health data was assessed using the PASC algorithm.
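The selection and enrichment steps described above can be sketched as follows. This is a simplified illustration, not the published scripts [5]. It assumes that each patient is a list of records with a `date` and an `icd10` field, that the observation period is approximated by the earliest and latest record, and that symptoms are placed 30 to 365 days after the first COVID-19 diagnosis. The symptom codes and rates in the usage note are placeholders and are not taken from the original study.

```python
import random
from datetime import date, timedelta

COVID_ICD10 = "U07.1"
BASELINE = timedelta(days=3 * 365)  # three years before the diagnosis
FOLLOW_UP = timedelta(days=365)     # one year after the diagnosis


def first_covid_date(records):
    """Return the date of the earliest COVID-19 diagnosis, or None."""
    dates = [r["date"] for r in records if r["icd10"] == COVID_ICD10]
    return min(dates) if dates else None


def is_eligible(records):
    """Check for a COVID-19 diagnosis with enough baseline and follow-up."""
    index = first_covid_date(records)
    if index is None:
        return False
    all_dates = [r["date"] for r in records]
    has_baseline = min(all_dates) <= index - BASELINE
    has_follow_up = max(all_dates) >= index + FOLLOW_UP
    return has_baseline and has_follow_up


def inject_symptoms(records, symptom_rates, rng):
    """Randomly add symptom records after the first COVID-19 diagnosis.

    symptom_rates maps an ICD-10 code to the probability that a patient
    receives it. The 30 to 365 day offset is a placeholder assumption.
    """
    index = first_covid_date(records)
    enriched = list(records)
    if index is None:
        return enriched
    for code, rate in symptom_rates.items():
        if rng.random() < rate:
            offset = timedelta(days=rng.randint(30, 365))
            enriched.append({"date": index + offset, "icd10": code})
    return sorted(enriched, key=lambda r: r["date"])
```

A call might look like `inject_symptoms(patient, {"R53": 0.2, "R06.0": 0.1}, random.Random(42))`, applied only to patients for which `is_eligible` returns `True`. Passing a seeded `random.Random` instance keeps the generated data set reproducible.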
Implementation: Data generation and processing are implemented in Python scripts, which combine and document all steps and are available on GitLab [5]. The synthetic data was iteratively analyzed with the post-COVID algorithm by integrating the data into a corresponding Docker container and executing the R scripts it contains.
Lessons learned: During the development of synthetic data, several important lessons were learned. First, all Synthea™ data events are created on the basis of underlying synthetic diseases, which contradicts the idea that post-COVID symptoms cannot be explained by prior conditions. This challenges the algorithm's identification of non-associated events. The subsequent manipulation of EHRs has to match the patient's health problems, age, and gender to enable the detection of temporal correlations. Second, even though a SNOMED-to-ICD-10 mapping tool is openly available, it requires a mapping file that is subject to a licensing process via the National Library of Medicine.
Finally, when generating synthetic data, one must always be clear about the use case and whether the studied problem is sufficiently captured in the data. Despite this, using Synthea™ to develop synthetic data for algorithm testing is easy to implement and can support educational materials for presenting the PASC algorithm.
The authors declare that they have no competing interests.
The authors declare that an ethics committee vote is not required.
References
[1] Azhir A, Hügel J, Tian J, Cheng J, Bassett IV, Bell DS, et al. Precision phenotyping for curating research cohorts of patients with unexplained post-acute sequelae of COVID-19. Med. 2025;6(3):100532. DOI: 10.1016/j.medj.2024.10.009
[2] Pezoulas VC, Zaridis DI, Mylona E, Androutsos C, Apostolidis K, Tachos NS, et al. Synthetic data generation methods in healthcare: A review on open-source tools and methods. Comput Struct Biotechnol J. 2024;23:2892–910. DOI: 10.1016/j.csbj.2024.07.005
[3] Walonoski J, Klaus S, Granger E, Hall D, Gregorowicz A, Neyarapally G, et al. Synthea™ Novel coronavirus (COVID-19) model and synthetic data set. Intell Based Med. 2020;100007. DOI: 10.1016/j.ibmed.2020.100007
[4] SNOMED International. snomed-to-icd-10-mapper. 2024 [cited 2025 Apr 23]. Available from: https://github.com/IHTSDO/snomed-to-icd-10-mapper/releases/tag/2.0.0
[5] Tharra T, Hügel J. Synthetic PASC Data Generation. 2025 [cited 2025 Apr 23]. Available from: https://gitlab.gwdg.de/medinfpub/synthetic-pasc-data-generation



