70. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V.
Synthesizing realistic cancer registry data – an analysis using two different approaches
Introduction: Open Science thrives not only on publishing results, but also on sharing the underlying data and analysis scripts. Synthetic data are currently used mostly in the context of clinical trials, for example to increase case numbers, to serve as control data, or to enable broader sharing of trial data. When analysing cancer registry (CR) data, which can be requested for specific research purposes, small case numbers are seldom a problem; because of privacy concerns, however, the data cannot be shared beyond the researchers who requested them. Realistic synthetic data might overcome this problem.
Methods: We used data on lung cancer cases diagnosed between 2016 and 2019 from four German CRs, consisting of nine variables: federal state (4 levels), ICD-10 diagnosis (6 levels), grouped morphology codes (8 levels), UICC stage (5 levels), date of diagnosis (days since 1970-01-01), age at diagnosis (years), sex (2 levels), vital status at end of follow-up (2 levels), and follow-up time (days). We synthesized data from three different datasets: the full dataset, a homogeneous subset containing only stage IV small cell lung cancer (esSCLC) cases, and a random subset of the same size as the esSCLC subset.
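As an illustration, the three analysis datasets could be built along the following lines. This is only a sketch: the object name registry_data and the column names uicc_stage and morph_group are assumptions, not the registries' actual variable names.

```r
## Illustrative construction of the three analysis datasets.
## 'registry_data' holds all lung cancer cases diagnosed 2016-2019 (assumed name).
full   <- registry_data

## Homogeneous subset: stage IV small cell lung cancer (esSCLC) cases only
essclc <- subset(full, uicc_stage == "IV" & morph_group == "SCLC")

## Random subset of the same size as the esSCLC subset
set.seed(1)
random <- full[sample(nrow(full), size = nrow(essclc)), ]
```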
Using two different synthesis methods from the R packages arf [1] and synthpop [2], we generated datasets of varying size (n = 10%–100% of the original data, in 10% increments) using different proportions of the original data as input for the synthesis (p = 25%–100%, in 25% increments). Every combination of n and p was repeated ten times.
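A single synthesis run could look roughly as follows, continuing from the previous sketch. The calls follow the documented arf workflow (adversarial_rf, forde, forge) and synthpop's syn(); this is an illustrative sketch of the setup, not the exact analysis script, and argument names should be checked against the installed package versions.

```r
## Sketch of one synthesis run (here n = 80% of the original size, p = 50% of
## the original data used as input to the synthesizer).
library(arf)       # adversarial random forests [1]
library(synthpop)  # sequential CART-based synthesis [2]

p <- 0.50
n <- round(0.80 * nrow(full))
input <- full[sample(nrow(full), size = round(p * nrow(full))), ]

## arf: fit the adversarial random forest, estimate densities, then generate data
fit     <- adversarial_rf(input)
params  <- forde(fit, input)
syn_arf <- forge(params, n_synth = n)

## synthpop: k controls the number of synthetic rows returned
syn_sp <- syn(input, k = n)$syn
```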
We compared 1) the data structure using univariate Kullback-Leibler divergence (KLD) and bivariate pairwise correlation difference (PCD), 2) analysis outcomes using median overall survival (OS), and 3) the computing time needed for the synthesis.
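The comparison metrics can be computed with simple helper functions such as those sketched below. These are our own illustrative implementations, not functions taken from a specific package, and the survival variable names (fu_days, vital_status) are assumptions.

```r
library(survival)

## Univariate KLD between the distributions of one (categorical) variable
kld <- function(x_orig, x_syn, eps = 1e-10) {
  lev <- union(unique(x_orig), unique(x_syn))
  p <- prop.table(table(factor(x_orig, levels = lev)))
  q <- prop.table(table(factor(x_syn,  levels = lev)))
  sum(p * log((p + eps) / (q + eps)))
}

## Pairwise correlation difference: Frobenius norm of the difference between
## the correlation matrices (factors coded numerically via data.matrix)
pcd <- function(d_orig, d_syn) {
  norm(cor(data.matrix(d_orig)) - cor(data.matrix(d_syn)), type = "F")
}

## Median overall survival from a Kaplan-Meier fit
median_os <- function(d) {
  fit <- survfit(Surv(fu_days, vital_status == "dead") ~ 1, data = d)
  unname(summary(fit)$table["median"])
}
```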
Results: The original dataset included 60,000 cases; the two smaller subsets contained 5,000 cases each.
KLD was largely unaffected by p and n but varied between variables. PCD varied with p and decreased with higher n when using arf, whereas n did not influence PCD when using synthpop.
The random (heterogeneous) subset showed the highest KLD and PCD values, followed by the esSCLC subset and the full dataset.
Median OS estimates from the synthetic data closely resembled those from the original dataset. Differences were more pronounced in smaller subsets, with sampling proportion p having the strongest influence.
Synthesizing with synthpop took about 10 to 50 times longer than with arf. With both packages, computation time increased with n, but only synthpop's computation time was also affected by p.
Discussion: Synthetic data generated with arf and synthpop show sufficient similarity to the original data in terms of structure and analytical outcomes, particularly in large or homogeneous datasets.
Conclusion: The results, from one of the first syntheses based on real German CR data, suggest that synthetic CR data may offer a viable solution to meet Open Science demands without compromising data protection or violating restrictions on data sharing.
Future analyses will explore whether this similarity also holds when the data contain one-to-many relationships, such as treatments, and whether it extends to multivariable analyses.
The authors declare that they have no competing interests.
The authors declare that an ethics committee vote is not required.
References
[1] Watson D, Blesch K, Kapar J, Wright M. Adversarial Random Forests for Density Estimation and Generative Modeling. In: 26th International Conference on Artificial Intelligence and Statistics (AISTATS); 2023 Apr 25-27; Valencia, Spain.
[2] Nowok B, Raab GM, Dibben C. synthpop: Bespoke Creation of Synthetic Data in R. Journal of Statistical Software. 2016;74(11):1–26. DOI: 10.18637/jss.v074.i11



