70. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V.
Transforming Oncology Data for Artificial Intelligence: Leveraging oBDS for Scalable Medical Research
2German Research Center for Artificial Intelligence (DFKI), Branch Trier, Trier, Germany
3Department of Dermatology, University Clinic Münster, Münster, Germany
4Medical Informatics Group, Berlin Institute of Health at Charité – Universitätsmedizin Berlin, Berlin, Germany
5Business Information Systems II, University of Trier, Trier, Germany
6Department of Dermatology, Medical Faculty, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
Text
Introduction: The Oncology Base Dataset (oBDS) is a standardized reporting framework used in Germany for documenting oncology cases. It plays a crucial role in ensuring data consistency and availability for cancer research and treatment planning [1]. However, leveraging oBDS data for predictive modeling and decision support is challenging: its hierarchical format is optimized for reporting and is therefore not directly usable by most artificial intelligence (AI) methods [2]. Moreover, making data from different institutions widely available remains a significant legal challenge in Germany [3]. In this study, we investigated the general potential of the oBDS for AI-driven research as part of the project 'KI-basierte Anonymisierung in der Medizin' (KI-AIM), which explores the application of anonymization and synthetization techniques in healthcare to increase data availability for AI research.
Methods: To overcome the limitations of oBDS for AI applications, we implemented an Extract, Transform, Load (ETL) process that restructures the dataset into a two-dimensional format, making it more suitable for machine learning. The ETL process uses the Last Value Carried Forward (LVCF) method to fill missing values with the most recent available data point, which improves the completeness and consistency of patient records [4]. This step is critical for medical datasets, where data collection intervals often vary across cases. Next, we evaluated the utility of the transformed dataset with a straightforward AI analysis: we trained foundational machine learning and deep learning models on the processed oBDS data to classify the current tumor stage and evaluated their performance using accuracy and F1-score.
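The flattening and LVCF steps described above can be illustrated with a minimal sketch. It is not the ETL implementation used in the study. The record layout (`patient_id`, `date`, nested `data`) and the field names are hypothetical stand-ins for oBDS messages, chosen only to show how nested reports become one row per report with gaps filled from the same patient's most recent earlier value.

```python
from typing import Any


def flatten(nested: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Flatten a nested dict into dotted column names, e.g. 'tumor.stage'."""
    flat: dict[str, Any] = {}
    for key, value in nested.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def to_table(reports: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Build one flat row per report and apply LVCF within each patient."""
    flat_reports = [(r["patient_id"], r["date"], flatten(r["data"])) for r in reports]
    columns = sorted({col for _, _, flat in flat_reports for col in flat})
    last_seen: dict[str, dict[str, Any]] = {}
    rows = []
    # Sort by patient and ISO date so "last value" means the latest earlier report.
    for pid, date, flat in sorted(flat_reports, key=lambda t: (t[0], t[1])):
        state = last_seen.setdefault(pid, {})
        # Only observed (non-missing) values overwrite the carried-forward state.
        state.update({col: val for col, val in flat.items() if val is not None})
        row = {"patient_id": pid, "date": date}
        row.update({col: state.get(col) for col in columns})  # None if never observed
        rows.append(row)
    return rows


# Hypothetical example reports (not real oBDS data).
reports = [
    {"patient_id": "P1", "date": "2023-01-10",
     "data": {"tumor": {"stage": "II"}, "therapy": {"type": None}}},
    {"patient_id": "P1", "date": "2023-03-02",
     "data": {"therapy": {"type": "surgery"}}},
    {"patient_id": "P2", "date": "2023-02-01",
     "data": {"tumor": {"stage": "I"}}},
]
table = to_table(reports)
for row in table:
    print(row)
```

In this sketch, values are carried forward only within a patient, so the second report of P1 inherits the tumor stage from the first, while P2 never inherits anything from P1. Values that were never observed stay missing.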
Results: Our analysis demonstrated that the transformed oBDS dataset effectively supports AI applications. Across all evaluated classifiers, namely Support Vector Classifier (SVC), Logistic Regression (LR), K-Nearest Neighbors (KNN), Random Forest Classifier (RFC), Decision Tree Classifier (DTC), and Multi-Layer Perceptron (MLP), we observed an average accuracy of 79.90% and an average F1-score of 79.12%. Individual models achieved accuracies between 65.25% (LR) and 93.51% (RFC) and F1-scores between 63.58% (LR) and 93.41% (RFC).
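For readers unfamiliar with the two metrics, the following sketch shows how accuracy and an F1-score can be computed for a multi-class task such as tumor stage classification. The abstract does not state how the F1-score was averaged across classes, so the macro average used here is an assumption. The labels and predictions are invented for illustration.

```python
def accuracy(y_true: list[str], y_pred: list[str]) -> float:
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true: list[str], y_pred: list[str]) -> float:
    """Unweighted mean of per-class F1 scores (macro averaging is an assumption)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Invented tumor stage labels and predictions, for illustration only.
y_true = ["I", "I", "II", "II", "III", "III"]
y_pred = ["I", "II", "II", "II", "III", "I"]

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
print(f"accuracy={acc:.4f}, macro F1={f1:.4f}")
```

Accuracy and F1 can diverge when stages are unevenly represented, which is one reason to report both.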
Conclusion: The successful transformation of oBDS data into an AI-compatible format paves the way for more advanced analyses and shows the potential of its application for oncology research. This process is especially important in the context of the Medical Informatics Initiative (MII), as it ensures that the newly established Oncology Core Dataset Module becomes usable for AI applications. As part of our ongoing project, we plan to integrate this dataset with anonymization and synthetization techniques to create a robust and privacy-preserving framework for AI-driven oncology research. Specifically, we will explore both the use of k-anonymity and the application of generative models to synthesize patient data, creating realistic and anonymous datasets for training and validating AI models. By combining these approaches, we aim to develop a sophisticated medical use case that demonstrates the potential of anonymized and synthesized data for improving cancer diagnosis, treatment, and patient outcomes. Future work will focus on evaluating the effectiveness of these techniques in a real-world setting, with the goal of translating our findings into clinical practice.
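As a brief illustration of the k-anonymity criterion mentioned above: a table is k-anonymous with respect to a set of quasi-identifiers if every combination of their values is shared by at least k records. The sketch below only measures k for a flat table. It is not the project's anonymization method, and the column names and rows are hypothetical.

```python
from collections import Counter
from typing import Any


def k_anonymity(rows: list[dict[str, Any]], quasi_identifiers: list[str]) -> int:
    """Return the size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values()) if groups else 0


# Hypothetical flattened rows (not real patient data).
rows = [
    {"age_group": "60-69", "sex": "f", "stage": "II"},
    {"age_group": "60-69", "sex": "f", "stage": "I"},
    {"age_group": "70-79", "sex": "m", "stage": "III"},
    {"age_group": "70-79", "sex": "m", "stage": "II"},
    {"age_group": "70-79", "sex": "m", "stage": "I"},
]
k = k_anonymity(rows, ["age_group", "sex"])
print(f"k = {k}")
```

A result of k = 1 would mean that at least one record is unique on the chosen quasi-identifiers and would need generalization or suppression before release.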
The authors declare that they have no competing interests.
The authors declare that an ethics committee vote is not required.
References
[1] Klinkhammer-Schalke M, Kleihues van Tol K, Jurkschat R, Meyer M, Katalinic A, Holleczek B, Braulke F, Schneider C, Franke B, Hoffmann W, Nennecke A; Mitglieder der AG Daten. Der einheitliche onkologische Basisdatensatz (oBDS). Forum. 2024;39:191–195. DOI: 10.1007/s12312-024-01320-1
[2] Reddy GT, Reddy MPK, Lakshmanna K, Kaluri R, Rajput DS, Srivastava G, Baker T. Analysis of Dimensionality Reduction Techniques on Big Data. IEEE Access. 2020;8:54776–54788. DOI: 10.1109/ACCESS.2020.2980942
[3] Eger T, Scheufen M. Data Sharing in Deutschland: Theorie, Empirie und europäische Gesetzgebung. Wirtschaftsdienst. 2024;104:725–729. DOI: 10.2478/wd-2024-0183
[4] Twisk J, de Vente W. Attrition in longitudinal studies. How to deal with missing data. J Clin Epidemiol. 2002;55:329–337. DOI: 10.1016/s0895-4356(01)00476-0



