70. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V.

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V. (GMDS)
07.-11.09.2025
Jena


Meeting Abstract

Data Dictionaries: Key to Secondary Use of Real-World Data in Oncology

Abishaa Vengadeswaran 1
Andrea Wolf 2
Sara Bachir 1
Daniel Brucker 2
Andreas Heidenreich 3
Daniel Maier 4
Timo Schneider 2
Linnea Schumann 4
Janina Wörmann 2
Dennis Kadioglu 1,3
1Goethe University Frankfurt, University Medicine, Institute of Medical Informatics (IMI), Frankfurt am Main, Germany
2Goethe University Frankfurt, University Medicine, University Cancer Center, Frankfurt am Main, Germany
3Goethe University Frankfurt, University Hospital, Data Integration Center (DIC), Frankfurt am Main, Germany
4Goethe University Frankfurt, Faculty of Medicine, Institute for Digital Medicine and Clinical Data Sciences, Frankfurt am Main, Germany

Text

Introduction: As part of the DIGital Institute for Cancer Outcomes REsearch (DigiCORE) network, the Institute of Medical Informatics, the University Cancer Center (UCT), and the Institute for Digital Medicine and Clinical Data Sciences aim to provide the Minimal Essential Description of Cancer (MEDOC) [1] for future studies using the Observational Medical Outcomes Partnership Common Data Model (OMOP CDM). Due to the heterogeneity of data and differing advantages of various data sources, i.e., the hospital information system (HIS) and the tumor documentation system, two major challenges arise: (1) identifying and processing local data for MEDOC, and (2) transferring data from MEDOC to OMOP [2].

Methods: Tumor documentation data from the UCT, along with additional HIS data from the Data Integration Center, were examined to create a data dictionary including essential information on data structure, source systems, and item interpretation [3]. An Implementation Guide for transforming MEDOC to the OMOP CDM was developed in collaboration with European partners to streamline data provision.
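
As an illustration only, one data dictionary entry could be represented as follows. This is a minimal sketch, not the structure used by the authors: the class `DictionaryEntry`, its field names, the helper `sources_for`, and the example source fields are all hypothetical and were chosen to mirror the three kinds of information named above (data structure, source system, item interpretation).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DictionaryEntry:
    """One hypothetical data dictionary row linking a local field to a MEDOC item."""
    medoc_item: str       # target MEDOC concept (name assumed for illustration)
    source_system: str    # where the value is stored locally
    source_field: str     # local field name (assumed, not from the source)
    data_type: str        # data structure information
    interpretation: str   # how the item is to be read


# Example entries; systems and field names are assumptions.
DICTIONARY = [
    DictionaryEntry(
        medoc_item="date_of_diagnosis",
        source_system="tumor documentation system",
        source_field="histology_confirmation_date",
        data_type="date",
        interpretation="Date of histological confirmation",
    ),
    DictionaryEntry(
        medoc_item="date_of_diagnosis",
        source_system="tumor documentation system",
        source_field="registration_date",
        data_type="date",
        interpretation="Date the case was registered",
    ),
]


def sources_for(item: str, dictionary: list[DictionaryEntry]) -> list[DictionaryEntry]:
    """Return every documented local source for a given MEDOC item."""
    return [entry for entry in dictionary if entry.medoc_item == item]
```

Listing several candidate sources under the same item keeps the choice between them visible to everyone who later extracts or analyzes the data.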

Results: Identifying and documenting the various challenges of primary data sources requires precise definitions of the data items that populate the MEDOC concepts. For example, for the “date of diagnosis”, either the registration date or the histological confirmation date could be used, although they differ in accuracy. Similarly, the “date-last-seen” variable, which is not explicitly documented, can be derived from multiple sources such as visit dates, discharge dates, and other claims data. A data dictionary was therefore developed so that all stakeholders (clinicians, data collectors, data scientists, and software developers) base data extraction and analysis on a common understanding of the data.
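
The derivation of an undocumented variable such as “date-last-seen” can be sketched as follows. The rule used here, taking the latest date found across all candidate sources, is an assumption for illustration and is not stated in the abstract; the function name `derive_date_last_seen` and the example values are hypothetical.

```python
from datetime import date
from typing import Iterable, Optional


def derive_date_last_seen(*sources: Iterable[Optional[date]]) -> Optional[date]:
    """Return the latest non-missing date across all given sources.

    Each source is an iterable of dates (for example visit dates or
    discharge dates); missing values are passed as None and ignored.
    Returns None when no source contains a usable date.
    """
    candidates = [d for source in sources for d in source if d is not None]
    return max(candidates) if candidates else None


# Hypothetical example values for one patient.
visit_dates = [date(2024, 1, 5), None, date(2024, 2, 20)]
discharge_dates = [date(2024, 3, 2)]
last_seen = derive_date_last_seen(visit_dates, discharge_dates)
```

Whatever rule a site actually applies, recording it in the data dictionary is what makes the derived value interpretable by others.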

Discussion: While the Implementation Guide is essential to standardize data across institutions, local data dictionaries highlight the importance of site-specific adaptations and the precise identification of relevant data from the primary systems that would otherwise go undocumented. However, to enable and facilitate in-depth understanding and, consequently, reliable research, scientists and developers must collaboratively document a comprehensive data dictionary, both for mapping local data to MEDOC and for mapping MEDOC to OMOP. A metadata repository can be used to document such data dictionaries, enabling the standardized capture, management, and maintenance of this information [4].

Conclusion: To gain a better understanding of the data for secondary use, data dictionaries must be established to prevent misunderstandings among stakeholders and to sustainably document harmonization adjustments made during data integration. Although the creation of such documentation may initially entail increased effort, the long-term advantages of transparent and sustainable documentation, particularly in the context of secondary use of data, are essential for ensuring data validity and reliability, and have the potential to save significant resources in future data-driven clinical research.

The authors declare that they have no competing interests.

The authors declare that an ethics committee vote is not required.


References

[1] Mahon P, Chatzitheofilou I, Dekker A, et al. A federated learning system for precision oncology in Europe: DigiONE. Nat Med. 2024;30:334–337. DOI: 10.1038/s41591-023-02715-8
[2] Observational Health Data Sciences and Informatics (OHDSI). The OMOP Common Data Model. OHDSI; [cited 2025 Apr 25]. Available from: https://www.ohdsi.org/data-standardization/the-common-data-model/
[3] Buchanan EM, Crain SE, Cunningham AL, et al. Getting Started Creating Data Dictionaries: How to Create a Shareable Data Set. Advances in Methods and Practices in Psychological Science. 2021;4(1). DOI: 10.1177/2515245920928007
[4] Hegselmann S, Storck M, Gessner S, et al. Pragmatic MDR: a metadata repository with bottom-up standardization of medical metadata through reuse. BMC Med Inform Decis Mak. 2021;21:160. DOI: 10.1186/s12911-021-01524-8