70. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V.

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V. (GMDS)
07.-11.09.2025
Jena


Meeting Abstract

Assessing the Quality of FHIR-Transformed Oncology Data

Vishnu Priya 1
Christian Gulden 2,3
Clara Fischer 4,5
Dorian Quell 1,6
Jasmin Ziegler 2,3,7
Ludwig Christian Hinske 1
Paul-Christian Volkmer 3,8
Iñaki Soto-Rey 1,6
1Digital Medicine, University Hospital of Augsburg, Augsburg, Germany
2Friedrich-Alexander-Universität Erlangen-Nürnberg, Institute for Medical Informatics, Biometrics and Epidemiology, Medical Informatics, Erlangen, Germany
3Bavarian Cancer Research Center (BZKF), Erlangen, Germany
4Medical Data Integration Center (MEDIZUKR), University Hospital Regensburg, Regensburg, Germany
5Bavarian Cancer Research Center (BZKF), Regensburg, Germany
6Bavarian Cancer Research Center (BZKF), Augsburg, Germany
7Medical Centre for Information and Communication Technology, Uniklinikum Erlangen, Erlangen, Germany
8Comprehensive Cancer Center Mainfranken, University Hospital Würzburg, Würzburg, Germany

Text

Introduction: The Bavarian Cancer Research Center (BZKF) Oncology Real-World Data Platform (oRWDP) integrates oncology datasets from six university hospitals into a federated research network, enabling scalable real-world evidence (RWE) generation [1]. However, integrating and transforming heterogeneous clinical datasets into a unified, standardized format via Extract-Transform-Load (ETL) processes introduces substantial data quality (DQ) challenges. Ensuring that these data are reliable and fit for use requires automated data quality assessment (DQA). This study aims to identify and quantify data quality errors before and after the ETL transformation of the XML-encoded uniform oncological basic data set (oBDS), version 2, into the harmonized FHIR-based data model of the German Cancer Consortium (DKTK) [2].

Methods: In this single-center study conducted at University Hospital Augsburg (UKA), we evaluated the data processing pipeline implemented by the BZKF [1]. The Python package FHIR-PYrate (v0.2.1) was used to query FHIR resources from a local server instance and generate flat data subsets [3]. Automated DQA was executed using Great Expectations (GX) (v1.4.2), an open-source data validation library with a configurable rule engine [4]. Because data quality dimensions and their interpretations vary, this study adopts three established core dimensions: completeness, conformance, and plausibility [5].
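The three dimensions can be illustrated with a minimal sketch in plain Python. It does not reproduce the study's Great Expectations rule set or the FHIR-PYrate output; the record layout, the field names (`birth_date`, `condition_date`, `deceased_date`, `icd10`), the simplified ICD-10 pattern, and the reference date are assumptions made for illustration only.

```python
import re
from datetime import date

# Simplified ICD-10 shape (letter, two digits, optional decimal part).
# This is an illustrative assumption, not a full ICD-10-GM validator.
ICD10_PATTERN = re.compile(r"^[A-Z]\d{2}(\.\d{1,2})?$")

# Hypothetical required fields of one flattened record.
REQUIRED_FIELDS = ("birth_date", "condition_date", "icd10")


def check_record(rec, today=date(2025, 9, 1)):
    """Return a list of (dimension, message) findings for one flat record."""
    issues = []

    # Completeness: required fields must be present.
    for field in REQUIRED_FIELDS:
        if rec.get(field) is None:
            issues.append(("completeness", f"missing {field}"))

    # Conformance: the diagnosis code must match the expected format.
    code = rec.get("icd10")
    if code is not None and not ICD10_PATTERN.match(code):
        issues.append(("conformance", f"malformed ICD-10 code {code!r}"))

    # Plausibility: dates must follow a sensible temporal order.
    birth = rec.get("birth_date")
    cond = rec.get("condition_date")
    dead = rec.get("deceased_date")
    if birth and cond and cond < birth:
        issues.append(("plausibility", "condition date before birth date"))
    if cond and dead and dead < cond:
        issues.append(("plausibility", "deceased date before condition date"))
    for name, value in (("birth_date", birth), ("condition_date", cond),
                        ("deceased_date", dead)):
        if value and value > today:
            issues.append(("plausibility", f"{name} lies in the future"))

    return issues


if __name__ == "__main__":
    record = {
        "birth_date": date(1950, 3, 2),
        "condition_date": date(1940, 1, 1),  # earlier than birth: implausible
        "deceased_date": None,
        "icd10": "C50.9",
    }
    for dimension, message in check_record(record):
        print(dimension, "-", message)
```

Running the same checks on the source extract and on the FHIR-derived flat table makes it possible to see which findings already exist in the primary data and which appear only after transformation.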

Results: Analysis of oncology records (the majority dated 2018–2023) revealed transformation-related errors when comparing the oBDS XML source to the FHIR-transformed data, particularly affecting completeness and logical plausibility. Key findings include temporal plausibility issues, such as outliers in 'birth date', 'deceased date', and 'condition date', which commonly occurred in the primary dataset and persisted in the oBDS-to-FHIR transformed records when cross-referenced. Sequential data analysis revealed specific point anomalies, including inconsistent or implausible date entries. We identified a 2% completeness deviation in the overall count of diagnoses, classified by ICD-10 code, which indicates potential data loss during transformation.
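A count-based completeness comparison of this kind can be sketched as follows. The function name and the input lists are hypothetical, and the numbers are toy data chosen only to reproduce a 2% relative deviation; they are not the study's figures.

```python
from collections import Counter


def count_deviation(source_codes, fhir_codes):
    """Compare diagnosis counts before and after transformation.

    Returns the overall relative deviation (share of source diagnoses that
    are missing in the target) and the per-code count differences.
    """
    src = Counter(source_codes)
    tgt = Counter(fhir_codes)
    total_src = sum(src.values())
    total_tgt = sum(tgt.values())
    overall = (total_src - total_tgt) / total_src if total_src else 0.0
    # Keep only the codes whose counts differ between source and target.
    per_code = {code: src[code] - tgt[code]
                for code in src if src[code] != tgt[code]}
    return overall, per_code


if __name__ == "__main__":
    # Toy data: 50 source diagnoses, one of which is lost in the target.
    source = ["C50.9"] * 30 + ["C34.1"] * 20
    target = ["C50.9"] * 30 + ["C34.1"] * 19
    overall, per_code = count_deviation(source, target)
    print(f"overall deviation: {overall:.1%}")
    print("per-code differences:", per_code)
```

Breaking the deviation down per code helps to locate which diagnosis groups are affected rather than reporting only a single aggregate figure.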

Discussion: While existing FHIR validators are effective in assessing syntactic and structural constraints of individual resources, they do not assess quality across multiple interrelated profiles. Our evaluation therefore focused on identifying plausibility and logical errors that arise within our ETL pipeline. During this process, we also identified specific ETL-processing issues. For instance, in oBDS version 2, unknown day and month fields (e.g., 00.00.2025) were substituted with mid-year default values to make downstream analytical comparisons possible. However, this imputation approach led to inaccuracies when comparing temporal relationships across resources. Furthermore, the 2% deviation was partially caused by patients who were present in the oBDS only as part of tumor conference visits, as such cases were previously not transformed to FHIR resources. The new version of the ETL job also includes tumor conference cases, which should address this deviation.
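The effect of the date imputation can be made concrete with a small sketch. The exact default values used in the pipeline are not stated here, so the choices below (1 July for an unknown month, the 15th for an unknown day) are assumptions, as are the function names `parse_obds_date` and `is_before`. The sketch keeps the original precision next to the imputed date so that a comparison can be reported as undecidable instead of producing a spurious plausibility error.

```python
from datetime import date

# Order of date precisions from coarsest to finest.
_RANK = {"year": 0, "month": 1, "day": 2}


def parse_obds_date(text):
    """Parse 'dd.mm.yyyy' where '00' marks an unknown day or month.

    Returns (imputed_date, precision). The imputed defaults are assumed
    mid-period values, not the documented behavior of the ETL job.
    """
    day_s, month_s, year_s = text.split(".")
    day, month, year = int(day_s), int(month_s), int(year_s)
    if month == 0:
        return date(year, 7, 1), "year"
    if day == 0:
        return date(year, month, 15), "month"
    return date(year, month, day), "day"


def is_before(a, b):
    """Precision-aware 'a strictly before b' for (date, precision) pairs.

    Returns True or False when the order is decidable at the coarser of
    the two precisions, and None when it is not.
    """
    (da, pa), (db, pb) = a, b
    coarsest = min(pa, pb, key=_RANK.get)
    if coarsest == "year":
        ka, kb = (da.year,), (db.year,)
    elif coarsest == "month":
        ka, kb = (da.year, da.month), (db.year, db.month)
    else:
        ka, kb = (da.year, da.month, da.day), (db.year, db.month, db.day)
    if ka == kb and coarsest != "day":
        return None  # same year (or month): true order is unknown
    return ka < kb


if __name__ == "__main__":
    diagnosis = parse_obds_date("00.00.2025")  # imputed to mid-year
    death = parse_obds_date("12.03.2025")
    # A naive comparison of imputed dates flags death before diagnosis.
    print("naive check flags an error:", death[0] < diagnosis[0])
    # The precision-aware comparison reports the order as undecidable.
    print("precision-aware result:", is_before(death, diagnosis))
```

In this example the naive comparison reports a death before the diagnosis only because of the imputed default, whereas the precision-aware comparison returns `None` and leaves the case for manual or rule-based follow-up.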

Conclusion: The DQ reports not only support verification of ETL correctness but also help to independently assure the quality of real-world data (RWD). Moving forward, we plan to extend these quality tests to multiple sites, integrate our proof of concept into the BZKF pipeline, and later make it publicly available for broader adoption.

The authors declare that they have no competing interests.

The authors declare that an ethics committee vote is not required.


References

[1] Ziegler J, Erpenbeck MP, Fuchs T, Saibold A, Volkmer P-C, Schmidt G, et al. Bridging data silos in oncology with modular software for federated analysis on Fast Healthcare Interoperability Resources: multisite implementation study. J Med Internet Res. 2025;27:e65681.
[2] Lambarki M, Kern J, Croft D, Engels C, Deppenwiese N, Kerscher A, et al. Oncology on FHIR: a data model for distributed cancer research. Stud Health Technol Inform. 2021;278:203–10.
[3] Hosch R, Baldini G, Parmar V, Borys K, Koitka S, Engelke M, et al. FHIR-PYrate: a data science friendly Python package to query FHIR servers. BMC Health Serv Res. 2023;23(1):734.
[4] great-expectations/great_expectations: Always know what to expect from your data. GitHub; 2025 [cited 2025 Apr 25]. Available from: https://github.com/great-expectations/great_expectations
[5] Kahn MG, Callahan TJ, Barnard J, Bauck AE, Brown J, Davidson BN, et al. A harmonized data quality assessment terminology and framework for the secondary use of electronic health record data. EGEMS (Wash DC). 2016;4(1):1244.