70. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V.

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V. (GMDS)
7-11 September 2025
Jena


Meeting Abstract

Enabling Reproducible Data Quality Assessments via Reusable Metadata: Integrating dataquieR with the MDM Portal

Stephan Struckmann 1
Jörg Henke 1
Celina Seelinger 2
Max Blumenstock 2
Martin Dugas 2
Elena Salogni 1
Elisa Kasbohm 1
Carsten Oliver Schmidt 1
1 Universitätsmedizin Greifswald, Greifswald, Germany
2 Universitätsklinikum Heidelberg, Heidelberg, Germany

Text

Introduction: Data quality can be operationalised as the degree to which observed data conform to requirements derived from the study design or the context of use. As such, high-quality data are essential for obtaining valid research results. Several frameworks and tools are available to guide and conduct data quality assessments [1], [2]. However, conducting comprehensive data quality assessments efficiently relies heavily on the availability of metadata that describe the requirements, such as expected value ranges, admissible codes, and distributional assumptions. Creating and maintaining such metadata can be a labour- and time-intensive process, particularly in data collections with a large number of variables.

State of the art: Metadata defining data expectations are typically implemented in custom code or local documentation, often in non-interoperable formats. This approach hampers transparency and reuse, and limits the comparability of data quality reports. Formats for handling data quality-related information in machine-readable form have been proposed [3], but lack interoperability with existing standards. Conversely, common data models such as the Operational Data Model (ODM) have not been extended to support rich, data quality-specific metadata. Therefore, while repositories such as the ODM-based Medical Data Models (MDM) Portal [4] are used to find and create electronic case report forms, metadata relevant to data quality assessments are not available.

Concept: To address this limitation, we integrated the metadata approach used by the R package dataquieR [3] with the MDM Portal. We extended the ODM format using custom alias tags to embed item-level metadata relevant to data quality assessments. This extension allows dataquieR-formatted metadata to be imported into and stored in the MDM Portal, so that the portal can not only display and generate medical forms but also directly provide the metadata required for conducting data quality assessments.
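The idea of carrying item-level data quality metadata in ODM alias tags can be sketched as follows. This is a minimal illustration, not the actual MDM Portal or dataquieR implementation: the ODM 1.3 Alias element with its Context and Name attributes is standard, but the "dataquieR:" Context prefix, the example item, and the function name extract_dq_metadata are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Namespace of CDISC ODM 1.3 documents.
ODM_NS = {"odm": "http://www.cdisc.org/ns/odm/v1.3"}

# Hypothetical ODM fragment. The "dataquieR:" Context prefix is an assumed
# naming scheme for this sketch, not necessarily the one used by the portal.
ODM_XML = """<ODM xmlns="http://www.cdisc.org/ns/odm/v1.3">
  <Study OID="S.1">
    <MetaDataVersion OID="MDV.1" Name="demo">
      <ItemDef OID="I.SBP" Name="SBP" DataType="integer">
        <Alias Context="dataquieR:HARD_LIMITS" Name="[0;300]"/>
        <Alias Context="dataquieR:MISSING_LIST" Name="99980 | 99990"/>
        <Alias Context="UMLS CUI [1,1]" Name="C0871470"/>
      </ItemDef>
    </MetaDataVersion>
  </Study>
</ODM>"""


def extract_dq_metadata(odm_xml, prefix="dataquieR:"):
    """Collect alias-encoded data quality attributes per ItemDef.

    Returns a dict mapping each item name to a dict of attribute name
    (the Context without the prefix) and attribute value (the alias Name).
    Aliases with other contexts, e.g. semantic codes, are ignored.
    """
    root = ET.fromstring(odm_xml)
    result = {}
    for item in root.iterfind(".//odm:ItemDef", ODM_NS):
        attrs = {}
        for alias in item.findall("odm:Alias", ODM_NS):
            context = alias.get("Context", "")
            if context.startswith(prefix):
                attrs[context[len(prefix):]] = alias.get("Name")
        result[item.get("Name")] = attrs
    return result


metadata = extract_dq_metadata(ODM_XML)
```

Because the expectations are stored as ordinary alias elements, ODM consumers that do not know the prefix can ignore them, while the file remains valid ODM.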

Implementation: The workflow is illustrated using the Study of Health in Pomerania (SHIP), for which the full metadata of the SHIP-START-4 data collection, including data quality-relevant information, have been integrated into the MDM Portal. The enriched ODM files can be exported directly into Excel-based metadata control files for use with dataquieR. Quality assessments can then be performed locally in R, via containerised environments (e.g. Docker), or through online platforms powered by ShinyProxy.
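The export and assessment steps can be illustrated with a small sketch. It is a simplified stand-in under stated assumptions: the real workflow produces Excel workbooks and runs the checks in dataquieR, whereas this example writes CSV text and performs a single range check in Python. The function names, the example variables, and the interval notation handled here (for example "[0;300]" or "[18;Inf)") are illustrative assumptions.

```python
import csv
import io


def to_item_level_csv(metadata):
    """Flatten {variable: {attribute: value}} into a one-row-per-variable table.

    CSV is used here as a stand-in for the Excel-based control files.
    """
    attribute_names = sorted({key for attrs in metadata.values() for key in attrs})
    columns = ["VAR_NAMES"] + attribute_names
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, restval="",
                            lineterminator="\n")
    writer.writeheader()
    for var_name, attrs in metadata.items():
        writer.writerow({"VAR_NAMES": var_name, **attrs})
    return buffer.getvalue()


def parse_limits(interval):
    """Parse an interval such as '[0;300]' or '(0;Inf)'.

    Square brackets mark inclusive bounds, round brackets exclusive ones.
    """
    lower_closed = interval[0] == "["
    upper_closed = interval[-1] == "]"
    low, high = (float(part) for part in interval[1:-1].split(";"))
    return low, high, lower_closed, upper_closed


def limit_violations(values, interval):
    """Return the values that fall outside the admissible interval."""
    low, high, lower_closed, upper_closed = parse_limits(interval)

    def within(value):
        above = value >= low if lower_closed else value > low
        below = value <= high if upper_closed else value < high
        return above and below

    return [value for value in values if not within(value)]


# Hypothetical metadata and measurements for demonstration only.
example_metadata = {
    "SBP": {"HARD_LIMITS": "[0;300]"},
    "AGE": {"HARD_LIMITS": "[18;Inf)", "LABEL": "Age"},
}
table = to_item_level_csv(example_metadata)
flagged = limit_violations([120, -5, 301, 300], "[0;300]")
```

Keeping the expectations in a table separate from the checking code is what allows the same metadata to be reused across reports and studies.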

Lessons learned: This workflow reduces redundant work and supports harmonised quality reporting by making the underlying assumptions transparent in an interoperable format, consistent with the FAIR data principles [5]. Potential reuse contexts include, amongst others, secondary analyses of the original study data, metadata transfer to comparable variables from other studies for monitoring purposes, and the harmonisation of data across studies. The outlined approach is generic and scalable to other studies. By integrating data quality-related metadata in central repositories, the MDM Portal and potentially other infrastructures, such as the Health Study Hub by NFDI4Health, may contribute to improved scientific reproducibility.

This work was supported by the German Research Foundation (DFG: SCHM 2744/3-4, NFDI4Health Consortium, DFG project number 442326535, DFG: DU 352/14-4).

The authors declare that they have no competing interests.

The authors declare that an ethics committee vote is not required.


References

[1] Mariño J, Kasbohm E, Struckmann S, Kapsner LA, Schmidt CO. R Packages for Data Quality Assessments and Data Monitoring: A Software Scoping Review with Recommendations for Future Developments. Applied Sciences. 2022;12(9). DOI: 10.3390/app12094238
[2] Schwabe D, Becker K, Seyferth M, Klass A, Schaeffter T. The METRIC-framework for assessing data quality for trustworthy AI in medicine: a systematic review. NPJ Digit Med. 2024;7(1):203. DOI: 10.1038/s41746-024-01196-4
[3] Struckmann S, Mariño J, Kasbohm E, Salogni E, Schmidt CO. dataquieR 2: An updated R package for FAIR data quality assessments in observational studies and electronic health record data. Journal of Open Source Software. 2024;9(98):6581. DOI: 10.21105/joss.06581
[4] Reichenpfader D, Glauser R, Dugas M, Denecke K. Assessing and Improving the Usability of the Medical Data Models Portal. Stud Health Technol Inform. 2020;271:199-206. DOI: 10.3233/SHTI200097
[5] Wilkinson MD, Dumontier M, Aalbersberg IJ, et al. The FAIR guiding principles for scientific data management and stewardship. Sci Data. 2016;3:160018. DOI: 10.1038/sdata.2016.18