70. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V.

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V. (GMDS)
07.-11.09.2025
Jena


Meeting Abstract

Validated Data Migration for Large-Scale Pulmonary Hypertension Registry: From Access to REDCap

Jeeva Sam 1,2
Meike T. Funderich 1,2
Achim Michel-Backofen 3
Romina Blasini 3
Patrick Janetzko 1,2
Khodr Tello 1,2
Werner Seeger 1,2
Raphael W. Majeed 1,2
1 Department of Internal Medicine, Universities of Giessen and Marburg Lung Center (UGMLC), Member of the German Center for Lung Research (DZL), Gießen, Germany
2 Institute for Lung Health (ILH), Cardio-Pulmonary Institute (CPI), Gießen, Germany
3 Institute of Medical Informatics, Justus-Liebig-University, Gießen, Germany

Text

Introduction: Pulmonary hypertension (PH) is a complex disease for which both research and clinical decision-making depend on high-quality data [1]. The Gießen PH registry, one of Germany’s largest, comprises over 5,000 patients and more than 600 variables spanning two decades. It was initially managed in Microsoft Access, which was limited to single-user access, hindered scalability, and introduced data quality issues such as duplicated records, missing values, and inconsistent formatting. The lack of interoperability also prevented integration with external clinical data sources.

To overcome these limitations, we migrated the registry to REDCap, a scalable Electronic Data Capture (EDC) system that supports multi-user access, integration with clinical systems via APIs, and a standardized structure to reduce inconsistencies, as demonstrated in previous implementations [2]. The transition removes the Access-related limitations described above and makes the PH registry research-ready.

Methods: To ensure a scalable, reproducible, and error-resilient data migration, we implemented a structured Extract, Transform, Load (ETL) pipeline, drawing on strategies described by Dunn et al. [2].

Data were extracted from Microsoft Access using Open Database Connectivity (ODBC), with identifiers standardized to maintain referential integrity. The data were processed in-memory and enriched with clinical data, including lung function tests, Right Heart Catheter (RHC) results, and laboratory reports. Schema mapping aligned the data with REDCap’s structure for consistent integration. The final stage uploaded processed data into REDCap using an idempotent strategy.

This ensured full record replacement without duplication, supported by a deletion-and-replacement workflow that preserved data integrity. The entire process was automated via REDCap’s API.
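The deletion-and-replacement idea can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the `post` callable, and the `record_id` field name are assumptions made for illustration. The form fields follow REDCap's documented "Delete Records" and "Import Records" API methods. The HTTP call is injected so the logic can be exercised without a server.

```python
import json
from typing import Callable, Dict, List

# Hypothetical signature: post(payload) sends a form-encoded POST to the
# project's REDCap API endpoint and returns the response body as text.
Post = Callable[[Dict[str, str]], str]


def delete_payload(token: str, record_id: str) -> Dict[str, str]:
    """Form fields for REDCap's 'Delete Records' API method."""
    return {
        "token": token,
        "content": "record",
        "action": "delete",
        "records[0]": record_id,
    }


def import_payload(token: str, rows: List[dict]) -> Dict[str, str]:
    """Form fields for REDCap's 'Import Records' API method (flat JSON)."""
    return {
        "token": token,
        "content": "record",
        "format": "json",
        "type": "flat",
        "overwriteBehavior": "overwrite",
        "data": json.dumps(rows),
    }


def replace_record(post: Post, token: str, record_id: str, rows: List[dict]) -> None:
    """Delete one record, then re-import all of its rows.

    Re-running with the same input issues the same two requests and leaves
    the project in the same state, which is what makes the upload idempotent.
    Assumes the project's record identifier field is named 'record_id'.
    """
    if any(row.get("record_id") != record_id for row in rows):
        raise ValueError("all rows must belong to the record being replaced")
    post(delete_payload(token, record_id))
    post(import_payload(token, rows))
```

In practice `post` would wrap an HTTP client call against the project's API URL. It would also need to tolerate the error REDCap may return when asked to delete a record that does not exist yet, as happens on a first run.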

To assess data quality post-migration, we developed a validation framework using Jupyter Notebooks and statistical analysis, incorporating practices like those described by Fredericks-Younger et al. [3]. This included automated anomaly detection and consistency checks. Discrepancies in mappings and classifications were identified and visualized using statistical profiling and exploratory analysis tools.
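The kind of consistency check such a framework might contain can be sketched as follows. This is an illustrative example only, not the validation notebooks used in the study. The mapping table `NYHA_CODES`, the function names, and the specific date rules are assumptions. The three checks mirror the issue types reported below: unmapped categorical values, duplicate visit dates, and implausible dates.

```python
from collections import Counter
from datetime import date
from typing import Dict, Iterable, List, Optional, Set, Tuple

# Hypothetical mapping from legacy Access labels to REDCap choice codes.
NYHA_CODES: Dict[str, str] = {"I": "1", "II": "2", "III": "3", "IV": "4"}


def unmapped_values(values: Iterable[Optional[str]], mapping: Dict[str, str]) -> Counter:
    """Count non-empty source values that have no target code.

    Values reported here would otherwise be dropped silently on upload.
    """
    return Counter(v for v in values if v and v not in mapping)


def duplicate_visit_dates(visits: Iterable[Tuple[str, date]]) -> Set[str]:
    """Return patient IDs that have the same visit date more than once."""
    counts = Counter(visits)
    return {patient for (patient, _), n in counts.items() if n > 1}


def invalid_dates(birth: date, visit: date, today: date) -> List[str]:
    """List plausibility problems for one birthdate/visit-date pair."""
    problems = []
    if visit < birth:
        problems.append("visit before birth")
    if visit > today:
        problems.append("visit in the future")
    if birth > today:
        problems.append("birthdate in the future")
    return problems
```

Each function returns findings rather than correcting anything, so that flagged records can be passed on for manual review.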

Results: During the migration of 5,000 patient records to REDCap, the ETL pipeline enabled efficient, consistent data processing with minimal manual input, ensuring clean, duplicate-free uploads and preserved identifier integrity. Post-migration validation revealed several data quality issues. Approximately 18.75% of categorical values were incorrectly mapped, leading to missing entries. Duplicate visit dates affected 48 patients and caused record misalignment.

Around 3% of classification values, including New York Heart Association (NYHA) scores, were lost due to schema mismatches and required manual review. The validation framework also identified 100 patients with pre-existing errors in the source data, including invalid visit dates and birthdates. These issues were reviewed and corrected by study nurses, resulting in improved registry consistency and readiness for clinical analysis.

Conclusion: The migration to REDCap improved scalability, data integrity, and clinical data integration using an automated ETL pipeline. Although the pipeline addressed many issues, inconsistencies in the original dataset remained and could impact PH analysis. Manual validation was necessary in these cases, and full automation remains unlikely due to the complexity of older clinical records.

While REDCap provides structured storage, real-time synchronization with clinical systems was not implemented in this project. Future work should focus on automated, real-time pipelines and better error monitoring to support high-quality PH research [4].

The authors declare that they have no competing interests.

The authors declare that an ethics committee vote is not required.


References

[1] Rhodes CJ, Sweatt AJ, Maron BA. Harnessing Big Data to Advance Treatment and Understanding of Pulmonary Hypertension. Circ Res. 2022 Apr 29;130(9):1423-1444. DOI: 10.1161/CIRCRESAHA.121.319969
[2] Dunn WD, Cobb J, Levey AI, Gutman DA. REDLetr: Workflow and tools to support the migration of legacy clinical data capture systems to REDCap. Int J Med Inform. 2016;93:103-110. DOI: 10.1016/j.ijmedinf.2016.06.015
[3] Fredericks-Younger J, Greenberg P, Andrews T, et al. Leveraging the functionality of Research Electronic Data Capture (REDCap) to enhance data collection and quality in the Opioid Analgesic Reduction Study. Clin Trials. 2024;21(3):381-389. DOI: 10.1177/17407745231212190
[4] Amin W, Kanakasabai S, Grout R, Butler J, Michael S, Schleyer T. Real-time data synchronization: Assessing the implementation of REDCap CDIS (Clinical Data Interoperability Service) for EHR systems. J Clin Transl Sci. 2025;9(s1):107. DOI: 10.1017/cts.2024.975