
70. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V.

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V. (GMDS)
07.-11.09.2025
Jena


Meeting Abstract

Privacy-preserving federated analysis and harmonisation of heterogeneous datasets in NFDI4Health

Sofia Maria Siampani 1
Florian Schwarz 2
Franziska Jannasch 2
Tracy Bonsu Osei 1
Ines Perrar 3
Matthias B. Schulze 2,4
Ute Nöthlings 3
Katharina Nimptsch 1
Tobias Pischon 1,5,6
1 Max Delbrück Center for Molecular Medicine in the Helmholtz Association (MDC), Molecular Epidemiology Research Group, Berlin, Germany
2 Department of Molecular Epidemiology, German Institute of Human Nutrition Potsdam-Rehbruecke, Nuthetal, Germany
3 Department of Nutrition and Food Sciences, Nutritional Epidemiology, University of Bonn, Bonn, Germany
4 Institute of Nutritional Science, University of Potsdam, Nuthetal, Germany
5 Charité – Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin, Humboldt-Universität zu Berlin, Berlin Institute of Health, Berlin, Germany
6 Biobank Technology Platform, Max Delbrück Center for Molecular Medicine in the Helmholtz Association (MDC), Berlin, Germany

Text

Introduction: In Germany, cohort studies produce phenotypically rich but heterogeneous datasets that capture various aspects of the population's health and lifestyle. Combining and jointly analysing these datasets can increase statistical power, broaden generalisability, and allow the examination of subgroups and minorities. However, data protection and governance regulations, as well as heterogeneity in variables across cohorts, present significant challenges. NFDI4Health, the national research data infrastructure for personal health data, addresses these challenges by implementing DataSHIELD [1], a federated analysis framework, combined with a central access point. This provides a sustainable route to secure, privacy-preserving analyses without transferring or sharing individual-level data. To make datasets compatible for joint analysis, a data harmonisation concept has been developed using Rmonize [2], an R package by Maelstrom Research that streamlines the process in a semi-automatic way. The presentation will cover the concept of federated analysis and harmonisation, along with its technical implementation within NFDI4Health.

Methods: We implemented DataSHIELD's client-server architecture at participating data holding organisations (DHOs), where the individual-level data of cohort studies remain securely stored on Opal servers. Analysts work in a central analysis environment hosted at the Max Delbrück Center and connected to the DHOs; the commands they issue are executed locally on each connected server. The framework applies automated disclosure controls so that only non-disclosive summary statistics are returned. In addition, a data harmonisation workflow has been set up using Rmonize. This process involves collecting metadata, defining a target DataSchema, and developing harmonisation algorithms for each study-specific variable. DHOs execute these algorithms locally without exposing individual-level data. Researchers receive summary reports with descriptive statistics to validate the outcomes. The harmonised datasets are then used in DataSHIELD.
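The federated principle described above can be illustrated with a minimal Python sketch. It is not DataSHIELD code and does not use the DataSHIELD or Opal APIs: the names Node, Summary, pooled_mean and the threshold MIN_SUBSET_SIZE are hypothetical and chosen only for illustration. Each node returns an aggregate (count and sum) instead of individual values, and refuses to answer when too few participants are involved.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional

# Hypothetical disclosure threshold for this sketch (not a DataSHIELD default).
MIN_SUBSET_SIZE = 5


@dataclass
class Summary:
    """Aggregate returned by a node: no individual-level values."""
    n: int
    total: float


class Node:
    """Stands in for one data holding organisation (illustrative only)."""

    def __init__(self, name: str, values: Iterable[float]):
        self.name = name
        self._values: List[float] = list(values)  # never leaves the node

    def summarise(self) -> Optional[Summary]:
        """Return count and sum, or None if the result could be disclosive."""
        n = len(self._values)
        if n < MIN_SUBSET_SIZE:
            return None
        return Summary(n=n, total=sum(self._values))


def pooled_mean(nodes: Iterable[Node]) -> float:
    """Combine node-level aggregates into one mean on the analyst side."""
    summaries = [s for s in (node.summarise() for node in nodes) if s is not None]
    if not summaries:
        raise ValueError("no node returned a non-disclosive summary")
    return sum(s.total for s in summaries) / sum(s.n for s in summaries)
```

In this sketch the analyst only ever sees Summary objects; a node holding fewer than MIN_SUBSET_SIZE records contributes nothing rather than revealing a small group.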

Results: We have successfully implemented and expanded the DataSHIELD network across Germany, enabling DHOs to host data within the federated environment. Currently, nine nodes are active, covering eleven studies. To extend DataSHIELD functionality, we developed R packages such as dsClusterAnalysis and dsSupportClient.

In parallel, we implemented the harmonisation workflow to standardise epidemiological variables (e.g., anthropometric measures, dietary patterns and chronic disease data) across cohort studies. So far, we have assessed the harmonisation potential of 215 target variables across five studies. Complete or partial harmonisation was possible for 52% of the target variables, enabling their reuse in federated analyses. Challenges included managing variable granularity, defining plausible value ranges for quality checks, and ensuring robust testing to minimise the workload for DHOs.
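As an illustration of what a harmonisation algorithm with a plausibility check might look like, the following Python sketch maps a study-specific height variable to a target variable in centimetres and produces the kind of count-based summary a researcher could use for validation. It does not use Rmonize and does not reproduce the project's actual rules: the function names, the unit table TO_CM and the plausible range PLAUSIBLE_CM are assumptions made for this example.

```python
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical conversion factors and plausible range for the target variable.
TO_CM = {"cm": 1.0, "m": 100.0}
PLAUSIBLE_CM = (100.0, 250.0)


def harmonise_height(value: Optional[float], unit: str) -> Tuple[Optional[float], str]:
    """Convert one study-specific height to cm and flag its status."""
    if value is None:
        return None, "missing"
    if unit not in TO_CM:
        return None, "unknown_unit"
    cm = value * TO_CM[unit]
    low, high = PLAUSIBLE_CM
    if not low <= cm <= high:
        return None, "implausible"
    return cm, "ok"


def summary_report(records: Iterable[Tuple[Optional[float], str]]) -> Dict[str, int]:
    """Count harmonisation outcomes; contains no individual-level values."""
    counts: Dict[str, int] = {}
    for value, unit in records:
        _, status = harmonise_height(value, unit)
        counts[status] = counts.get(status, 0) + 1
    return counts
```

Only the status counts leave the function that builds the report, which mirrors the idea that researchers validate harmonisation through summaries rather than through the underlying records.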

Conclusion: Combining DataSHIELD for federated analysis with a central analysis environment and a semi-automatic harmonisation concept provides a sustainable, privacy-preserving solution for collaborative health research that allows heterogeneous datasets to be analysed jointly. By ensuring data compatibility and secure analysis workflows, this approach tackles key challenges in multi-study projects. Future work will focus on expanding the DataSHIELD network and the harmonisation service within NFDI4Health.

Acknowledgements: This work was done as part of the NFDI4Health Consortium (https://www.nfdi4health.de/). We gratefully acknowledge the financial support of the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – project number 442326535.

The authors declare that they have no competing interests.

The authors declare that an ethics committee vote is not required.


References

[1] Gaye A, Marcon Y, Isaeva J, LaFlamme P, Turner A, Jones EM, et al. DataSHIELD: Taking the analysis to the data, not the data to the analysis. Int J Epidemiol. 2014;43(6):1929–44.
[2] Rmonize Package Documentation. Rmonize: A package for harmonizing epidemiological datasets in R. 2023 [cited 2025 Apr 9]. Available from: https://cran.r-project.org/package=Rmonize