70. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V.

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V. (GMDS)
07.-11.09.2025
Jena


Meeting Abstract

A Secure Interactive and High Performance Processing Environment for Collaborative Machine Learning Tasks on Large Data

Philip Zaschke 1
Jendrik Richter 1
Dagmar Krefting 1
1 Institut für medizinische Informatik, Universitätsmedizin Göttingen, Georg August Universität Göttingen, Göttingen, Germany

Text

Introduction: Research needs data. However, its potential is not fully exploited: large amounts of data generated in the clinical environment remain unused. The Medical Informatics Initiative (MII) provides the governance structures for data access requests as well as the technical infrastructure for providing data via the Data Sharing Framework [1], [2]. Once data has been provided to researchers, it is up to the user to ensure an appropriate processing environment.

In the multi-center project Somnolink, large sleep datasets are planned to be used collaboratively for the prediction of obstructive sleep apnea phenotypes, therapy options and compliance. A secure yet collaboratively usable data processing environment must be ensured, as we plan to train artificial intelligence models on data from patients who gave broad consent and on research data that may be used without consent under the German Health Data Use Act (Gesundheitsdatennutzungsgesetz).

State of the art: Usually, for machine learning tasks, the analysis pipeline can be defined as the steps (a) data extraction, (b) exploration/curation, (c) preparation of training and (d) training. Step (b) is usually performed interactively (e.g. with Jupyter Notebooks), while step (d) can be handled automatically in large computing environments, for example on a high-performance computing cluster (HPC).

Previously, we integrated HPC into the biomedical research data management system XNAT (Extensible Neuroimaging Archive Toolkit) [3], [4]. This allowed batch processing of XNAT projects in machine learning tasks on a shared partition, but did not enable interactive and secure data analysis.

Therefore, we set up a collaborative processing environment that combines a research data management system with interactive data exploration/curation and secure HPC analysis.

Concept: We extended the previous XNAT-HPC environment by (i) a Jupyter server in the same network segment of the cloud of the university computing center (GWDG) and (ii) a secure HPC protocol [5], also provided by GWDG. The required components, an adapted containerized Jupyter image and an XNAT plugin, are provided by the XNAT community. In XNAT, data can be protected by the access control feature and can be selected to spawn a Jupyter notebook with Python data access. For training, we combined our XNAT-HPC pipeline with the encrypted processing approach of secure HPC.
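As an illustration of Python data access from a notebook, the following minimal sketch lists the subjects of an XNAT project through XNAT's REST interface. The host name, the project ID and the way credentials are attached are hypothetical placeholders. The abstract does not state how the notebook authenticates or which client library is used, so only the Python standard library is assumed here.

```python
import json
import urllib.parse
import urllib.request


def subjects_url(base_url, project_id):
    """Build the REST URL that lists the subjects of one XNAT project."""
    project = urllib.parse.quote(project_id, safe="")
    return f"{base_url.rstrip('/')}/data/projects/{project}/subjects?format=json"


def subject_labels(payload):
    """Extract subject labels from an XNAT JSON listing (ResultSet -> Result)."""
    rows = payload.get("ResultSet", {}).get("Result", [])
    return [row["label"] for row in rows if "label" in row]


def fetch_subject_labels(base_url, project_id, opener=None):
    """Query XNAT and return the subject labels visible to the current user.

    `opener` is assumed to carry the user's session or credentials
    (hypothetical; adapt to the authentication used in your deployment).
    XNAT's access control decides which projects and subjects are returned.
    """
    opener = opener or urllib.request.build_opener()
    with opener.open(subjects_url(base_url, project_id), timeout=30) as resp:
        return subject_labels(json.load(resp))
```

A call such as `fetch_subject_labels("https://xnat.example.org", "SOMNO_DEMO")` would then return only those subjects the logged-in user is permitted to read; both arguments are made-up values.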

Implementation: We implemented the workflow and evaluated it with an exemplary XNAT project filled with synthesized data. We accessed the files through the Jupyter Notebook and successfully transferred them into the secure HPC pipeline for automatic encryption and secure processing on the HPC system.
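The hand-over from the notebook to the secure HPC pipeline can be pictured as a staging step: selected files are bundled in temporary storage and fingerprinted before the pipeline encrypts them. The sketch below illustrates only that bundling step under illustrative assumptions. The function names, the tar format and the SHA-256 digest are hypothetical choices, and the encryption and transfer themselves are performed by secure HPC [5] and are represented here only by a caller-supplied hook.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path


def bundle_files(files, bundle_path):
    """Pack files into an uncompressed tar archive and return its SHA-256 digest."""
    bundle_path = Path(bundle_path)
    with tarfile.open(bundle_path, "w") as archive:
        for f in sorted(Path(p) for p in files):
            # Store only the file name so local directory layout is not leaked.
            archive.add(f, arcname=f.name)
    digest = hashlib.sha256()
    with bundle_path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def stage_for_secure_processing(files, encrypt_and_submit):
    """Stage a bundle in temporary storage and pass it to a hook.

    `encrypt_and_submit(bundle_path, sha256)` is a hypothetical placeholder
    for the secure HPC step; it is not part of any real API. The temporary
    directory, including the unencrypted bundle, is deleted afterwards.
    """
    with tempfile.TemporaryDirectory() as tmp:
        bundle = Path(tmp) / "bundle.tar"
        checksum = bundle_files(files, bundle)
        return encrypt_and_submit(bundle, checksum)
```

Keeping the unencrypted bundle inside a `TemporaryDirectory` limits how long plain data resides on local storage, since the directory is removed as soon as the hook returns.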

Lessons learned: We established a secure processing workflow consisting of data exploration using Jupyter Notebooks in XNAT and secure processing by combining our XNAT-HPC pipeline with secure HPC.

The workflow provides additional security for our training data on the otherwise shared HPC environment. While XNAT is optimized for biosignal and image data, its open-source design allows it to be extended to other data types as well. The infrastructure has three limitations. First, the secure HPC approach requires local temporary storage for creating the encrypted data container. Second, the workflow does not cover uploading data into XNAT after it has been provided to researchers by the MII data management office. Third, while XNAT projects support collaborative access among permitted users via access control, each Jupyter Notebook session is limited to the individual user.

The authors declare that they have no competing interests.

The authors declare that an ethics committee vote is not required.


References

[1] Semler SC, Boeker M, Eils R, Krefting D, Loeffler M, Bussmann J, et al. Die Medizininformatik-Initiative im Überblick – Aufbau einer Gesundheitsforschungsdateninfrastruktur in Deutschland. Bundesgesundheitsbl. 2024 Jun 1;67(6):616–28.
[2] Hund H, Wettstein R, Heidt CM, Fegeler C. Executing Distributed Healthcare and Research Processes – The HiGHmed Data Sharing Framework. In: German Medical Data Sciences: Bringing Data to Life. IOS Press; 2021. (Studies in Health Technology and Informatics). p. 126–33.
[3] Marcus DS, Olsen TR, Ramaratnam M, Buckner RL. The Extensible Neuroimaging Archive Toolkit: an informatics platform for managing, exploring, and sharing neuroimaging data. Neuroinformatics. 2007;5(1):11–34.
[4] Zaschke P, Hempel P, Bowden J, Bender T, Hanß S, Spicher N, Krefting D. Extending the Biosignal and Imaging Data Managing Platform XNAT by High Performance Computing for Reproducible Processing. In: Gesundheit – gemeinsam. Kooperationstagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie (GMDS), Deutschen Gesellschaft für Sozialmedizin und Prävention (DGSMP), Deutschen Gesellschaft für Epidemiologie (DGEpi), Deutschen Gesellschaft für Medizinische Soziologie (DGMS) und der Deutschen Gesellschaft für Public Health (DGPH). Dresden, 08.-13.09.2024. Düsseldorf: German Medical Science GMS Publishing House; 2024. DocAbstr. 853. DOI: 10.3205/24gmds113
[5] Nolte H, Spicher N, Russel A, Ehlers T, Krey S, Krefting D, et al. Secure HPC: A workflow providing a secure partition on an HPC system. Future Generation Computer Systems. 2023 Apr 1;141:677–91.