70th Annual Meeting of the German Association for Medical Informatics, Biometry and Epidemiology (Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V.)
Linking Clinical Trials and Publications: Enhancing DRKS Bibliographic Metadata via Reference Matching
2Berlin Institute of Health (BIH) at Charité - Universitätsmedizin Berlin, QUEST Center for Responsible Research, Berlin, Germany
Text
Introduction: Clinical trials play a vital role in medical research by providing evidence-based assessments of the efficacy and safety of new therapies. Reusing data from clinical trial registries – such as the German Clinical Trials Register (DRKS) (https://www.drks.de/) – enhances centralized findability, for example via the NFDI4Health Study Hub [1], and supports meta-research analyses such as the BIH QUEST Clinical Trials Transparency Dashboard [2]. Information on the scientific publications related to these trials is equally important, yet it is often incomplete or unavailable via the DRKS API; persistent identifiers and links to scientific publications are frequently missing [3]. To address this, we apply automated reference matching techniques to identify and link relevant literature, making more complete citation data accessible through the Health Study Hub.
Methods: We exported the complete DRKS dataset, comprising 18,143 trials, via the JSON API. From these, we extracted publication metadata from the “trialResults.publications” section (n=6,016), resulting in 10,621 individual publication records. Of these, 5,196 entries lacked both a link and an uploaded document. Although the document field in the API export was consistently empty, many of the corresponding files were available for download via the DRKS website. We therefore implemented a web scraping routine, with which we identified and retrieved 3,451 valid document URLs.
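The filtering step described above can be illustrated with a minimal sketch. Only the “trialResults.publications” path is taken from the text; the record-level keys `description`, `link` and `document`, the `drksId` key and the sample records are hypothetical and serve only to show the logic, not the actual DRKS export schema.

```python
from typing import Any, Dict, Iterable, Iterator, List

Trial = Dict[str, Any]
Publication = Dict[str, Any]


def iter_publications(trials: Iterable[Trial]) -> Iterator[Publication]:
    """Yield every publication record found under trialResults.publications."""
    for trial in trials:
        results = trial.get("trialResults") or {}
        for publication in results.get("publications") or []:
            yield publication


def lacks_link_and_document(publication: Publication) -> bool:
    """True if the record has neither a link nor an uploaded document.

    The key names "link" and "document" are assumptions for this sketch.
    """
    link = (publication.get("link") or "").strip()
    return not link and not publication.get("document")


# Hypothetical sample data standing in for the JSON export.
trials: List[Trial] = [
    {
        "drksId": "DRKS-EXAMPLE-1",
        "trialResults": {
            "publications": [
                {"description": "Studienprotokoll", "link": "", "document": None},
                {
                    "description": "Results paper",
                    "link": "https://doi.org/10.1000/example",
                    "document": None,
                },
            ]
        },
    },
    {"drksId": "DRKS-EXAMPLE-2"},  # trial without a results section
]

publications = list(iter_publications(trials))
unlinked = [p for p in publications if lacks_link_and_document(p)]
print(len(publications), len(unlinked))
```

Records selected this way correspond to the entries that lack both a link and a document and are therefore candidates for reference matching.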
To extract structured bibliographic metadata, we applied regular expressions to the description field to identify digital object identifiers (DOIs), PubMed identifiers (PMIDs), PubMed Central identifiers (PMC IDs), and web URLs. We further parsed free-text references using the GROBID (GeneRation Of BIbliographic Data) toolkit [4] to generate structured citation components (title, authors, journal, year, etc.). To normalize entries and retrieve DOIs, we queried the Crossref API [5] and compared each candidate with the parsed reference using a weighted string similarity of title and first author, computed with the Gestalt pattern matching algorithm. Candidates were accepted only above a similarity threshold of 0.75 to minimize false positives.
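The identifier extraction and the similarity check can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's actual code: the regular expressions are common heuristic patterns, the function names are hypothetical, and the title weight of 0.8 is an assumed value because the weighting is not specified here. Python's `difflib.SequenceMatcher` is used as one readily available Gestalt-style (Ratcliff/Obershelp-like) matcher; only the threshold of 0.75 comes from the text.

```python
import re
from difflib import SequenceMatcher
from typing import Dict, List

# Heuristic patterns (assumptions for this sketch).
DOI_RE = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')
PMID_RE = re.compile(r"PMID:?\s*(\d{1,8})\b", re.IGNORECASE)
PMCID_RE = re.compile(r"\bPMC\d+\b", re.IGNORECASE)
URL_RE = re.compile(r'https?://[^\s<>"]+')

TRAILING_PUNCTUATION = ".,;)"


def extract_identifiers(description: str) -> Dict[str, List[str]]:
    """Collect DOIs, PMIDs, PMC IDs and URLs from a free-text description."""
    return {
        "dois": [m.rstrip(TRAILING_PUNCTUATION) for m in DOI_RE.findall(description)],
        "pmids": PMID_RE.findall(description),
        "pmcids": [m.upper() for m in PMCID_RE.findall(description)],
        "urls": [m.rstrip(TRAILING_PUNCTUATION) for m in URL_RE.findall(description)],
    }


def similarity(a: str, b: str) -> float:
    """Gestalt-style similarity in [0, 1] on lower-cased, trimmed strings."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def weighted_score(ref_title: str, ref_author: str,
                   cand_title: str, cand_author: str,
                   title_weight: float = 0.8) -> float:
    """Weighted title / first-author similarity; title_weight is an assumption."""
    return (title_weight * similarity(ref_title, cand_title)
            + (1.0 - title_weight) * similarity(ref_author, cand_author))


def is_match(ref_title: str, ref_author: str,
             cand_title: str, cand_author: str,
             threshold: float = 0.75) -> bool:
    """Accept a candidate only if its weighted score reaches the threshold."""
    return weighted_score(ref_title, ref_author, cand_title, cand_author) >= threshold


example = ("Smith J. Some trial. doi: 10.1234/abcd.5678. "
           "PMID: 12345678, PMC1234567 https://example.org/paper")
print(extract_identifiers(example))
print(is_match("A Trial of X", "Smith", "A trial of X", "smith"))
```

In such a setup, the parsed title and first author from GROBID would be compared against each candidate returned by Crossref, and a DOI would be assigned only when `is_match` holds. A high threshold trades recall for precision, in line with the stated goal of minimizing false positives.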
Results: We applied our approach to the 5,196 items lacking both links and documents. Using regular expressions, we extracted DOIs from 896 entries, along with 260 PubMed IDs (PMID), 140 PMC IDs, and 232 web URLs. By combining GROBID parsing with Crossref matching, we identified an additional 568 DOIs. Another 649 references were detected by GROBID but could not be resolved via Crossref. Further, 3,669 descriptions were not detected as references by GROBID. Among these, 1,115 were duplicates (e.g., “Studienprotokoll” [study protocol] occurred 252 times), and 573 contained fewer than three words – these were manually confirmed as non-references. The remaining 1,332 entries, as well as the 649 detected but unresolved descriptions, still require further assessment and likely include a mix of false positives and false negatives.
Discussion: Automated bibliographic enrichment using regular expressions, GROBID and Crossref substantially improves metadata quality and facilitates linking studies to related scientific literature. This increases FAIRness and improves the reusability of the data. A full evaluation is still ongoing, but the current implementation already represents a considerable step forward.
Conclusion: We present the first version of an automated pipeline that detects bibliographic references in DRKS. A subset of high-confidence matches has already been integrated into the Health Study Hub, contributing to a more connected and transparent research ecosystem and enhancing the FAIRness of clinical trials.
The authors declare that they have no competing interests.
The authors declare that an ethics committee vote is not required.
References
[1] Darms J, Clemens V, Gonzalez-Ocanto M, Brünings-Kuppe C, Cici S, Fluck J. The German Central Health Study Hub – A Service to Find and Publish Clinical, Public Health and Epidemiological Studies and Associated Documents. In: Röhrig R, Grabe N, Hübner UH, Jung K, Sax U, Schmidt CO, et al, editors. German Medical Data Sciences 2024. Health – Thinking, Researching and Acting Together. Proceedings of the 69th Annual Meeting of the German Association of Medical Informatics, Biometry, and Epidemiology e.V. (gmds) 2024 in Dresden, Germany. IOS Press; 2024. (Studies in Health Technology and Informatics; 317). DOI: 10.3233/SHTI240847
[2] Franzen DL, Carlisle BG, Salholz-Hillel M, Riedel N, Strech D. Institutional dashboards on clinical trial transparency for University Medical Centers: A case study. Naudet F, editor. PLoS Med. 2023 Mar 21;20(3):e1004175.
[3] Salholz-Hillel M, Strech D, Carlisle BG. Results publications are inadequately linked to trial registrations: An automated pipeline and evaluation of German university medical centers. Clinical Trials. 2022 Jun;19(3):337–46.
[4] kermitt2/grobid: A machine learning software for extracting information from scholarly documents. [cited 2025 Apr 24]. Available from: https://github.com/kermitt2/grobid
[5] CrossRef API. [cited 2025 Apr 24]. Available from: https://api.crossref.org/swagger-ui/index.html



