
70. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V.

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V. (GMDS)
07.-11.09.2025
Jena


Meeting Abstract

Leveraging ChatGPT for systematic reviews – a feasibility study and framework proposal

Katharina Appel 1
Isabel Schnorr 1
Jörg Janne Vehreschild 1
Daniel Maier 1
1Goethe-Universität Frankfurt, Institut für Digitale Medizin und Klinische Datenwissenschaften, Frankfurt am Main, Germany

Text

Introduction: Systematic reviews are an important instrument for synthesizing existing knowledge in evidence-based medicine [1]. Because a manually conducted systematic review is highly resource-intensive and error-prone, Large Language Models (LLMs) could support critical steps such as title and abstract screening and full-text data extraction. Although recent publications have demonstrated the performance of LLMs for systematic reviews [1], [2], [3], a user-friendly, accessible framework is still missing. This study proposes and evaluates such an LLM-supported systematic review framework.

Methods: Based on a published systematic review of clinical prognostic scores for COVID-19 [4], we developed and validated an LLM-supported framework aligned with the PRISMA guidelines. After evaluating performance and cost-effectiveness, we chose OpenAI’s Application Programming Interface (API) and the o3 model [5]. In two consecutive steps, we called the API to screen (i) each article’s title and abstract and (ii) its full text (provided as an Extensible Markup Language [XML] file). To this end, we engineered a dynamic prompt with detailed context information on the LLM’s persona, the target and topic of the instruction, the specific task, the inclusion and exclusion criteria, and the desired output format. More specifically, the LLM was asked to check each inclusion and exclusion criterion and to provide a brief explanation. Articles that met the criteria were included; all others were excluded. Performance metrics (accuracy, Cohen’s κ, sensitivity, specificity, false-positive rate [FPR], false-negative rate [FNR]) were calculated against the results of the original human review, which served as ground truth.
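The prompt assembly and the inclusion decision described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `Criterion`, `build_prompt` and `decide`, the criterion keys, and the JSON reply format are all assumptions, and the API call itself is left out. The decision rule (every inclusion criterion met, no exclusion criterion met) is one plausible reading of "articles that met the criteria".

```python
import json
from dataclasses import dataclass


@dataclass
class Criterion:
    key: str   # short identifier such as "I1" or "E1" (hypothetical)
    kind: str  # "inclusion" or "exclusion"
    text: str  # wording of the criterion


def build_prompt(persona, topic, task, criteria, article_text):
    """Assemble one screening prompt from its context components."""
    lines = [
        f"Persona: {persona}",
        f"Review topic: {topic}",
        f"Task: {task}",
        "Criteria:",
    ]
    for c in criteria:
        lines.append(f"- [{c.key}] ({c.kind}) {c.text}")
    lines += [
        "Output format: a JSON object mapping each criterion key to",
        '{"met": true or false, "explanation": "<one sentence>"}.',
        "Article:",
        article_text,
    ]
    return "\n".join(lines)


def decide(criteria, response_json):
    """Return True (include) only if every inclusion criterion is met
    and no exclusion criterion is met (assumed decision rule)."""
    answers = json.loads(response_json)
    for c in criteria:
        met = bool(answers[c.key]["met"])
        if c.kind == "inclusion" and not met:
            return False
        if c.kind == "exclusion" and met:
            return False
    return True


# Illustrative criteria only; they are not taken from the original review.
EXAMPLE_CRITERIA = [
    Criterion("I1", "inclusion", "Reports a prognostic score for COVID-19."),
    Criterion("E1", "exclusion", "Is a case report or editorial."),
]
```

Because the prompt is rebuilt per article, the same function can serve both screening steps: pass the title and abstract in the first step and the XML full text in the second.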

Results: We included 1383 scientific articles. Preliminary results for a subset of these articles show that the LLM achieved an accuracy of 91.6% in title and abstract screening, with a low FNR (6.8%) and a Cohen’s κ of 0.75 (95% CI 0.69-0.80). Accuracy was moderate in the subsequent full-text screening phase (accuracy 77.9%, FNR 10.9%, Cohen’s κ 0.49 [95% CI 0.37-0.62]).
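For orientation, the reported metrics can all be derived from a 2×2 table that compares the LLM decision with the human decision. The sketch below assumes that "positive" means "included by the human review", so a false negative is an article the reviewers included but the LLM excluded. The function name and the example counts are hypothetical and are not the study's data. Confidence intervals are not computed here.

```python
def screening_metrics(tp, fp, fn, tn):
    """Compute agreement metrics from a 2x2 table of LLM vs. human decisions.

    tp: included by both; fp: included by the LLM only;
    fn: included by the human review only; tn: excluded by both.
    Assumes both classes occur, so no denominator is zero.
    """
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)

    # Chance agreement: product of the marginal "include" rates
    # plus product of the marginal "exclude" rates.
    p_include = ((tp + fp) / n) * ((tp + fn) / n)
    p_exclude = ((fn + tn) / n) * ((fp + tn) / n)
    p_chance = p_include + p_exclude
    kappa = (accuracy - p_chance) / (1 - p_chance)

    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "fpr": 1 - specificity,
        "fnr": 1 - sensitivity,
        "kappa": kappa,
    }


# Hypothetical counts for illustration only.
example = screening_metrics(tp=40, fp=10, fn=5, tn=45)
```

In screening, the FNR is usually the most consequential figure, because a falsely excluded article is lost to all later review steps, whereas a false positive only costs additional manual checking.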

Conclusion: The findings suggest that the LLM-based approach could substantially accelerate screening and reduce manual workload. However, the performance decline in full-text screening requires further investigation and points to the contextual complexity of the original human review’s study aim. We propose that LLMs such as o3 are best used to assist researchers in the systematic review process rather than to perform a fully automated systematic review.

Daniel Maier received speaker honoraria from Free University Berlin and travel compensation from IQVIA. Jörg Janne Vehreschild received payments or honoraria from Merck / MSD, Gilead, Pfizer, Astellas Pharma, Basilea, German Centre for Infection Research (DZIF), University Hospital Freiburg / Congress and Communication, Academy for Infectious Medicine, University Manchester, German Society for Infectious Diseases (DGI), Ärztekammer Nordrhein, Ärztekammer Hessen, University Hospital Aachen, Back Bay Strategies, German Society for Internal Medicine (DGIM), Shionogi, Molecular Health, Netzwerk Universitätsmedizin, Janssen, NordForsk, Biontech, APOGEPHA, German Cancer Consortium (DKTK), University Hospital Oldenburg. Jörg Janne Vehreschild has grants from Merck / MSD, Gilead, Pfizer, Astellas Pharma, Basilea, German Centre for Infection Research (DZIF), German Federal Ministry of Education and Research (BMBF), Deutsches Zentrum für Luft- und Raumfahrt (DLR), University of Bristol, Rigshospitalet Copenhagen, German Network University Medicine, German Cancer Consortium (DKTK), German Federal Ministry of Health (BMG), European Union. Jörg Janne Vehreschild received support for attending meetings and/or travel from German Centre for Infection Research (DZIF), University Manchester, German Society for Infectious Diseases (DGI), University Hospital Aachen, German Society for Internal Medicine (DGIM), Netzwerk Universitätsmedizin, German Cancer Consortium (DKTK). Jörg Janne Vehreschild participated on Data Safety Monitoring Boards or Advisory Boards of Merck / MSD, Gilead, Pfizer, Astellas Pharma, Basilea, German Centre for Infection Research (DZIF), Academy for Infectious Medicine, University Manchester, German Society for Infectious Diseases (DGI), German Society for Internal Medicine (DGIM), Netzwerk Universitätsmedizin, Janssen, Biontech. All other authors report no conflicts of interest.

The authors declare that an ethics committee vote is not required.


References

[1] Affengruber L, van der Maten MM, Spiero I, Nussbaumer-Streit B, Mahmic-Kaknjo M, Ellen ME, et al. An exploration of available methods and tools to improve the efficiency of systematic review production: a scoping review. BMC Med Res Methodol. 2024;24:210. DOI: 10.1186/s12874-024-02320-4
[2] Hanegraaf P, Wondimu A, Mosselman JJ, de Jong R, Abogunrin S, Querios L, et al. Inter-reviewer reliability of human literature reviewing and implications for the introduction of machine-assisted systematic reviews: a mixed-methods review. BMJ Open. 2024;14(3):e076912. DOI: 10.1136/bmjopen-2023-076912
[3] Delgado-Chaves FM, Jennings MJ, Atalaia A, Wolff J, Horvath R, Mamdouh ZM, Baumbach J, Baumbach L. Transforming literature screening: The emerging role of large language models in systematic reviews. PNAS. 2025;122(2):e2411962122. DOI: 10.1073/pnas.2411962122
[4] Appel KS, Geisler R, Maier D, Miljukov O, Hopff SM, Vehreschild JJ. A Systematic Review of Predictor Composition, Outcomes, Risk of Bias, and Validation of COVID-19 Prognostic Scores. Clin Infect Dis. 2024;78(4):889-899. DOI: 10.1093/cid/ciad618
[5] OpenAI. o3: Our most powerful reasoning model. San Francisco, California, U.S.: OpenAI; 2025 [cited 27 June 2025]. Available from: https://platform.openai.com/docs/models/o3