Leveraging ChatGPT for systematic reviews – a feasibility study and framework proposal

Meeting Abstract
Document ID: 25gmds156
DOI: 10.3205/25gmds156
URN: urn:nbn:de:0183-25gmds1562

Katharina Appel, Isabel Schnorr, Jörg Janne Vehreschild, Daniel Maier
Goethe-Universität Frankfurt, Institut für Digitale Medizin und Klinische Datenwissenschaften, Frankfurt am Main, Germany

Publisher: German Medical Science GMS Publishing House, Düsseldorf

Keywords: systematic review, large language model (LLM), artificial intelligence (AI)
Published: 2025-11-03
License: This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License.
Conference: 70. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS), Jena, 2025-09-07 to 2025-09-11. Session PS 8: Medizinische Biometrie. Abstr. 336.

Introduction: Systematic reviews are an important instrument for synthesizing existing knowledge in evidence-based medicine [1]. Conducting a systematic review manually is highly resource-intensive and error-prone; Large Language Models (LLMs) could support critical steps such as title and abstract screening and full-text data extraction. Although recent publications have demonstrated the performance of LLMs for systematic reviews [1], [2], [3], a user-friendly, accessible framework is still missing. This study proposes and evaluates such an LLM-supported systematic review framework.

Methods: Based on a published systematic review of clinical prognostic scores for COVID-19 [4], we developed and validated an LLM-supported framework aligned with the PRISMA guidelines. After evaluating performance and cost-effectiveness, we chose OpenAI’s Application Programming Interface (API) and the o3 model [5]. In two consecutive steps, we called the API to screen (i) each article’s title and abstract and (ii) its full text (provided as an Extensible Markup Language [XML] file).
To this end, we engineered a dynamic prompt with detailed context information on the LLM’s persona, the target and topic of the instruction, the specific task, the inclusion and exclusion criteria, and the desired output format. More specifically, the LLM was asked to check each inclusion and exclusion criterion and to provide a brief explanation. Articles meeting the criteria were included; all others were excluded. Performance metrics (accuracy, Cohen’s κ, sensitivity, specificity, false-positive rate [FPR], false-negative rate [FNR]) were calculated against the results of the original human review, which served as ground truth.

Results: We included 1383 scientific articles. In preliminary results for a subset of these articles, title and abstract screening achieved an accuracy of 91.6%, a low FNR (6.8%), and a Cohen’s κ of 0.75 (95% CI 0.69-0.80). The subsequent full-text screening phase showed moderate overall accuracy (accuracy 77.9%, FNR 10.9%, Cohen’s κ of 0.49 [95% CI 0.37-0.62]).

Conclusion: The findings suggest that the LLM-based approach could substantially accelerate screening and reduce manual workload. However, the decline in performance during full-text screening requires further investigation and demonstrates the contextual complexity of the original human review’s study aim. We propose that LLMs such as o3 are best used to assist researchers in the systematic review process, not to perform a fully automated systematic review.

Competing interests: Daniel Maier received speaker honoraria from Free University Berlin and travel compensation from IQVIA.
Jörg Janne Vehreschild received payments or honoraria from Merck / MSD, Gilead, Pfizer, Astellas Pharma, Basilea, German Centre for Infection Research (DZIF), University Hospital Freiburg / Congress and Communication, Academy for Infectious Medicine, University Manchester, German Society for Infectious Diseases (DGI), Ärztekammer Nordrhein, Ärztekammer Hessen, University Hospital Aachen, Back Bay Strategies, German Society for Internal Medicine (DGIM), Shionogi, Molecular Health, Netzwerk Universitätsmedizin, Janssen, NordForsk, Biontech, APOGEPHA, German Cancer Consortium (DKTK), and University Hospital Oldenburg. Jörg Janne Vehreschild has received grants from Merck / MSD, Gilead, Pfizer, Astellas Pharma, Basilea, German Centre for Infection Research (DZIF), German Federal Ministry of Education and Research (BMBF), Deutsches Zentrum für Luft- und Raumfahrt (DLR), University of Bristol, Rigshospitalet Copenhagen, German Network University Medicine, German Cancer Consortium (DKTK), German Federal Ministry of Health (BMG), and the European Union. Jörg Janne Vehreschild received support for attending meetings and/or travel from German Centre for Infection Research (DZIF), University Manchester, German Society for Infectious Diseases (DGI), University Hospital Aachen, German Society for Internal Medicine (DGIM), Netzwerk Universitätsmedizin, and German Cancer Consortium (DKTK). Jörg Janne Vehreschild participated on Data Safety Monitoring Boards or Advisory Boards of Merck / MSD, Gilead, Pfizer, Astellas Pharma, Basilea, German Centre for Infection Research (DZIF), Academy for Infectious Medicine, University Manchester, German Society for Infectious Diseases (DGI), German Society for Internal Medicine (DGIM), Netzwerk Universitätsmedizin, Janssen, and Biontech. All other authors report no conflicts of interest.

The authors declare that an ethics committee vote is not required.
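
The performance metrics named in the Methods can be illustrated with a short sketch. The abstract does not state the formulas or the software used, so the following assumes the standard confusion-matrix definitions, treats "include" as the positive class, and takes the human decisions as ground truth. The function name `screening_metrics`, the class `ScreeningMetrics`, and the toy decision lists are hypothetical and are not study data.

```python
from dataclasses import dataclass


@dataclass
class ScreeningMetrics:
    accuracy: float
    sensitivity: float
    specificity: float
    fpr: float
    fnr: float
    kappa: float


def screening_metrics(human: list[bool], llm: list[bool]) -> ScreeningMetrics:
    """Compare LLM decisions with human decisions (assumed ground truth).

    True means "include", False means "exclude".
    """
    if len(human) != len(llm) or not human:
        raise ValueError("need two non-empty decision lists of equal length")

    pairs = list(zip(human, llm))
    tp = sum(h and m for h, m in pairs)            # both include
    tn = sum(not h and not m for h, m in pairs)    # both exclude
    fp = sum(not h and m for h, m in pairs)        # LLM includes, human excludes
    fn = sum(h and not m for h, m in pairs)        # LLM excludes, human includes
    n = len(pairs)

    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("human decisions must contain both includes and excludes")

    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)

    # Cohen's kappa: observed agreement corrected for chance agreement,
    # where chance agreement comes from the marginal include/exclude rates.
    p_observed = accuracy
    p_expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
    kappa = (p_observed - p_expected) / (1 - p_expected)

    return ScreeningMetrics(
        accuracy=accuracy,
        sensitivity=sensitivity,
        specificity=specificity,
        fpr=1 - specificity,
        fnr=1 - sensitivity,
        kappa=kappa,
    )


if __name__ == "__main__":
    # Toy decisions for ten articles (illustrative only, not study data).
    human = [True, True, True, True, False, False, False, False, False, False]
    llm = [True, True, True, False, True, False, False, False, False, False]
    print(screening_metrics(human, llm))
```

The sketch returns point estimates only; the 95% confidence intervals reported for Cohen’s κ would need an additional step (for example a bootstrap or an asymptotic standard error), which the abstract does not specify.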
References:
[1] Affengruber L, van der Maten MM, Spiero I, Nussbaumer-Streit B, Mahmic-Kaknjo M, Ellen ME, et al. An exploration of available methods and tools to improve the efficiency of systematic review production: a scoping review. BMC Med Res Methodol. 2024;24:210. DOI: 10.1186/s12874-024-02320-4
[2] Hanegraaf P, Wondimu A, Mosselman JJ, de Jong R, Abogunrin S, Querios L, et al. Inter-reviewer reliability of human literature reviewing and implications for the introduction of machine-assisted systematic reviews: a mixed-methods review. BMJ Open. 2024;14(3):e076912. DOI: 10.1136/bmjopen-2023-076912
[3] Delgado-Chaves FM, Jennings MJ, Atalaia A, Wolff J, Horvath R, Mamdouh Z M, Baumbach J, Baumbach L. Transforming literature screening: The emerging role of large language models in systematic reviews. PNAS. 2025;122(2):e2411962122. DOI: 10.1073/pnas.2411962122
[4] Appel KS, Geisler R, Maier D, Miljukov O, Hopff SM, Vehreschild JJ. A Systematic review of Predictor Composition, Outcomes, Risk of Bias, and Validation of COVID-19 Prognostic Scores. Clin Infect Dis. 2024 Apr 15;78(4):889-899. DOI: 10.1093/cid/ciad618
[5] OpenAI. o3 Our most powerful reasoning model. San Francisco, California, U.S.: OpenAI; 2025 [cited 27 June 2025]. Available from: https://platform.openai.com/docs/models/o3