LLMs for data extraction in a systematic review: Is AI yet ready for the task?

Meeting Abstract
DOI: 10.3205/25gmds060
URN: urn:nbn:de:0183-25gmds0603

Monique Stenzel

Federated Information Systems, Deutsches Krebsforschungszentrum (DKFZ), Heidelberg, Germany
Complex Medical Informatics, Medizinische Fakultät Mannheim, Universität Heidelberg, Mannheim, Germany
Mannheimer Institut für intelligente Systeme in der Medizin, Medizinische Fakultät Mannheim, Universität Heidelberg, Mannheim, Germany
Deutsches Konsortium für Translationale Krebsforschung (DKTK), DKFZ, Kernzentrum Heidelberg, Heidelberg, Germany

Alexandra Albu

Klinik für Anästhesiologie, Operative Intensivmedizin und Schmerzmedizin, Medizinische Fakultät Mannheim der Universität Heidelberg, Mannheim, Germany
Mannheim Institute for Innate Immunoscience (MI3), Medizinische Fakultät Mannheim der Universität Heidelberg, Mannheim, Germany

Franziska Holke

Bianka Hahn

Joshua Georg Tellbach

Martin Lablans

Verena Schneider-Lindner

Publisher: German Medical Science GMS Publishing House, Düsseldorf

Keywords: large language model (LLM), artificial intelligence in evidence synthesis, systematic review - data extraction, propensity score methods, sepsis

Published: 2025-11-03

This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License.

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie. 70. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS). Jena, 2025-09-07 to 2025-09-11. V: Large language models & medical texts 2. Abstr. 63

Introduction: Systematic reviews are typically conducted following established standards [1]. With their increasing capabilities, artificial intelligence (AI) large language models (LLMs) such as ChatGPT may assist with systematic reviews, but their performance in data extraction from published studies is unclear. We therefore evaluated the consistency between data extracted by humans and by AI to gain insight into AI's potential in systematic reviews.

Methods: We used extracted data available from a time-stratified subset (N=45) of our systematic review of studies in ICU patients with sepsis risk or sepsis treatment that utilized propensity score methods [2]. From each time stratum we selected one study (N=5) for data re-extraction with freely accessible AI tools (ChatGPT/ChatGPT-4-turbo, DeepSeek/DeepSeek-R1 [3], Qwen/Qwen2.5-Max [4], Mistral/Mistral 1.0 (LeChat) [5], Grok/Grok-3). Each AI tool received a prompt containing instructions on requirements, the expected result format and the previously employed data extraction form comprising 24 questions. Two independent evaluators compared the pre-existing and AI-generated free-text answers, and single/multiple choice answers were compared using Python. We assigned a value of 2 for full agreement, 1 for partial agreement and 0 for disagreement.
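The automated comparison of single/multiple choice answers could look roughly like the following minimal sketch. The abstract does not state how partial agreement was defined, so the rule used here (at least one shared option, but not identical selections) is an assumption, as are the function names and the example answers, which are purely hypothetical. The 2/1/0 values follow the scale described in the Methods.

```python
def score_choice_answer(human, ai):
    """Score one question: 2 = full, 1 = partial, 0 = no agreement.

    Both arguments are iterables of selected options. A single choice
    answer is simply a one-element collection.
    """
    # Normalise case and whitespace so trivial differences do not count.
    human_set = {option.strip().lower() for option in human}
    ai_set = {option.strip().lower() for option in ai}
    if human_set == ai_set:
        return 2
    # ASSUMPTION: any overlap between the selections is partial agreement.
    if human_set & ai_set:
        return 1
    return 0


def proportion_correct(scores):
    """Share of answers scored 1 or 2 (partial or full agreement)."""
    return sum(1 for s in scores if s >= 1) / len(scores)


# Hypothetical answers to three questions of an extraction form.
human_answers = [{"logistic regression"}, {"matching", "weighting"}, {"yes"}]
ai_answers = [{"Logistic regression"}, {"matching"}, {"no"}]

scores = [score_choice_answer(h, a) for h, a in zip(human_answers, ai_answers)]
print(scores)                      # [2, 1, 0]
print(proportion_correct(scores))  # 2 of 3 answers have a value of at least 1
```

Representing every answer as a set keeps single and multiple choice questions on the same code path; for a single choice question the score can then only be 2 or 0.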
Answers with a value of ≥1 were counted as correct. We quantified agreement with weighted Cohen’s kappa (κ). For this, we first compared each human reviewer's assessment with the human consensus and then compared each AI with each human reviewer’s consensus-aligned assessment.

Results: Overall, all AI tools answered a similar proportion of the 120 answers (5 studies × 24 questions) correctly: ChatGPT 70.8%, Mistral 73.3%, Grok 74.2%, Qwen 75.0% and DeepSeek 78.3%. The mean proportion of correct AI answers [min-max] varied by study, ranging from 60.0% [54.2%-66.7%] to 83.3% [66.7%-95.8%]. The weighted agreement between the two human reviewers and their consensus was substantial (Cohen’s κ=0.65). Agreement of the AI tools with the consensus-aligned ratings of the first human reviewer was fair to moderate: ChatGPT (κ=0.36), DeepSeek (κ=0.48), Qwen (κ=0.48), Mistral (κ=0.44), and Grok (κ=0.41). Agreement with the second human reviewer was likewise fair to moderate: ChatGPT (κ=0.28), DeepSeek (κ=0.46), Qwen (κ=0.41), Mistral (κ=0.41), and Grok (κ=0.42). Across all studies, for some questions both the individual human reviewers and the AI tools deviated noticeably from the human consensus. For further questions, the AI tools deviated strongly from the human answers. For the remaining questions, the answers of some or all AI tools matched those of the humans.

Discussion: The substantial agreement between the human reviewers and their consensus indicates a generally reliable human baseline. Questions that were challenging for both AI and human reviewers reflect difficulties in extracting complex or ambiguously reported, often methodological, information. The variability in agreement for such questions across studies indicates that some publications facilitated data extraction, but the characteristics that could inform reporting recommendations are yet to be identified. Prompt engineering may enhance AI performance for some questions. AI assistance for other systematic review steps remains untested.
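For readers who want to reproduce this kind of agreement statistic, the following is a minimal sketch of a weighted Cohen's kappa on the 0/1/2 scale. The abstract does not say whether linear or quadratic weights were used, so linear weights are an assumption here, and the two rating vectors are hypothetical. The verbal labels in `interpret_kappa` follow the widely used Landis and Koch bands, which match the wording of the Results ("fair", "moderate", "substantial") but are not named in the source.

```python
from collections import Counter


def weighted_kappa(rater_a, rater_b, categories=(0, 1, 2), weighting="linear"):
    """Weighted Cohen's kappa for two raters on an ordinal scale."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    k = len(categories)
    position = {c: i for i, c in enumerate(categories)}
    power = 1 if weighting == "linear" else 2  # otherwise quadratic
    observed = Counter(zip(rater_a, rater_b))
    marginal_a = Counter(rater_a)
    marginal_b = Counter(rater_b)
    observed_disagreement = 0.0
    expected_disagreement = 0.0
    for ca in categories:
        for cb in categories:
            # Disagreement weight: 0 on the diagonal, 1 for the extremes.
            w = (abs(position[ca] - position[cb]) / (k - 1)) ** power
            observed_disagreement += w * observed[(ca, cb)] / n
            expected_disagreement += w * marginal_a[ca] * marginal_b[cb] / (n * n)
    if expected_disagreement == 0:
        return float("nan")  # kappa is undefined if no disagreement is expected
    return 1.0 - observed_disagreement / expected_disagreement


def interpret_kappa(kappa):
    """Verbal label after Landis and Koch (ASSUMPTION: bands used in the text)."""
    if kappa < 0:
        return "poor"
    bands = [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
             (0.80, "substantial"), (1.00, "almost perfect")]
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "almost perfect"


# Hypothetical scores of eight answers against two reference ratings.
ratings_a = [2, 2, 1, 0, 2, 1, 0, 2]
ratings_b = [2, 1, 1, 0, 2, 0, 0, 2]
kappa = weighted_kappa(ratings_a, ratings_b)
print(round(kappa, 3), interpret_kappa(kappa))  # 0.733 substantial
```

With weights, a disagreement between 1 and 2 is penalised less than one between 0 and 2, which suits the ordinal full/partial/none scale better than unweighted kappa.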
A larger study sample would allow a more precise determination of agreement and of the particular strengths of specific AI tools.

Conclusion: While current LLMs show potential for supporting data extraction in systematic reviews, particularly when guided by well-structured forms, their agreement with human reviewers remains limited and their performance is insufficient to replace human judgment. Future strategies may include flagging low AI confidence so that human reviewers can prioritize their checks, thereby enhancing review efficiency.

The authors declare that they have no competing interests.

The authors declare that an ethics committee vote is not required.

References:

[1] Page MJ, McKenzie JE, Bossuyt PM, Boutron I, Hoffmann TC, Mulrow CD, et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. J Clin Epidemiol. 2021;134:178-89. DOI: 10.1016/j.jclinepi.2021.03.001

[2] Stenzel M, Holke F, Schneider-Lindner V. Propensity score methods in sepsis research in critical care: a systematic review. PROSPERO; 2023. Available from: https://www.crd.york.ac.uk/PROSPERO/view/CRD42023458707

[3] DeepSeek-AI, Liu A, Feng B, Xue B, Wang B, Wu B, et al. DeepSeek-V3 Technical Report [Preprint]. arXiv. 2024 Dec 27. DOI: 10.48550/arXiv.2412.19437

[4] Qwen, Yang A, Yang B, Zhang B, Hui B, Zheng B, Yu B, et al. Qwen2.5 Technical Report [Preprint]. arXiv. 2024 Dec 19. DOI: 10.48550/arXiv.2412.15115

[5] Jiang AQ, Sablayrolles A, Mensch A, Bamford C, Singh Chaplot D, de las Casas D, et al. Mistral 7B [Preprint]. arXiv. 2023 Oct 10. DOI: 10.48550/arXiv.2310.06825