70. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V.
LLMs for data extraction in a systematic review: Is AI yet ready for the task?
2Complex Medical Informatics, Medizinische Fakultät Mannheim, Universität Heidelberg, Mannheim, Germany
3Mannheimer Institut für intelligente Systeme in der Medizin, Medizinische Fakultät Mannheim, Universität Heidelberg, Mannheim, Germany
4Deutsches Konsortium für Translationale Krebsforschung (DKTK), DKFZ, Kernzentrum Heidelberg, Heidelberg, Germany
5Klinik für Anästhesiologie, Operative Intensivmedizin und Schmerzmedizin, Medizinische Fakultät Mannheim der Universität Heidelberg, Mannheim, Germany
6Mannheim Institute for Innate Immunoscience (MI3), Medizinische Fakultät Mannheim der Universität Heidelberg, Mannheim, Germany
Text
Introduction: Systematic reviews are typically conducted following established standards [1]. With their increasing capabilities, artificial intelligence (AI) tools such as large language models (LLMs), e.g. ChatGPT, may assist with systematic reviews, but their performance in data extraction from published studies is unclear. We therefore evaluated the consistency of data extracted by humans and by AI, offering insights into AI's potential in systematic reviews.
Methods: We used extracted data available from a time-stratified subset (N=45) of our systematic review of studies in ICU patients with sepsis risk or sepsis treatment that utilized propensity score methods [2]. From each time stratum we selected one study (N=5) for data re-extraction with freely accessible AI tools (ChatGPT/ChatGPT-4-turbo, DeepSeek/DeepSeek-R1 [3], Qwen/Qwen2.5-Max [4], Mistral/Mistral 1.0 (Le Chat) [5], Grok/Grok-3). Each AI received a prompt containing instructions on requirements, the expected result format, and the previously employed data extraction form comprising 24 questions. Two independent evaluators compared pre-existing and AI-generated free-text answers, while single/multiple-choice answers were compared with Python. We assigned a value of 2 for full agreement, 1 for partial agreement and 0 for disagreement; answers with a value of ≥1 were considered correct. We quantified agreement with weighted Cohen's kappa (κ). For this, we first compared each of the human reviews to the human consensus, and then each AI to each human reviewer's consensus-aligned assessment.
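As a minimal illustration of the agreement calculation described above (not the study's own script; the example score arrays and the use of scikit-learn are assumptions), weighted Cohen's kappa on the ordinal 0/1/2 agreement scores could be computed in Python as follows:

# Minimal sketch: weighted Cohen's kappa on ordinal agreement scores
# (2 = full agreement, 1 = partial agreement, 0 = disagreement).
# The score arrays below are hypothetical examples, not study data.
from sklearn.metrics import cohen_kappa_score

# Hypothetical per-question scores for one human reviewer and one AI
# (in the study, 24 questions x 5 studies = 120 answers per comparison).
human_scores = [2, 2, 1, 0, 2, 1, 2, 0, 1, 2]
ai_scores = [2, 1, 1, 0, 2, 0, 2, 1, 1, 2]

# Linear weights penalize partial disagreement (|2-1|) less than
# full disagreement (|2-0|), matching the ordinal 0/1/2 scale.
kappa = cohen_kappa_score(human_scores, ai_scores, weights="linear")

# Proportion of "correct" answers: score >= 1 (full or partial agreement).
correct = sum(s >= 1 for s in ai_scores) / len(ai_scores)

print(f"weighted kappa = {kappa:.2f}, correct proportion = {correct:.1%}")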
Results: Overall, all AIs yielded a similar proportion of correct answers out of the 120 answers: ChatGPT 70.8%, Mistral 73.3%, Grok 74.2%, Qwen 75.0% and DeepSeek 78.3%. Average proportions of correct AI answers [min-max] varied by study, ranging from 60.0% [54.2%-66.7%] to 83.3% [66.7%-95.8%]. The weighted inter-rater agreement between the two human reviewers and their consensus was substantial (Cohen's κ=0.65). The agreement of each AI with the consensus-aligned ratings of the first human reviewer was: ChatGPT (κ=0.36), DeepSeek (κ=0.48), Qwen (κ=0.48), Mistral (κ=0.44) and Grok (κ=0.41), indicating fair to moderate agreement. Compared to the second human reviewer, the weighted kappa values were: ChatGPT (κ=0.28), DeepSeek (κ=0.46), Qwen (κ=0.41), Mistral (κ=0.41) and Grok (κ=0.42), also reflecting fair to moderate agreement. Across all studies, for some questions both individual humans and the AIs noticeably deviated from the human consensus; for additional questions only the AIs strongly deviated from the human answers; for the remaining questions, some or all AIs' answers matched those of the humans.
Discussion: The substantial agreement indicates a generally reliable human baseline. Questions that were challenging for both AI and human reviewers reflect difficulties in extracting complex or ambiguously reported, often methodological, information. The variability of agreement for such questions across studies indicates that some publications facilitated data extraction, but the characteristics that could inform reporting recommendations are yet to be identified. Prompt engineering may enhance AI performance for some questions. AI assistance for other systematic review steps remains untested. A larger study sample would allow a more precise determination of agreement and of the particular strengths of specific AIs.
Conclusion: While current LLMs show potential for supporting data extraction in systematic reviews, particularly when guided by well-structured forms, their agreement with human reviewers remains limited and their performance is insufficient to replace human judgment. Future strategies may include flagging answers with low AI confidence so that human reviewers can prioritize them, thereby enhancing review efficiency.
The authors declare that they have no competing interests.
The authors declare that an ethics committee vote is not required.
References
[1] Page MJ, McKenzie JE, Bossuyt PM, Boutron I, Hoffmann TC, Mulrow CD, et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. J Clin Epidemiol. 2021;134:178-89. DOI: 10.1016/j.jclinepi.2021.03.001
[2] Stenzel M, Holke F, Schneider-Lindner V. Propensity score methods in sepsis research in critical care: a systematic review. PROSPERO; 2023. Available from: https://www.crd.york.ac.uk/PROSPERO/view/CRD42023458707
[3] DeepSeek-Ai, Liu A, Feng B, Xue B, Wang B, Wu B, et al. DeepSeek-V3 Technical Report [Preprint]. arXiv. 2024 Dec 27. DOI: 10.48550/arXiv.2412.19437
[4] Qwen, Yang A, Yang B, Zhang B, Hui B, Zheng B, Yu B, et al. Qwen2.5 Technical Report [Preprint]. arXiv. 2024 Dec 19. DOI: 10.48550/arXiv.2412.15115
[5] Jiang AQ, Sablayrolles A, Mensch A, Bamford C, Singh Chaplot D, de las Casas D, et al. Mistral 7B [Preprint]. arXiv. 2023 Oct 10. DOI: 10.48550/arXiv.2310.06825



