70. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V.
A Benchmark Dataset for Evaluating Large Language Models on German Clinical Guidelines
2Universitätsmedizin Greifswald, Klinik für Anästhesie, Intensiv-, Notfall- und Schmerzmedizin, Greifswald, Germany
Introduction: Clinical practice guidelines are essential tools for evidence-based healthcare, but clinicians often face practical obstacles when trying to quickly access specific guideline information during clinical decision-making. Guidelines are typically presented in textual formats that require time-consuming manual navigation to answer care-related questions. This becomes especially challenging for living guidelines, which are updated frequently. Large language models (LLMs) have shown strong performance in medical question answering, suggesting their potential to support more efficient access to medical knowledge [1], [2]. Although multiple benchmarking methods for evaluating different capabilities of LLMs exist [3], [4], [5], their ability to retrieve and apply information from German-language guidelines has not yet been systematically assessed. Evaluating this capability, which is critical for guideline-based decision support systems in German-speaking regions, has so far been hindered by the absence of a dedicated benchmarking dataset.
Methods: We designed a benchmarking dataset to evaluate the ability of large language models to retrieve and apply information from German clinical practice guidelines. To this end, we selected ten current guidelines with the highest quality level (S3) from the German guideline registry maintained by the Association of the Scientific Medical Societies (AWMF), covering five medical specialties: intensive care, anesthesiology, pediatrics, gastroenterology, and general medicine. Based on these guidelines, we generated multiple-choice questions following established principles for high-quality question creation. To ensure quality and clinical relevance, we conducted a structured expert review using a validation questionnaire that assessed each question's correctness, relevance, complexity, and frequency in clinical practice.
Results: The primary result of this work is a benchmark dataset of 200 multiple-choice questions for evaluating LLMs on German guideline-based question answering. The questions include both case-based and knowledge-based formats. Each question comprises a clinical scenario or direct query, one correct answer, and three distractors, and is annotated with metadata including guideline source, chapter reference, specialty, difficulty rating, and question format. Expert validation was conducted by clinicians with at least three years of specialty-specific experience. Of all questions, 72% were rated as relevant or very relevant to clinical practice. While 52% were considered difficult or very difficult for assistant physicians, the same questions were largely rated as easy by specialists, highlighting how perceived difficulty varies with experience level.
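To illustrate how such annotated multiple-choice items could be represented and scored in an automated evaluation pipeline, the following is a minimal sketch in Python. The field names, class name, and example content are illustrative assumptions and do not reflect the actual schema or wording of the published dataset.

from dataclasses import dataclass

# Illustrative sketch only: field names and values are assumptions,
# not the actual schema of the published benchmark dataset.
@dataclass
class BenchmarkItem:
    question: str                # clinical scenario or direct query
    options: dict[str, str]      # answer options keyed by letter
    correct_option: str          # key of the single correct answer
    guideline: str               # AWMF guideline source (hypothetical)
    chapter: str                 # chapter reference within the guideline
    specialty: str               # e.g. intensive care, anesthesiology
    question_format: str         # "case-based" or "knowledge-based"
    difficulty: str              # expert-rated difficulty
    relevance: str = "relevant"  # expert-rated clinical relevance

def is_correct(item: BenchmarkItem, model_answer: str) -> bool:
    """Score a single model response against the annotated key."""
    return model_answer.strip().upper() == item.correct_option

# Hypothetical example item (content invented for illustration).
example = BenchmarkItem(
    question="Case vignette ... which measure is recommended?",
    options={"A": "Option A", "B": "Option B", "C": "Option C", "D": "Option D"},
    correct_option="B",
    guideline="AWMF S3 guideline (hypothetical registry number)",
    chapter="Hypothetical chapter reference",
    specialty="intensive care",
    question_format="case-based",
    difficulty="difficult",
)
print(is_correct(example, "b"))  # True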
Discussion: This work addresses a critical gap in the evaluation of LLMs for clinical decision support in German-speaking settings. By including multiple specialties, diverse question formats, and expert-rated metadata, the dataset allows systematic and clinically meaningful analysis of LLM performance across varied contexts and provides a foundation for reproducible testing of LLM capabilities on German-language guideline content. It also enables targeted exploration of model capabilities in specialty-specific contexts, supporting the development of trustworthy decision support systems for German-speaking healthcare environments.
Conclusion: We present a new benchmark dataset specifically designed to assess the effectiveness of LLMs in retrieving and applying medical knowledge from German clinical practice guidelines. The dataset will be made publicly available at https://github.com/umg-minai/guideline-rag-llm.
The authors declare that they have no competing interests.
The authors declare that an ethics committee vote is not required.
References
[1] Singhal K, Tu T, Gottweis J, Sayres R, Wulczyn E, Amin M, et al. Toward expert-level medical question answering with large language models. Nat Med. 2025 Jan 8. DOI: 10.1038/s41591-024-03423-7
[2] Ke YH, Jin L, Elangovan K, Abdullah HR, Liu N, Sia ATH, et al. Retrieval augmented generation for 10 large language models and its generalizability in assessing medical fitness. npj Digit Med. 2025;8:187. DOI: 10.1038/s41746-025-01519-z
[3] Bedi S, Liu Y, Orr-Ewing L, Dash D, Koyejo S, Callahan A, et al. Testing and Evaluation of Health Care Applications of Large Language Models: A Systematic Review. JAMA. 2025 Jan 28;333(4):319. DOI: 10.1001/jama.2024.21700
[4] Budler LC, Chen H, Chen A, Topaz M, Tam W, Bian J, et al. A brief review on benchmarking for large language models evaluation in healthcare. Wiley Interdiscip Rev Data Min Knowl Discov. 2025;15(2):e70010. DOI: 10.1002/widm.70010
[5] Xiong G, Jin Q, Lu Z, Zhang A. Benchmarking retrieval-augmented generation for medicine. In: Findings of the Association for Computational Linguistics: ACL 2024. 2024. p. 6233–51. DOI: 10.18653/v1/2024.findings-acl.372



