70. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V.
Accessible pipeline for intelligent search on in-house clinical guidelines
UKE Hamburg-Eppendorf, Hamburg, Germany
Introduction: Premedication in anesthesia requires clinicians to follow an evolving set of clinical guidelines. These guidelines are often lengthy and spread across multiple documents, making quick access to relevant information during clinical workflows challenging. Existing retrieval-augmented generation (RAG) systems using large language models provide a promising foundation for intelligent document querying [1]. However, standard methods for splitting documents into search units may not be well suited to the structure and complexity of clinical guidelines. While small chunks can improve retrieval precision, they often lack the broader context required for coherent and reliable answers [2]. To address these limitations, we introduce NicerSlicer [3], a tool that slices large documents into semantically coherent sections. We also present MedQueryGuide [4], an interactive frontend for intelligent clinical guideline search.
Methods: NicerSlicer generates section splits, which users can refine by discarding, joining, splitting, or adjusting sections. The resulting sections can be downloaded and integrated into RAG pipelines. MedQueryGuide is an interactive frontend for RAG-based question answering, featuring:
- a vector store with metadata filtering for targeted search across multiple documents,
- a recursive retriever returning small content units along with their broader context sections to support more informative answers (see the sketch after this list),
- a user feedback function (“helpful” vs. “not helpful”) to iteratively improve search quality and content relevance.
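To make the retrieval design concrete, the following minimal Python sketch illustrates the small-to-big idea behind the first two features: small chunks are matched against the query, optionally restricted by a metadata filter, but their broader parent sections are returned as context. The names (Chunk, retrieve), the toy bag-of-words embedding, and the example data are illustrative assumptions, not the actual MedQueryGuide code.

    # Sketch of small-to-big retrieval with metadata filtering (toy example).
    from collections import Counter
    from dataclasses import dataclass
    import math

    @dataclass
    class Chunk:
        text: str
        section_id: str   # link to the parent section produced by NicerSlicer
        metadata: dict    # e.g. {"guideline": "premedication"}

    def embed(text):
        # Toy bag-of-words vector; a real pipeline would call an embedding model.
        return Counter(text.lower().split())

    def cosine(a, b):
        dot = sum(v * b[t] for t, v in a.items())
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    def retrieve(query, chunks, sections, guideline=None, top_k=3):
        """Match small chunks, then return their broader parent sections."""
        qv = embed(query)
        candidates = [c for c in chunks
                      if guideline is None
                      or c.metadata.get("guideline") == guideline]
        ranked = sorted(candidates, key=lambda c: cosine(qv, embed(c.text)),
                        reverse=True)
        seen, results = set(), []
        for c in ranked[:top_k]:
            if c.section_id not in seen:  # deduplicate shared parent sections
                seen.add(c.section_id)
                results.append(sections[c.section_id])
        return results

    # Toy usage with hypothetical guideline content:
    chunks = [Chunk("midazolam may be given the evening before surgery",
                    "premed#3", {"guideline": "premedication"})]
    sections = {"premed#3": "full text of the premedication section ..."}
    print(retrieve("midazolam before surgery", chunks, sections,
                   guideline="premedication"))

In the actual pipeline, the toy embedding would be replaced by a neural embedding model and the sections would carry the metadata produced during slicing.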
We quantitatively evaluated the RAG pipeline using 100 automatically generated question-answer pairs derived from four clinical guidelines (127 pages). Of these, 70 questions required information from a single section (single-hop), while 30 involved reasoning across two distinct sections (multi-hop). Retrieval was evaluated using hit rate and mean reciprocal rank (MRR), while answer quality was evaluated against reference answers using BERTScore [5]. Additionally, one anesthetist assessed 10 real-world questions in MedQueryGuide using its feedback system to explore whether the automated evaluation aligns with real-world scenarios.
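For clarity, a minimal sketch of these retrieval metrics follows, including the "at least one" versus "both sections" distinction used for the multi-hop questions in the results; the function and variable names are illustrative assumptions, not taken from the evaluation code.

    # Each query has gold section id(s) and a ranked list of retrieved ids.
    def hit_rate(ranked_lists, gold_ids, k=5):
        """Fraction of queries whose gold section appears in the top k."""
        hits = sum(1 for ranked, gold in zip(ranked_lists, gold_ids)
                   if gold in ranked[:k])
        return hits / len(gold_ids)

    def mean_reciprocal_rank(ranked_lists, gold_ids):
        """Mean of 1/rank of the gold section (0 when it is not retrieved)."""
        total = sum(1.0 / (ranked.index(gold) + 1)
                    for ranked, gold in zip(ranked_lists, gold_ids)
                    if gold in ranked)
        return total / len(gold_ids)

    def multi_hop_hit_rate(ranked_lists, gold_sets, k=5, require_all=True):
        """Multi-hop variant: hit when all (or any) gold sections are in top k."""
        agg = all if require_all else any
        hits = sum(1 for ranked, gold in zip(ranked_lists, gold_sets)
                   if agg(g in ranked[:k] for g in gold))
        return hits / len(gold_sets)

    # Answer quality with the bert-score package, where candidates and
    # references are lists of answer strings:
    # from bert_score import score
    # precision, recall, f1 = score(candidates, references, lang="en")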
Results: For single-hop questions, we achieved a hit rate of 0.87 and an MRR of 0.82. A BERTScore of 0.79 indicates a high degree of semantic similarity with the reference answers. For multi-hop questions, our approach achieved a hit rate of 0.92 for at least one of the two reference documents; however, when both reference documents were considered, the hit rate dropped to 0.54, and the BERTScore decreased to 0.75. This decline was reflected in the real-world evaluation: all five simple queries led to helpful documents and answers, while the five more complex or vague queries produced less relevant retrievals and unhelpful answers. Of the 20 documents retrieved for the five complex queries, only 10 were rated as helpful by the physician.
Discussion: A key challenge lies in working with PDFs: common extraction methods often fall short, and even state-of-the-art visual language models can introduce subtle errors. The real-world evaluation, though limited in size, reflected the performance gap between simple and complex queries. As a next step, we plan to fine-tune retrieval using synthetic data to improve performance on complex queries.
Conclusion: Our results demonstrate the importance of context-aware retrieval and flexible document segmentation when building intelligent search systems for clinical guidelines. Tools like NicerSlicer and MedQueryGuide show that tailored solutions can meaningfully support clinical workflows.
The authors declare that they have no competing interests.
The authors declare that an ethics committee vote is not required.
References
[1] Ng KKY, Matsuba I, Zhang PC. RAG in Health Care: A Novel Framework for Improving Communication and Decision-Making by Addressing LLM Limitations. NEJM AI. 2025 Jan;2(1):AIra2400380.
[2] Bhat SR, Rudat M, Spiekermann J, Flores-Herr N. Rethinking Chunk Size For Long-Document Retrieval: A Multi-Dataset Analysis [Preprint]. arXiv. 2025. DOI: 10.48550/arXiv.2505.21700
[3] UKEIAM/NicerSlicer: Repository to create nicely sliced PDFs for your RAG. GitHub; [cited 2025 Jun 20]. Available from: https://github.com/UKEIAM/NicerSlicer/tree/main
[4] IAMspiegel/MedQueryGuide: RAG application for medical guidelines. GitHub; [cited 2025 Jun 20]. Available from: https://github.com/IAMspiegel/MedQueryGuide
[5] Zhang T, Kishore V, Wu F, Weinberger KQ, Artzi Y. BERTScore: Evaluating Text Generation with BERT. In: International Conference on Learning Representations (ICLR) 2020. Available from: https://openreview.net/forum?id=SkeHuCVFDr



