70. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V.

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V. (GMDS)
7-11 September 2025
Jena

Meeting Abstract

Implementing Local Open-Source LLMs for Chatbot-Based Document Retrieval in a University Hospital

Christoph Demus - Junior Research Group (Bio-)Medical Data Science, Faculty of Medicine, Martin-Luther-University Halle-Wittenberg, Halle (Saale), Germany
Jan Bossenz - Junior Research Group (Bio-)Medical Data Science, Faculty of Medicine, Martin-Luther-University Halle-Wittenberg, Halle (Saale), Germany
Carlo Günzl - Junior Research Group (Bio-)Medical Data Science, Faculty of Medicine, Martin-Luther-University Halle-Wittenberg, Halle (Saale), Germany
Annemarie Weise - Junior Research Group (Bio-)Medical Data Science, Faculty of Medicine, Martin-Luther-University Halle-Wittenberg, Halle (Saale), Germany
Fabian Berns - medicalvalues GmbH, Karlsruhe, Germany
Christian Jäger - Junior Research Group (Bio-)Medical Data Science, Faculty of Medicine, Martin-Luther-University Halle-Wittenberg, Halle (Saale), Germany
Thomas Weber - University Hospital Halle (Saale), Department of Internal Medicine IV, Hematology and Oncology, Halle (Saale), Germany
Jan Kirchhoff - medicalvalues GmbH, Karlsruhe, Germany
Jan Christoph - Junior Research Group (Bio-)Medical Data Science, Faculty of Medicine, Martin-Luther-University Halle-Wittenberg, Halle (Saale), Germany

Text

Introduction: Integrating large language models (LLMs) into healthcare is challenging because of the sector's strict data protection and security requirements. Powerful proprietary models such as ChatGPT are therefore not suitable, even though they are increasingly used informally. To close this gap, this project presents a locally hosted open-source LLM combined with a retrieval-augmented generation (RAG) system, which lets users query an extensive, locally stored internal document library at the University Hospital Halle via a chatbot.

Methods: The system is hosted on a local Linux server equipped with an Nvidia H100 GPU and uses the Falcon-7B-Instruct model. Internal documents, primarily standard operating procedures (SOPs), procedural guidelines and forms, were cleaned, contextually chunked, and embedded using the jina-embeddings-v3 model. The system retrieves documents based on a combination of cosine similarity and BM25 scoring. Evaluation involved (1) subjective user feedback collected during a one-month test phase, (2) expert annotation of chatbot responses to 38 manually crafted evaluation questions, and (3) a technical analysis of the response time.
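The abstract does not state how the cosine similarity and BM25 scores are combined. As a minimal sketch of one common approach, the following example min-max normalizes both scores and takes a weighted sum. The weight `alpha`, the BM25 parameters `k1` and `b`, the toy documents, and the three-dimensional vectors (stand-ins for real jina-embeddings-v3 output) are all assumptions made for illustration and are not taken from the described system.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Okapi BM25 score of each tokenized document for the query."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def min_max(values):
    """Rescale scores to [0, 1] so both signals are comparable."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def hybrid_rank(query_tokens, query_vec, docs_tokens, doc_vecs, alpha=0.5):
    """Rank documents by alpha * semantic + (1 - alpha) * lexical score."""
    lexical = min_max(bm25_scores(query_tokens, docs_tokens))
    semantic = min_max([cosine(query_vec, v) for v in doc_vecs])
    combined = [alpha * s + (1 - alpha) * l for s, l in zip(semantic, lexical)]
    order = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
    return order, combined


# Hypothetical toy corpus; real chunks and embeddings would replace these.
docs_tokens = [
    ["sop", "blood", "transfusion", "procedure"],
    ["form", "vacation", "request"],
    ["guideline", "hand", "hygiene", "procedure"],
]
doc_vecs = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9], [0.6, 0.7, 0.1]]

ranking, combined = hybrid_rank(
    ["blood", "transfusion"], [1.0, 0.2, 0.0], docs_tokens, doc_vecs
)
print(ranking)
```

In this toy setup the transfusion SOP ranks first because it wins on both signals, while the hygiene guideline ranks second on semantic similarity alone. This illustrates why a hybrid score can surface documents that a pure keyword search would miss.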

Results: During the test phase, the system handled 287 user queries and received 72 feedback entries (a response rate of roughly 25%). Feedback on document relevance was generally more positive (59.26%) than feedback on the generated textual answers (43.50%). Expert evaluation using the test dataset yielded an average helpfulness rating of 72%. On a scale from one (worst) to five (best), the expert evaluation reached a score of 3.72 for textual answers and 3.89 for retrieved documents. Technical performance showed acceptable latency, with 75% of responses returned in under five seconds. First queries in a session averaged 3.71 s, while follow-up queries took slightly longer, averaging 4.28 s, because of an additional LLM request in the retrieval phase.
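The latency figures above (share of responses under a threshold, mean time for first versus follow-up queries) can be derived from a simple request log. The sketch below shows one way to compute them. The log format, the function name `summarize_latency`, and the sample timings are hypothetical and are not the study's data.

```python
from statistics import mean


def summarize_latency(records, threshold_s=5.0):
    """Summarize response times from (is_follow_up, seconds) records."""
    all_times = [seconds for _, seconds in records]
    first = [seconds for is_follow_up, seconds in records if not is_follow_up]
    follow_up = [seconds for is_follow_up, seconds in records if is_follow_up]
    return {
        # Fraction of all responses returned in under threshold_s seconds.
        "share_under_threshold": sum(t < threshold_s for t in all_times) / len(all_times),
        "mean_first": mean(first),
        "mean_follow_up": mean(follow_up),
    }


# Hypothetical log entries: (is_follow_up, response time in seconds).
sample_log = [
    (False, 2.0), (False, 3.0), (False, 4.0), (False, 6.0),
    (True, 3.0), (True, 4.0), (True, 4.5), (True, 7.0),
]

summary = summarize_latency(sample_log)
print(summary)
```

Separating first queries from follow-ups makes the cost of the extra LLM request in the retrieval phase visible as a difference between the two means.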

Discussion: Initial results show that even small LLMs can yield useful results when coupled with well-designed retrieval systems. User feedback was mixed, which likely reflects the early prototype status and the subjective nature of user expectations. Document retrieval quality was consistently rated higher than the textual synthesis, highlighting the limitations of smaller LLMs in reasoning across documents. Evaluation discrepancies between users and expert annotators likely stem from differences in prompt wording, query complexity and expectations. Response time analysis confirmed that the system operates within acceptable bounds for real-world usage.

Conclusion: This proof of concept demonstrates the viability of a locally hosted LLM-based chatbot for internal document retrieval in healthcare settings. While textual response quality can be further improved, the system already offers a useful alternative to basic keyword search. Future work should focus on refining the retrieval process, enriching metadata, and adding advanced reasoning capabilities. Moreover, a broader rollout of the system would require appropriate management protocols that regulate which users may access which documents. Overall, the project underscores the feasibility and potential of secure, on-premise AI tools in clinical environments.

Jan Kirchhoff and Fabian Berns are employed by medicalvalues GmbH, which provided consulting support during the design and development of the system. Jan Kirchhoff is the managing director of medicalvalues GmbH.

The authors declare that an ethics committee vote is not required.

