Retrieval-Augmented Generation for Semantic Annotation of Clinical Trials on the Example of ICD-10 Codes

Meeting Abstract
DOI: 10.3205/25gmds061, URN: urn:nbn:de:0183-25gmds0617

Paula Lehmann (1), Georg Mathes (1), Valeriya Vishnevskaya (1), Sylvia Bochum (2), Stefan Sigle (1)

(1) MOLIT Institut gGmbH, Heilbronn, Germany
(2) SLK Kliniken Heilbronn GmbH, Heilbronn, Germany

Publisher: German Medical Science GMS Publishing House, Düsseldorf

Keywords: retrieval-augmented generation, terminologies, ICD-10, clinical trial annotation, semantic annotation

Published: 2025-11-03. This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License.

Presented at: 70. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS) [70th Annual Meeting of the German Association for Medical Informatics, Biometry and Epidemiology], Jena, 2025-09-07 to 2025-09-11. Session V: Large language models & medical texts 2. Abstract 296.

Introduction: Accurate, structured annotation of clinical research data is essential to support personalized oncology workflows, such as matching patients to relevant clinical trials [1]. However, manual coding of clinical diagnoses using standardized terminologies remains time-consuming and prone to inconsistencies [2]. This study explores the application of Retrieval-Augmented Generation (RAG) to ICD-10-CM classification of cancer-related diagnoses, with the goal of enabling automated semantic annotation of clinical trials in precision oncology. Mapping unstructured diagnosis descriptions to standardized codes improves interoperability and facilitates the analysis of clinical trial documentation.

State of the art: Primary clinical trial registries such as clinicaltrials.gov and DRKS often do not support semantic search over the investigated conditions, so relevant studies risk being missed in a search. Current RAG architectures combine vector databases and language models for information retrieval and response generation [3]. Other groups have applied RAG to clinical trial screening using cloud-hosted models [4].

Concept: Our prototype is intended to generate precise, contextually relevant ICD-10 codes for free-text condition names.
Relevant terminology information is provided to the RAG pipeline, chunked (split into smaller units) so that logically coherent pieces can be retrieved, embedded locally, and stored in a vector store. To map a condition, relevant terminology entries are retrieved from the store and passed to the LLM together with a system prompt containing examples, and the LLM extracts the matching codes. We continually evaluate both the mapped codes and the raw retrieved context without LLM involvement, which enables iterative optimization of parameters.

Implementation: The RAG system is implemented in Python as an Open WebUI pipeline using llama-index and langchain, with ChromaDB or FAISS [5] as vector storage. The ICD-10 classification is provided in JSON. Knowledge is chunked and embedded into the vector store using nomic-embed-text as the embedding model. Context retrieval uses strategies such as similarity search. LLMs are hosted locally via Ollama. Regex-based post-processing extracts ICD codes from the LLM output.

For evaluation, the retrieved context and the output of the whole pipeline are checked against the expected codes using JavaScript. Two datasets were used: one with ICD code-display pairs randomly selected from chapter 2 of the ICD-10 classification (n=30), the other with free-text diagnoses from public trial records that a physician translated to ICD-10 codes (n=52).

Lessons learned:
- Logical chunking seems to work well for information retrieval in our use case.
- Identifying ICD main groups is much easier than finding more specific codes.
- Some descriptions lack specificity, preventing mapping (e.g. ‘Refractory Cancer’).
- Detailed prompts help LLMs to produce precise outputs.

These lessons led to promising intermediate results. Using Chroma as vector store, similarity-score retrieval, a k value of 5, and Phi4 as LLM, the ICD-10 main group (e.g. C00 for C00.1) was correct for all generated codes in 37 of 52 (71.2%) test conditions. A full match between generated and expected codes was achieved in 17 (32.7%) conditions.
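
The regex-based post-processing and the two scores reported here (main group correct, full match) can be illustrated with a minimal Python sketch. It is not the authors' code: their evaluation runs in JavaScript, and the pattern `ICD_PATTERN`, the helper names, and the exact scoring rules below are assumptions chosen to match the description in the text (main group correct means every generated code falls into an expected main group; full match means the generated and expected code sets are identical).

```python
import re

# Assumed pattern for ICD-10-CM style codes: letter, digit, alphanumeric,
# optionally followed by a dot and 1-4 alphanumerics (e.g. C50, C00.1, C50.911).
ICD_PATTERN = re.compile(r"\b[A-Z]\d[0-9A-Z](?:\.[0-9A-Z]{1,4})?\b")


def extract_codes(llm_output: str) -> set[str]:
    """Collect all substrings of the LLM output that look like ICD-10 codes."""
    return set(ICD_PATTERN.findall(llm_output))


def main_group(code: str) -> str:
    """Return the three-character main group, e.g. 'C00' for 'C00.1'."""
    return code.split(".")[0]


def evaluate(generated: set[str], expected: set[str]) -> dict[str, bool]:
    """Score one test condition (hypothetical scoring rules, see lead-in)."""
    expected_groups = {main_group(code) for code in expected}
    # An empty result counts as incorrect; otherwise every generated code
    # must belong to one of the expected main groups.
    groups_ok = bool(generated) and all(
        main_group(code) in expected_groups for code in generated
    )
    return {"main_group_correct": groups_ok, "full_match": generated == expected}


if __name__ == "__main__":
    output = "The best matching codes are C50.911 and C50.912."
    codes = extract_codes(output)
    print(sorted(codes))  # ['C50.911', 'C50.912']
    print(evaluate(codes, {"C50.919"}))
    # {'main_group_correct': True, 'full_match': False}
```

Under these assumed rules, a condition such as the example above counts toward the main-group score but not toward the full-match score, which mirrors the difference between the two percentages.
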
For test cases from the ICD-10 classification, the setup achieved correct main groups in 25 of 30 (83.3%) cases and complete matches in 22 (73.3%) cases. The lower scores on the trial-record conditions highlight difficulties caused by non-standardized condition descriptions (e.g. ‘Cml’) and indicate room for further improvement.

The authors declare that they have no competing interests.

The authors declare that an ethics committee vote is not required.

References

[1] Zeng J, Shufean MA, Khotskaya Y, Yang D, Kahle M, Johnson A, et al. OCTANE: Oncology Clinical Trial Annotation Engine. JCO Clin Cancer Inform. 2019;3:1–11. DOI: 10.1200/CCI.18.00145
[2] Miñarro-Giménez JA, Martínez-Costa C, Karlsson D, Schulz S, Gøeg KR. Qualitative analysis of manual annotations of clinical text with SNOMED CT. PloS One. 2018;13:e0209547. DOI: 10.1371/journal.pone.0209547
[3] Gao Y, Xiong Y, Gao X, Jia K, Pan J, Bi Y, et al. Retrieval-Augmented Generation for Large Language Models: A Survey [Preprint]. arXiv. 2023. DOI: 10.48550/ARXIV.2312.10997
[4] Tan R, Ho SX, Oo SVF, Chua SL, Zaw MWW, Tan DS-W. Retrieval-augmented large language models for clinical trial screening. J Clin Oncol. 2024;42:e13611–e13611. DOI: 10.1200/JCO.2024.42.16_suppl.e13611
[5] Douze M, Guzhva A, Deng C, Johnson J, Szilvasy G, Mazaré P-E, et al. The Faiss library [Preprint]. arXiv. 2025. DOI: 10.48550/arXiv.2401.08281