70. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V.
Retrieval-Augmented Generation for Semantic Annotation of Clinical Trials on the Example of ICD-10 Codes
SLK Kliniken Heilbronn GmbH, Heilbronn, Germany
Introduction: Accurate, structured annotation of clinical research data is essential to support personalized oncology workflows, such as matching patients to relevant clinical trials [1]. However, manual coding of clinical diagnoses using standardized terminologies remains time-consuming and prone to inconsistencies [2].
This study explores the application of Retrieval-Augmented Generation (RAG) [3] for accurate ICD-10-CM classification of cancer-related diagnoses.
The goal is to enable automated semantic annotation of clinical trials in precision oncology. Mapping unstructured diagnosis descriptions to standardized codes improves interoperability and enhances the analysis of clinical trial documentation.
State of the art: Primary clinical trial registries such as clinicaltrials.gov and DRKS often do not support semantic search over the investigated conditions, leaving the risk that relevant studies are missed in a search.
Current RAG architectures leverage vector databases and language models for information retrieval and response generation [3]. Other groups have applied RAG to clinical trial screening using cloud-hosted models [4].
Concept: Our prototype is intended to generate precise, contextually relevant ICD-10 codes for free-text condition names. Relevant terminology information is provided to the RAG pipeline, chunked (split into smaller units) to facilitate logically coherent retrieval, and stored in a vector store via local embedding.
During translation, relevant mappings are retrieved from the store and provided to the LLM, together with a system prompt containing examples, to extract matching codes.
We continually evaluate both the mapped codes and the retrieved context alone (without LLM involvement) to enable iterative optimization of parameters.
Implementation: The RAG system is implemented in Python as an Open WebUI pipeline, using llama-index and langchain, with ChromaDB/FAISS [5] as vector storage. The ICD-10 classification is provided as JSON. The knowledge is chunked and embedded into the vector store, using nomic-embed-text as the embedding model.
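As a minimal sketch of this ingestion step (the JSON layout, file name, and collection name below are illustrative assumptions, not taken from the paper), the terminology can be chunked one entry per retrieval unit and embedded into ChromaDB via Ollama:

    import json

    import chromadb
    import ollama  # assumes a local Ollama server with nomic-embed-text pulled

    client = chromadb.PersistentClient(path="./icd10_store")
    collection = client.get_or_create_collection("icd10")

    # Hypothetical layout: a list of {"code": ..., "display": ...} objects.
    with open("icd10_chapter2.json", encoding="utf-8") as f:
        entries = json.load(f)

    for entry in entries:
        # One chunk per code keeps each retrieval unit logically coherent.
        text = f'{entry["code"]}: {entry["display"]}'
        emb = ollama.embeddings(model="nomic-embed-text", prompt=text)["embedding"]
        collection.add(ids=[entry["code"]], embeddings=[emb], documents=[text])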
Context retrieval uses strategies such as similarity search. LLMs are hosted locally via Ollama. Regex-based post-processing extracts ICD codes from the LLM output.
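A correspondingly minimal sketch of retrieval, generation, and code extraction (query, system prompt, and regex are illustrative; the pattern is simplified relative to full ICD-10-CM code syntax, and k=5 with Phi4 mirrors the setup reported below):

    import re

    import chromadb
    import ollama

    store = chromadb.PersistentClient(path="./icd10_store")
    collection = store.get_or_create_collection("icd10")

    query = "Chronic myeloid leukaemia"
    q_emb = ollama.embeddings(model="nomic-embed-text", prompt=query)["embedding"]
    hits = collection.query(query_embeddings=[q_emb], n_results=5)  # k = 5
    context = "\n".join(hits["documents"][0])

    system_prompt = (
        "Map the diagnosis to an ICD-10 code using only the provided context. "
        "Example: 'Malignant neoplasm of external upper lip' -> C00.0"
    )
    response = ollama.chat(
        model="phi4",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Context:\n{context}\n\nDiagnosis: {query}"},
        ],
    )

    # Simplified pattern for codes such as C00 or C92.1; full ICD-10-CM
    # allows longer suffixes (e.g. C50.911).
    codes = re.findall(r"\b[A-Z][0-9]{2}(?:\.[0-9A-Z]{1,4})?\b",
                       response["message"]["content"])
    print(codes)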
For evaluation, the retrieved context and the whole pipeline's output are checked against the expected codes via JavaScript. Two datasets were used: one with ICD code-display pairs randomly selected from chapter 2 of the ICD-10 classification (n=30), the other with free-text diagnoses from public trial records, translated into ICD-10 codes by a physician (n=52).
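While the checks themselves run in JavaScript, equivalent scoring logic can be sketched in Python for consistency with the snippets above (the triple layout and case data are purely illustrative):

    def main_group(code: str) -> str:
        return code.split(".")[0]  # "C00.1" -> "C00"

    cases = [  # (expected code, retrieved context, generated codes)
        ("C92.1", ["C92.1: Chronic myeloid leukaemia"], ["C92.1"]),
        ("C00.1", ["C00.4: Malignant neoplasm of lower lip, inner aspect"], ["C00.4"]),
    ]
    retrieval_hits = sum(any(exp in doc for doc in ctx) for exp, ctx, _ in cases)
    full_matches = sum(exp in gen for exp, _, gen in cases)
    group_matches = sum(
        any(main_group(exp) == main_group(g) for g in gen) for exp, _, gen in cases
    )
    print(f"retrieval: {retrieval_hits}/{len(cases)}, "
          f"full: {full_matches}/{len(cases)}, "
          f"main group: {group_matches}/{len(cases)}")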
Lessons learned: The following lessons were drawn from this work:
- Logical chunking seems to work well for information retrieval in our use case
- Identifying ICD main groups is much easier than finding more specific codes
- Some descriptions lack specificity, preventing mapping (e.g. ‘Refractory Cancer’)
- Detailed prompts help LLMs to produce precise outputs
These lessons led to promising intermediate results: using Chroma vector storage, a similarity-scoring retrieval strategy, a k value of 5, and Phi4 as the LLM, the ICD-10 main group (e.g. C00 for C00.1) was correct for all generated codes in 37 of 52 (71.2%) test conditions. A full match between generated and expected codes was achieved in 17 (32.7%) conditions. For test cases drawn from the ICD-10 classification itself, the setup achieved correct main groups in 25 of 30 (83.3%) cases and complete matches in 22 (73.3%) cases. This highlights the difficulties caused by non-standardized condition descriptions (e.g. ‘Cml’) and indicates room for further improvement.
The authors declare that they have no competing interests.
The authors declare that an ethics committee vote is not required.
References
[1] Zeng J, Shufean MA, Khotskaya Y, Yang D, Kahle M, Johnson A, et al. OCTANE: Oncology Clinical Trial Annotation Engine. JCO Clin Cancer Inform. 2019;3:1–11. DOI: 10.1200/CCI.18.00145
[2] Miñarro-Giménez JA, Martínez-Costa C, Karlsson D, Schulz S, Gøeg KR. Qualitative analysis of manual annotations of clinical text with SNOMED CT. PLoS One. 2018;13:e0209547. DOI: 10.1371/journal.pone.0209547
[3] Gao Y, Xiong Y, Gao X, Jia K, Pan J, Bi Y, et al. Retrieval-Augmented Generation for Large Language Models: A Survey [Preprint]. arXiv. 2023. DOI: 10.48550/ARXIV.2312.10997
[4] Tan R, Ho SX, Oo SVF, Chua SL, Zaw MWW, Tan DS-W. Retrieval-augmented large language models for clinical trial screening. J Clin Oncol. 2024;42:e13611. DOI: 10.1200/JCO.2024.42.16_suppl.e13611
[5] Douze M, Guzhva A, Deng C, Johnson J, Szilvasy G, Mazaré P-E, et al. The Faiss library [Preprint]. arXiv. 2025. DOI: 10.48550/arXiv.2401.08281