70. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V.
Exploiting the ICD Catalogue for Teaching German Oncological Terminology to Large Language Models
2University Cancer Center of the Johannes Gutenberg University Mainz, Mainz, Germany
3Third Department of Medicine, University Medical Center of the Johannes Gutenberg University Mainz, Mainz, Germany
Introduction: Medical coding of free-text diagnoses is essential for structured documentation and cancer registry reporting. In Germany, both ICD-10-GM and ICD-O-3 codes are required for documenting tumor diagnoses. Despite advances in natural language processing, large language models (LLMs) still struggle with accurate coding, especially in non-English contexts, where publicly available training data is scarce [1]. This study investigates whether instruction fine-tuning using only data from public German catalogues can improve the ICD coding capabilities of LLMs for tumor diagnoses.
Methods: Using the instruction-tuned variants of the LLMs Llama 3.1 8B [2] and Qwen 2.5 7B [3], we performed instruction-based fine-tuning with over 500,000 question-answer pairs generated from the alphabetical index (Alpha-ID) of the ICD-10-GM and from the ICD-O-3 and OPS catalogues [4]. The questions cover coding tasks and the recognition of tumor diagnoses, for which OPS therapy descriptions served as closely related negative examples. The models were evaluated before and after each of five fine-tuning epochs using 2,024 unique, real-world tumor diagnosis texts documented in 2023 and 2024 in the tumor documentation system of the University Cancer Center Mainz.
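The construction of coding question-answer pairs from a catalogue can be sketched as follows. This is a minimal illustration, not the actual dataset format: the sample entries, the question template, and the function name are placeholders invented for this example.

```python
# Sketch: turning catalogue term->code entries into instruction-tuning pairs.
# The entries and the question wording below are illustrative placeholders.

def build_qa_pairs(term_to_code):
    """Create one coding question-answer pair per catalogue entry."""
    pairs = []
    for term, code in term_to_code.items():
        pairs.append({
            "question": f"Welcher ICD-10-GM-Kode gehört zur Diagnose '{term}'?",
            "answer": code,
        })
    return pairs

# Made-up sample entries in the spirit of the Alpha-ID index:
alpha_id_sample = {
    "Bösartige Neubildung der Lunge": "C34.9",
    "Mammakarzinom": "C50.9",
}
pairs = build_qa_pairs(alpha_id_sample)
print(pairs[0]["answer"])  # C34.9
```

In the same spirit, OPS therapy descriptions can be paired with a negative answer ("no tumor diagnosis") to yield the recognition examples mentioned above.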
Results: Both models demonstrated substantial performance improvements on the real-world test data after fine-tuning. For ICD-10-GM coding, exact accuracy (i.e., the full code being correct) increased from 5.6% to 45% for Llama 3.1 8B and from 0.9% to 41% for Qwen 2.5 7B. Partial accuracy (correct first three characters) rose from 46% to 77% for Llama and from 19% to 73% for Qwen. ICD-O-3 topography coding also improved, though performance remained considerably lower than for ICD-10-GM coding, with maximum accuracies of 27% (full match) and 59% (partial match), both reached by Llama 3.1 8B. The rate of malformed output codes was reduced to near zero. Recognition accuracy for tumor diagnoses improved from 94% (Llama) and 90% (Qwen) to 99% for both models.
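The two evaluation metrics can be stated precisely in a few lines. The predicted and gold codes below are invented for illustration; the metric definitions follow the description above (exact match of the full code vs. match of the first three characters).

```python
# Exact accuracy: the full predicted code equals the gold code.
# Partial accuracy: the first three characters (category level) match.
# The example codes are illustrative, not taken from the test set.

def exact_accuracy(preds, golds):
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def partial_accuracy(preds, golds):
    return sum(p[:3] == g[:3] for p, g in zip(preds, golds)) / len(golds)

preds = ["C34.9", "C50.1", "C18.7"]
golds = ["C34.9", "C50.9", "C20"]
print(exact_accuracy(preds, golds))    # 1 of 3 correct
print(partial_accuracy(preds, golds))  # 2 of 3 correct
```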
Discussion: Our findings demonstrate that fine-tuning based on question-answer pairs constructed from public catalogues can substantially improve the coding performance of open-source LLMs. While approaches that train directly on clinical texts can reach higher accuracies [5], our study shows that the potential of training LLMs on public data is not yet exhausted and that transforming structured public information into targeted instructions is a promising and underutilized approach to model training. The relatively lower performance on ICD-O-3 topography compared to ICD-10-GM, which correlates with the lower linguistic diversity of the ICD-O-3 catalogue, underlines the importance of a diverse training dataset. In the future, adding more diverse or reasoning-based examples, such as chain-of-thought prompts that leverage the classification structure more effectively, could further improve model performance.
Conclusion: We present an instruction-tuning approach based entirely on public catalogues to improve the ICD-10-GM and ICD-O-3 coding performance of LLMs on German tumor diagnoses. While the fine-tuned models showed substantial performance gains, the accuracy levels remain below the threshold for reliable clinical use, particularly with regard to full code prediction. Both the dataset and the best-performing model snapshots are available from https://huggingface.co/datasets/stefan-m-lenz/ICDOPS-QA-2024.
The authors declare that they have no competing interests.
The authors declare that an ethics committee vote is not required.
References
[1] Puts S, Zegers CML, Dekker A, Bermejo I. Developing an ICD-10 Coding Assistant: Pilot Study Using RoBERTa and GPT-4 for Term Extraction and Description-Based Code Selection. JMIR Formative Research. 2025 Feb 11;9(1):e60095. DOI: 10.2196/60095
[2] Dubey A, Jauhri A, Pandey A, Kadian A, Al-Dahle A, Letman A, et al. The Llama 3 Herd of Models [Preprint]. arXiv. 2024. DOI: 10.48550/arXiv.2407.21783
[3] Yang A, Yang B, Zhang B, Hui B, Zheng B, Yu B, et al. Qwen2.5 Technical Report [Preprint]. arXiv. 2025. DOI: 10.48550/arXiv.2412.15115
[4] Federal Institute for Drugs and Medical Devices (BfArM). Code Systems - Classifications. [cited 2025 Apr 9]. Available from: https://www.bfarm.de/EN/Code-systems/Classifications/_node.html
[5] Böhringer D, Angelova P, Fuhrmann L, Zimmermann J, Schargus M, Eter N, et al. Automatic inference of ICD-10 codes from German ophthalmologic physicians’ letters using natural language processing. Sci Rep. 2024 Apr 19;14(1):9035. DOI: 10.1038/s41598-024-59926-3