70. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V.

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V. (GMDS)
07.-11.09.2025
Jena


Meeting Abstract

Introducing Medical Semantic Annotation Guidelines for German Clinical Documentation with SNOMED CT

Justin Hofenbitzer 1
Stefan Schulz 2,3
Martin Boeker 1
Peter Klügl 3
Sarah Riepenhausen 4
Christina Lohr 5
Jacqueline Lammert 1
Andrea Riedel 6,7
Luise Modersohn 1
1 Technical University of Munich, TUM School of Medicine and Health, Institute for AI and Informatics in Medicine (AIIM), TUM University Hospital, Munich, Germany
2 Medizinische Universität Graz, Graz, Austria
3 Averbis GmbH, Freiburg, Germany
4 University of Münster, Institute of Medical Informatics, Münster, Germany
5 Institute for Medical Informatics, Statistics, and Epidemiology, Leipzig University, Leipzig, Germany
6 Erlangen University Hospital, Medical Center for Information and Communication Technology, Erlangen, Germany
7 Friedrich-Alexander-Universität Erlangen-Nürnberg, Medical Informatics, Erlangen, Germany

Text

Introduction: The systematic annotation of clinical free-text documents with standardized terminologies, such as SNOMED CT (SCT), is essential for developing interoperable and semantically enriched healthcare resources. Domain-specific natural language corpora with meaningful and expressive annotations are fundamental for computational linguistics and natural language processing (NLP), enabling tasks like training, fine-tuning, and evaluating large language models. However, there is a shortage of clinical corpora [1], and only a few contain semantic annotations [2]. Prominent counterexamples are projects like AIDAVA (https://www.aidava.eu) or JIGSAW (https://research.manchester.ac.uk/en/projects/assembling-the-data-jigsaw-powering-robust-research-on-the-causes), which paved the way towards standardized semantic text annotations using existing clinical ontologies [3]. The German Medical Text Corpus (GeMTeX) project aims to provide the largest shareable German clinical document collection with conceptual annotations from SCT [4], [5], and hereby introduces its detailed and methodologically grounded annotation guidelines.

Methods: The guidelines are informed by experiences and annotation principles from comparable consortia [3]. These experiences emphasize focusing on the semantic core of the texts, specifically by using unary predicates (the SCT concepts) and binary relations between them. The annotations should be as literal and straightforward as possible to keep annotators from over-interpreting the text spans.

Results: The GeMTeX semantic annotation guidelines define three major concept classes: Core Concepts, Modifier Concepts, and Qualifier Concepts. The Core Concepts are the most relevant concepts for our semantic annotation, as they include Clinical Conditions (e.g., the SCT hierarchies Clinical Finding and Disorder), Procedures, Medications, Substances, and Observables. The Modifier and Qualifier Concepts refine an annotation when a Core Concept alone does not sufficiently express the corresponding text span. The Modifier Concepts covered by our guidelines comprise the SCT hierarchies Body Structure, Physical Object, and Organism. The Qualifier Concept is identical to the SCT hierarchy Qualifier Value and contains, among others, dates, units, or factuality statements. The guidelines also cover familial diseases.
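
The grouping of SCT hierarchies into the three concept classes can be illustrated with a minimal Python sketch. It is not part of the guidelines: the names ConceptClass, HIERARCHY_TO_CLASS, and concept_class are hypothetical, and the hierarchy labels follow the wording of this abstract rather than the official SCT hierarchy names.

```python
from enum import Enum


class ConceptClass(Enum):
    """The three concept classes named in the abstract."""
    CORE = "Core Concept"
    MODIFIER = "Modifier Concept"
    QUALIFIER = "Qualifier Concept"


# Assumption: labels follow the abstract's wording, not official SCT names.
HIERARCHY_TO_CLASS = {
    "Clinical Finding": ConceptClass.CORE,
    "Disorder": ConceptClass.CORE,
    "Procedure": ConceptClass.CORE,
    "Medication": ConceptClass.CORE,
    "Substance": ConceptClass.CORE,
    "Observable": ConceptClass.CORE,
    "Body Structure": ConceptClass.MODIFIER,
    "Physical Object": ConceptClass.MODIFIER,
    "Organism": ConceptClass.MODIFIER,
    "Qualifier Value": ConceptClass.QUALIFIER,
}


def concept_class(hierarchy: str) -> ConceptClass:
    """Return the concept class for a hierarchy label (case-insensitive)."""
    key = " ".join(hierarchy.split()).title()
    try:
        return HIERARCHY_TO_CLASS[key]
    except KeyError:
        raise ValueError(
            f"Hierarchy not covered by this sketch: {hierarchy!r}"
        ) from None


if __name__ == "__main__":
    for label in ("Disorder", "body structure", "Qualifier value"):
        print(label, "->", concept_class(label).value)
```

A lookup table like this keeps the class assignment in one place, so an annotation tool could reject concepts from hierarchies that the guidelines do not cover.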

To build traceable knowledge representations, we introduce relations between annotated concepts. The unlabeled and unidirectional relations follow intuitive and concise rules. For example, relations always point from a cause to its corresponding effect or go from the lower to the upper bound within a value range.
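
The two direction rules mentioned above can be sketched as a small data structure. This is an illustrative assumption, not the project's implementation: Annotation, Relation, find_span, cause_effect, and value_range are hypothetical names, and the German example phrases are invented for demonstration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """A text span; a real annotation would also carry an SCT concept."""
    start: int
    end: int
    text: str


@dataclass(frozen=True)
class Relation:
    """Unlabeled, unidirectional relation from source to target."""
    source: Annotation
    target: Annotation


def find_span(text: str, token: str) -> Annotation:
    """Locate the first occurrence of token in text (hypothetical helper)."""
    start = text.find(token)
    if start < 0:
        raise ValueError(f"{token!r} not found in text")
    return Annotation(start, start + len(token), token)


def cause_effect(cause: Annotation, effect: Annotation) -> Relation:
    """Rule: the relation points from a cause to its effect."""
    return Relation(source=cause, target=effect)


def value_range(a: Annotation, b: Annotation) -> Relation:
    """Rule: the relation goes from the lower to the upper bound."""
    lower, upper = sorted((a, b), key=lambda ann: float(ann.text))
    return Relation(source=lower, target=upper)


if __name__ == "__main__":
    dose = "Dosis 5 bis 10 mg"  # invented example phrase
    rng = value_range(find_span(dose, "10"), find_span(dose, "5"))
    print(rng.source.text, "->", rng.target.text)

    finding = "Dyspnoe bei Pneumonie"  # invented example phrase
    rel = cause_effect(find_span(finding, "Pneumonie"),
                       find_span(finding, "Dyspnoe"))
    print(rel.source.text, "->", rel.target.text)
```

Because the relations are unlabeled, the direction alone carries the meaning, which is why the sketch enforces the ordering inside the constructor helpers instead of leaving it to the caller.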

In addition, our guidelines are specifically tailored to the German language. Besides universal linguistic challenges like ambiguity, syntactic coordination, or copula constructions, we provide concise and comprehensible instructions for typical German peculiarities, e.g., separable particle verbs or unique terminology.

Conclusion: The GeMTeX semantic annotation guidelines are informed by large semantic annotation projects for free-text documents in the clinical domain. They specifically address the usage of SCT and highlight linguistic and terminological challenges inherent to the German clinical language. Explicit annotation rules and decision-making criteria, accompanied by illustrative examples, are integral to enhancing consistency between annotators. This initial version of the annotation guidelines is designed as a methodological resource for researchers in clinical NLP or terminology annotation, supporting reproducibility and transparency. We aim to contribute to further standardization efforts. The GeMTeX annotation guidelines are available at https://doi.org/10.5281/zenodo.15689931.

The authors declare that they have no competing interests.

The authors declare that a positive ethics committee vote has been obtained.


References

[1] Hahn U. Clinical Document Corpora -- Real Ones, Translated and Synthetic Substitutes, and Assorted Domain Proxies: A Survey of Diversity in Corpus Design, with Focus on German Text Data. JAMIA Open. 2025;8(3):ooaf024. DOI: 10.1093/jamiaopen/ooaf024
[2] Jovanović J, Bagheri E. Semantic Annotation in Biomedicine: The Current Landscape. J Biomed Semant. 2017 Sep;8(1):44.
[3] Schulz S, Del-Pinto W, Han L, Kreuzthaler M, Aghaei S, Nenadic G. Towards Principles of Ontology-Based Annotation of Clinical Narratives. In: Proceedings of the International Conference on Biomedical Ontologies 2023; 2023 Aug 28 - Sep 1; Brasilia, Brazil. (CEUR Workshop Proceedings; 3603). [cited 2025 Apr 25]. Available from: https://ceur-ws.org/Vol-3603/Paper4.pdf
[4] Meineke F, Modersohn L, Loeffler M, Boeker M. Announcement of the German Medical Text Corpus Project (GeMTeX). In: Caring is Sharing – Exploiting the Value in Data for Health and Innovation. Proceedings of MIE 2023. IOS; 2023. DOI: 10.3233/SHTI230283
[5] Lohr C, Matthies F, Faller J, Modersohn L, Riedel A, Hahn U, Kiser R, Boeker M, Meineke F. De-Identifying GRASCCO - A Pilot Study for the De-Identification of the German Medical Text Project (GeMTeX) Corpus. Stud Health Technol Inform. 2024 Aug 30;317:171-179. DOI: 10.3233/SHTI240853