70. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V.
Towards automatic inference of agents for cancer treatment from substance data in cancer registry data
2Hamburgisches Krebsregister, Hamburg, Germany
3Klinische Landesauswertungsstelle Niedersachsen (KLast), Oldenburg, Germany
4Universität Oldenburg, Abteilung für Assistenzsysteme und Medizintechnik, Oldenburg, Germany
Introduction: In German cancer registration, diagnostic or treatment events are reported using the oBDS format [1]. This format defines all relevant data fields, types, and values, except for a few cases such as the substances used in systemic therapies, which are reported as free text. Each substance field is supposed to describe exactly one agent by its generic name, but substances are sometimes reported under their trade names, contain typos, or exhibit various other issues. For example, the dataset at the KLast (Klinische Landesauswertungsstelle Niedersachsen) [2] contains 685,228 reported substances comprising 8,655 unique values, but only 8% of these unique values properly describe an agent. The remaining 92% have to be fixed manually by documentalists, which is both tedious and time-consuming.
This paper focuses on the automatic inference of agents from free text using actual substance data of the KLast. Its contribution is two-fold: first, we explore the dataset and characterize typical issues with the reported substances; second, we propose a preliminary approach that infers the agents using the Levenshtein distance [3], a metric for the difference between two character sequences.
Methods: Initially, we import a work-in-progress dictionary of the tumor best-of task force of the Plattform 65c [4] that maps known trade names and abbreviations to their correct agents. Then, all unknown substances (e.g. empty strings, variations of ‘UNKNOWN’, ‘NULL’, or ‘#NAME?’) are flagged as such.
Afterwards, if a substance is found in the dictionary with case-insensitive search, its agent is immediately determined. Otherwise, the Levenshtein distances between the given substance and all dictionary values are individually calculated. If only one value has the smallest distance to the given substance, it is considered a match. Otherwise, no match can be determined.
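The lookup procedure described above can be sketched as follows. This is a minimal illustration and not the implementation used at the KLast: the function names, the status labels, the list of unknown markers, and the toy dictionary are assumptions made for this example. It also assumes that distances are computed against the dictionary keys (known spellings, trade names, and abbreviations) and that a tie between several closest keys counts as no match.

```python
from typing import Dict, Optional, Tuple

# Illustrative markers only; the real list of "unknown" variants is not given here.
UNKNOWN_MARKERS = {"", "unknown", "null", "#name?"}


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def infer_agent(substance: str, dictionary: Dict[str, str],
                max_distance: int = 3) -> Tuple[str, Optional[str]]:
    """Return (status, agent) for one reported substance string."""
    key = substance.strip().casefold()
    if key in UNKNOWN_MARKERS:
        return ("unknown", None)
    # Case-insensitive view of the dictionary.
    lookup = {name.casefold(): agent for name, agent in dictionary.items()}
    if key in lookup:
        return ("exact", lookup[key])
    distances = {name: levenshtein(key, name) for name in lookup}
    if not distances:
        return ("unresolved", None)
    best = min(distances.values())
    closest = [name for name, d in distances.items() if d == best]
    # Accept only a unique closest entry within the allowed distance.
    if best > max_distance or len(closest) != 1:
        return ("unresolved", None)
    return ("fuzzy", lookup[closest[0]])


# Toy dictionary for demonstration purposes only.
TOY_DICTIONARY = {
    "Azacitidin": "Azacitidin",
    "Cytarabin": "Cytarabin",
    "Dexamethason": "Dexamethason",
    "Vidaza": "Azacitidin",
}

if __name__ == "__main__":
    for reported in ["Cytabarin", "vidaza", "NULL", "xyz"]:
        print(reported, "->", infer_agent(reported, TOY_DICTIONARY))
```

Returning a status alongside the agent keeps flagged unknowns, exact hits, and distance-based matches distinguishable, so that unresolved cases can still be handed to a documentalist.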
Results: We restrict ourselves to small Levenshtein distances of 1 to 3 between given substances and values in the dictionary. Using this approach, 609,655 of the 685,228 reported substances (about 89%) have been resolved, which corresponds to 2,021 (23%) of all unique values. When ignoring our self-imposed restriction to small distances, a theoretical maximum of 6,488 (75%) of all unique values may be resolved. Conversely, 2,167 (25%) of all unique values remain unresolvable regardless of the permitted distance.
Discussion: To avoid misclassifications, we currently advise against larger Levenshtein distances. Substance data occasionally contains several agents in one field; these cannot be identified individually, so matching such a value to a single agent erroneously omits the others. Smaller distances, in contrast, mostly indicate typos (e.g. Acazytidin instead of Azacitidin), inversions of phonemes (e.g. Cytabarin instead of Cytarabin), or export artifacts (e.g. Dexmethasadon instead of Dexamethason), which can be resolved rather reliably.
Conclusion: To improve the approach, tokenizing the given substances before calculating Levenshtein distances seems worthwhile in order to handle cases where more than one agent has been reported at the same time. Moreover, phonetic algorithms such as the German Cologne phonetics [5] may be suitable to address the phonetic issues. Furthermore, the ratio of the Levenshtein distance to the total length of a given substance, as well as the delta between the closest and the next-closest match, could be incorporated to assess the reliability of a match. In any case, the approach should above all avoid false positive matches, since a fallback to human documentalists is still an option.
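Three of the proposed improvements (tokenization, the ratio of distance to length, and the delta between the closest and next-closest match) could be prototyped roughly as shown below. This is a sketch under stated assumptions and has not been evaluated on the KLast data: the separator characters, the function names, and the candidate list are hypothetical, and no acceptance thresholds are suggested.

```python
import re
from typing import List, Optional, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def tokenize(substance: str) -> List[str]:
    """Split a free-text field on separators that are assumed to join agents."""
    parts = re.split(r"[+/,;&]", substance)
    return [part.strip() for part in parts if part.strip()]


def score(token: str,
          candidates: List[str]) -> Optional[Tuple[str, float, Optional[int]]]:
    """Return (closest candidate, distance / token length, margin to runner-up)."""
    key = token.strip().casefold()
    if not key or not candidates:
        return None
    ranked = sorted((levenshtein(key, c.casefold()), c) for c in candidates)
    best_distance, best = ranked[0]
    relative = best_distance / len(key)
    # Margin 0 means a tie; None means there is no second candidate to compare.
    margin = ranked[1][0] - best_distance if len(ranked) > 1 else None
    return best, relative, margin


if __name__ == "__main__":
    agents = ["Cytarabin", "Dexamethason"]  # hypothetical candidate list
    for token in tokenize("Cytabarin + Dexmethasadon"):
        print(token, "->", score(token, agents))
```

A small relative distance combined with a large margin would indicate a more trustworthy match, whereas a margin of zero corresponds to the tie case in which no match can be determined.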
The authors declare that they have no competing interests.
The authors declare that an ethics committee vote is not required.
References
[1] Bundeseinheitlicher onkologischer Basisdatensatz. Aktuelle Versionen der Dateien zur XML-Schnittstelle oBDS. Arbeitsgemeinschaft Deutscher Tumorzentren e.V.; [cited 2025 Apr 16]. Available from: https://www.basisdatensatz.de/xml/
[2] Klinische Landesauswertungsstelle Niedersachsen. Aufgaben der KLast. OFFIS CARE GmbH; [cited 2025 Apr 16]. Available from: https://www.klast-n.de/aufgaben-ziele.html
[3] Levenshtein VI. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady. 1966;10(8):707-710.
[4] Plattform § 65c. Klinische Krebsregister Sachsen-Anhalt GmbH; [cited 2025 Apr 16]. Available from: https://plattform65c.de/
[5] Postel HJ. Die Kölner Phonetik. Ein Verfahren zur Identifizierung von Personennamen auf der Grundlage der Gestaltanalyse. IBM-Nachrichten. 1969;19:925-931.



