Deutsche Gesellschaft für Orthopädie und Unfallchirurgie (DGOU), Deutsche Gesellschaft für Orthopädie und Orthopädische Chirurgie (DGOOC), Deutsche Gesellschaft für Unfallchirurgie (DGU), Berufsverband für Orthopädie und Unfallchirurgie (BVOU)
28.-31.10.2025
Berlin

Meeting Abstract

Closing the data gap: A fully automated AI pipeline for digitizing orthopedic paper records into structured registries

Johannes Pawelczyk - Sektion Sportorthopädie, Klinikum rechts der Isar, Technical University of Munich, Munich, Germany

Lucas Mair - Technical University of Munich, Munich, Germany

Kok Yuan Tang - TUM.ai Student Initiative, Technical University of Munich, Munich, Germany

Raef Larbi - TUM.ai Student Initiative, Technical University of Munich, Munich, Germany

Tien Duc Nguyen - TUM.ai Student Initiative, Technical University of Munich, Munich, Germany

Jorge Padilla Pérez - TUM.ai Student Initiative, Technical University of Munich, Munich, Germany

Lennart Gerdesmeyer - Sektion Sportorthopädie, Klinikum rechts der Isar, Technical University of Munich, Munich, Germany

Emre Nokay - Sektion Sportorthopädie, Klinikum rechts der Isar, Technical University of Munich, Munich, Germany

Rafael Vartian - Technical University of Munich, Munich, Germany

Sebastian Voigtländer - Virtual Diagnostics Unit, QuantCo Inc., Boston, USA

Sebastian Siebenlist - Sektion Sportorthopädie, Klinikum rechts der Isar, Technical University of Munich, Munich, Germany

Marco-Christopher Rupp - Sektion Sportorthopädie, Klinikum rechts der Isar, Technical University of Munich, Munich, Germany

Text

Objectives and questions: The digitization of paper-based medical documents presents significant challenges for healthcare institutions. Thousands of unstructured medical documents remain unused in hospital archives, despite containing valuable information for research and treatment optimization. This study aimed to develop an artificial intelligence (AI)-powered pipeline for the automated extraction of clinical variables of interest from scanned medical documents using large language models (LLMs).

Material and methods: A multi-stage processing pipeline was developed, combining optical character recognition (OCR) and LLMs (Figure 1). The pipeline consists of (i) OCR-based text extraction, (ii) LLM-assisted text correction, (iii) LLM-powered extraction of clinically relevant variables, and (iv) storage in a structured database. Several open-source and proprietary LLMs were evaluated and compared, including OpenAI’s ‘GPT-4o’ and ‘GPT-3.5’ as well as Meta’s ‘Llama 3 8B’ and ‘Llama 3 70B’.

Figure 1: Comprehensive pipeline for fully automated digitization of medical documents using optical character recognition and large language models, including the extraction of clinically relevant variables and storage in a structured database format, enabling queries for downstream applications.
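The four stages described above can be sketched as a small Python skeleton. This is only an illustration under stated assumptions, not the authors' implementation: the `Pipeline` class, the variable names in `VARIABLES`, the SQLite table layout, and the `fake_*` stand-ins for the OCR engine and the LLM calls are all hypothetical. In a real setting, the three callables would wrap an actual OCR engine and LLM endpoints.

```python
import json
import sqlite3
from dataclasses import dataclass
from typing import Callable, Dict, Optional

# Hypothetical variable names; the abstract reports 11 variables but does not list them all.
VARIABLES = ["affected_side", "joint_effusion", "acl_pathology"]


@dataclass
class Pipeline:
    ocr: Callable[[bytes], str]      # (i) scanned document -> raw text
    correct: Callable[[str], str]    # (ii) LLM-assisted text correction
    extract: Callable[[str], str]    # (iii) LLM call that returns a JSON string
    db: sqlite3.Connection           # (iv) structured storage

    def setup(self) -> None:
        # One text column per clinical variable, keyed by document id.
        cols = ", ".join(f"{v} TEXT" for v in VARIABLES)
        self.db.execute(
            f"CREATE TABLE IF NOT EXISTS reports (doc_id TEXT PRIMARY KEY, {cols})"
        )

    def run(self, doc_id: str, scan: bytes) -> Dict[str, Optional[str]]:
        raw = self.ocr(scan)
        clean = self.correct(raw)
        parsed = json.loads(self.extract(clean))
        # Keep only the expected variables; missing ones are stored as NULL.
        row = {
            v: None if parsed.get(v) is None else str(parsed.get(v))
            for v in VARIABLES
        }
        placeholders = ", ".join("?" for _ in range(len(VARIABLES) + 1))
        self.db.execute(
            f"INSERT OR REPLACE INTO reports VALUES ({placeholders})",
            [doc_id, *row.values()],
        )
        self.db.commit()
        return row


# Stand-ins so the sketch runs without an OCR engine or an LLM.
def fake_ocr(scan: bytes) -> str:
    return scan.decode("utf-8")


def fake_correct(text: str) -> str:
    return text.replace("effvsion", "effusion")


def fake_extract(text: str) -> str:
    return json.dumps({
        "affected_side": "left" if "left" in text else "right",
        "joint_effusion": "yes" if "effusion" in text else "no",
        "acl_pathology": None,
    })


pipeline = Pipeline(fake_ocr, fake_correct, fake_extract, sqlite3.connect(":memory:"))
pipeline.setup()
result = pipeline.run("doc-001", b"MRI left knee: joint effvsion present.")
```

Passing the stages in as callables keeps the sketch independent of any particular OCR engine or model, which mirrors the comparison of several LLMs within the same pipeline.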

Results: The developed pipeline achieved an OCR accuracy of 97.29 ± 1.91% (average bag-of-words similarity) on a dataset of 348 radiology reports. For clinical variable extraction (11 variables), the proprietary model ‘GPT-4o’ achieved an accuracy of 90% on a preliminary test dataset of 72 reports, while ‘GPT-3.5’ achieved 79%. Notably, the local, open-source models ‘Llama 3 70B’ and ‘Llama 3 8B’ achieved accuracies of 85% and 84%, respectively. The processing time per document was approximately six seconds. Certain variables were extracted with considerably higher average accuracy, e.g., affected side (97–100%), presence of joint effusion (92–99%), or anterior cruciate ligament pathology (90–94%).
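The two kinds of metric reported above can be illustrated with a short Python sketch. The abstract does not define "bag-of-words similarity", so the cosine similarity over word counts used here is an assumption, as are the function names and the toy data. Per-variable accuracy is computed as the share of reports in which the extracted value matches a manual reference annotation.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def bag_of_words_similarity(ocr_text: str, reference: str) -> float:
    """Cosine similarity between word-count vectors (one possible definition)."""
    a = Counter(re.findall(r"\w+", ocr_text.lower()))
    b = Counter(re.findall(r"\w+", reference.lower()))
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(c * c for c in a.values()) * sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def per_variable_accuracy(
    predictions: List[Dict[str, str]], gold: List[Dict[str, str]]
) -> Dict[str, float]:
    """Fraction of reports in which each variable matches the reference value."""
    if not gold:
        return {}
    return {
        v: sum(p.get(v) == g[v] for p, g in zip(predictions, gold)) / len(gold)
        for v in gold[0]
    }


# Toy data, invented purely for illustration.
gold = [
    {"affected_side": "left", "joint_effusion": "yes"},
    {"affected_side": "right", "joint_effusion": "no"},
]
predictions = [
    {"affected_side": "left", "joint_effusion": "yes"},
    {"affected_side": "right", "joint_effusion": "yes"},
]

identical = bag_of_words_similarity("Joint effusion present", "joint effusion present")
disjoint = bag_of_words_similarity("meniscus tear", "joint effusion")
accuracy = per_variable_accuracy(predictions, gold)
```

A bag-of-words measure ignores word order, so it rewards OCR output that recovers the right tokens even when line breaks or layout differ from the reference transcription.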

Discussion and conclusions: The proposed pipeline enables efficient, fully automated digitization of medical documents while maintaining high extraction accuracy. Local, open-source LLMs achieved performance approaching that of leading proprietary models. This has significant implications for medical research and care delivery, because previously inaccessible datasets become usable, for example by transforming existing records into structured registries.

Citation Note

Pawelczyk J, Mair L, Tang KY, Larbi R, Nguyen TD, Padilla Pérez J, Gerdesmeyer L, Nokay E, Vartian R, Voigtländer S, Siebenlist S, Rupp MC. Closing the data gap: A fully automated AI pipeline for digitizing orthopedic paper records into structured registries. In: Deutsche Gesellschaft für Orthopädie und Unfallchirurgie, Deutsche Gesellschaft für Orthopädie und Orthopädische Chirurgie, Deutsche Gesellschaft für Unfallchirurgie, Berufsverband für Orthopädie und Unfallchirurgie, editors. Deutscher Kongress für Orthopädie und Unfallchirurgie (DKOU 2025). Berlin, 28.-31.10.2025. Düsseldorf: German Medical Science GMS Publishing House; 2025. DocAB72-4617.

DOI: 10.3205/25dkou460

License

© Pawelczyk et al.
This abstract is licensed under the terms of the Creative Commons Attribution 4.0 International license.

Published: 2025-10-31
