FractureGPT: A controlled comparative study of ChatGPT-4 and an orthopedic surgeon in detecting lower leg and ankle fractures

Meeting Abstract
DOI: 10.3205/25dkou481
URN: urn:nbn:de:0183-25dkou4811

Max Hofmann
Klinik für Unfallchirurgie, Orthopädie und Handchirurgie, Helios Klinikum Erfurt, Erfurt, Germany; Health and Medical University Erfurt, Erfurt, Germany

Robert Fahr
Klinik für Unfallchirurgie, Orthopädie und Handchirurgie, Helios Klinikum Erfurt, Erfurt, Germany; Health and Medical University Erfurt, Erfurt, Germany

Ahmed Elsaghir
Klinik für Unfallchirurgie, Orthopädie und Handchirurgie, Helios Klinikum Erfurt, Erfurt, Germany; Health and Medical University Erfurt, Erfurt, Germany

Thomas Mückley
Klinik für Unfallchirurgie, Orthopädie und Handchirurgie, Helios Klinikum Erfurt, Erfurt, Germany

Publisher: German Medical Science GMS Publishing House, Düsseldorf

Published 2025-10-31 (English). This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License.

Deutscher Kongress für Orthopädie und Unfallchirurgie (DKOU 2025), Berlin, 2025-10-28 to 2025-10-31. Abstracts | Digitalisierung 3. AB76-4039.
Deutsche Gesellschaft für Orthopädie und Unfallchirurgie; Deutsche Gesellschaft für Orthopädie und Orthopädische Chirurgie; Deutsche Gesellschaft für Unfallchirurgie; Berufsverband für Orthopädie und Unfallchirurgie

Objectives and questions: Large language models have demonstrated a broad range of applications, including text editing, content creation, and coding. In the medical field, ChatGPT has been explored in various contexts, such as answering patient questions, analyzing radiological reports, and detecting fractures. This randomized, controlled comparative study aimed to evaluate the diagnostic accuracy of ChatGPT-4 in detecting fractures compared with that of an orthopedic surgeon.

Material and methods: Radiographs from the FracAtlas database were used to evaluate ChatGPT-4's diagnostic performance in fracture detection against an experienced orthopedic surgeon under controlled conditions. A total of 80 images (30 with fractures, 50 without) depicting the ankle region were selected, randomly sorted, and uploaded to FractureGPT, a custom-made GPT specifically designed for fracture identification. FractureGPT was instructed to describe the depicted anatomical region and to report the presence of a fracture. The same images were presented to a trained orthopedic surgeon, who was tasked with determining the presence or absence of fractures. Accuracy, sensitivity, specificity, and the area under the curve (AUC) of the receiver operating characteristic (ROC) were calculated and compared.

Results: FractureGPT demonstrated a strong capability to detect and describe the depicted anatomical region, identifying the correct projection with an accuracy of 97.54%. Overall accuracy was 66.25% ± 5.29% for FractureGPT and 93.75% ± 2.71% for the trained orthopedic surgeon. Sensitivity was 50% ± 9.13% for FractureGPT and 96.67% ± 3.28% for the orthopedic surgeon. Specificity was 76% ± 6.04% for FractureGPT and 92.00% ± 3.84% for the orthopedic surgeon. The accuracy and sensitivity of the orthopedic surgeon were significantly better than those of FractureGPT (McNemar's test, p<0.000027 and p=0.00012, respectively). The difference in specificity was not significant (McNemar's test, p=0.057), although it showed a strong trend. To compare the diagnostic performance of FractureGPT and the orthopedic surgeon, the ROC curve and the corresponding AUC were calculated for each. The AUC was 0.63 for FractureGPT and 0.94 for the orthopedic surgeon, a significant difference in favor of the surgeon (DeLong test, p<0.0000001).

Discussion and conclusions: ChatGPT-4 demonstrates strong anatomical recognition, but it falls significantly short of a trained orthopedic surgeon in fracture detection accuracy and sensitivity. Although promising as a quick screening tool, ChatGPT-4 cannot replace trained medical expertise. However, integrating AI into fracture detection holds significant potential to enhance diagnostic workflows and enable rapid preliminary assessments.
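
The accuracy, sensitivity, specificity, and AUC figures in the Results can be re-derived from a 2x2 confusion table, as in the minimal Python sketch below. The abstract does not report raw counts, so the counts used here are an assumption, reconstructed from the reported percentages and the 30/50 split of fracture and non-fracture images. The ± values in the abstract are consistent with binomial standard errors, and the AUC values are consistent with the mean of sensitivity and specificity for a single yes/no reading; whether the authors computed them this way is not stated. The function names are illustrative only.

```python
from math import sqrt


def proportion_with_se(successes, total):
    """Return a proportion and its binomial standard error."""
    p = successes / total
    return p, sqrt(p * (1 - p) / total)


def diagnostic_metrics(tp, fn, tn, fp):
    """Summarise a 2x2 confusion table for a binary fracture reading."""
    sensitivity = proportion_with_se(tp, tp + fn)
    specificity = proportion_with_se(tn, tn + fp)
    accuracy = proportion_with_se(tp + tn, tp + fn + tn + fp)
    # A yes/no reading gives a single ROC operating point, so the
    # trapezoidal AUC reduces to the mean of sensitivity and specificity.
    auc = (sensitivity[0] + specificity[0]) / 2
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "auc": auc,
    }


# ASSUMPTION: counts reconstructed from the reported percentages
# (30 fracture images, 50 non-fracture images); not given in the abstract.
fracture_gpt = diagnostic_metrics(tp=15, fn=15, tn=38, fp=12)
surgeon = diagnostic_metrics(tp=29, fn=1, tn=46, fp=4)

for name, metrics in (("FractureGPT", fracture_gpt), ("Surgeon", surgeon)):
    print(name)
    for key in ("accuracy", "sensitivity", "specificity"):
        estimate, se = metrics[key]
        print(f"  {key}: {estimate:.2%} ± {se:.2%}")
    print(f"  auc: {metrics['auc']:.2f}")
```

The significance tests are not reproduced here: McNemar's test requires the paired per-image agreement between the two readers, which the abstract does not report.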