Deutscher Kongress für Orthopädie und Unfallchirurgie 2025 (DKOU 2025)
FractureGPT: A controlled comparative study of ChatGPT-4 and an orthopedic surgeon in detecting lower leg and ankle fractures
Health and Medical University Erfurt, Erfurt, Germany
Objectives and questions: Large language models (LLMs) have found applications across a broad range of tasks, including text editing, content creation, and coding. In the medical field, ChatGPT has been explored in various contexts, such as answering patient questions, analyzing radiological reports, and detecting fractures.
The aim of this study was to conduct a randomized, controlled comparative study to evaluate the diagnostic accuracy of ChatGPT-4 in fracture detection compared with that of an orthopedic surgeon.
Material and methods: Radiographs from the FracAtlas database were used to evaluate ChatGPT-4's diagnostic performance in fracture detection against an experienced orthopedic surgeon under controlled conditions. A total of 80 images (30 with fractures, 50 without) depicting the ankle region were selected, shuffled into random order, and uploaded to FractureGPT, a custom GPT specifically designed for fracture identification. FractureGPT was instructed to describe the depicted anatomical region and to report the presence of a fracture.
The same images were presented to a trained orthopedic surgeon, who was tasked with determining the presence or absence of fractures. Accuracy, sensitivity, specificity, and the area under the receiver operating characteristic (ROC) curve (AUC) were calculated and compared.
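As a minimal illustration (not the study's actual analysis code), the following Python sketch shows how these metrics can be derived from one reader's binary calls against ground truth; the arrays below are hypothetical placeholders standing in for the 80-image sample.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Hypothetical placeholder labels for the 80 images (1 = fracture);
# y_pred stands in for one reader's calls (FractureGPT or the surgeon).
rng = np.random.default_rng(0)
y_true = np.concatenate([np.ones(30, int), np.zeros(50, int)])
y_pred = (rng.random(80) < 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)   # true-positive rate
specificity = tn / (tn + fp)   # true-negative rate
# For binary (yes/no) calls, the ROC AUC reduces to
# (sensitivity + specificity) / 2, i.e. balanced accuracy.
auc = roc_auc_score(y_true, y_pred)
print(accuracy, sensitivity, specificity, auc)
```

Note that the AUCs reported below are consistent with this identity for binary ratings: (50% + 76%)/2 = 0.63 for FractureGPT and (96.67% + 92%)/2 ≈ 0.94 for the surgeon.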
Results: FractureGPT demonstrated a strong capability to detect and describe the depicted anatomical region, identifying the correct projection with an accuracy of 97.54%. Overall accuracy was 66.25% ± 5.29% for FractureGPT and 93.75% ± 2.71% for the trained orthopedic surgeon. Sensitivity was 50.00% ± 9.13% for FractureGPT and 96.67% ± 3.28% for the orthopedic surgeon. Specificity was 76.00% ± 6.04% for FractureGPT and 92.00% ± 3.84% for the trained orthopedic surgeon.
Accuracy and sensitivity of the orthopedic surgeon were significantly better than those of FractureGPT when compared using McNemar's test (p<0.000027 and p=0.00012, respectively). The difference in specificity was not significant (McNemar's test, p=0.057) but showed a strong trend.
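For reference, a minimal sketch of such a paired comparison using statsmodels' mcnemar; the per-image correctness flags here are hypothetical, not the study data.

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Hypothetical per-image correctness flags for the two readers
# (True = the reader's call matched ground truth on that image).
rng = np.random.default_rng(1)
gpt_correct = rng.random(80) < 0.66
surgeon_correct = rng.random(80) < 0.94

# 2x2 table of paired outcomes: rows = GPT correct/incorrect,
# columns = surgeon correct/incorrect. McNemar's test uses the
# discordant cells (off-diagonal counts).
table = np.array([
    [np.sum(gpt_correct & surgeon_correct),  np.sum(gpt_correct & ~surgeon_correct)],
    [np.sum(~gpt_correct & surgeon_correct), np.sum(~gpt_correct & ~surgeon_correct)],
])
print(mcnemar(table, exact=True).pvalue)
```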
To compare the diagnostic performance of FractureGPT and the orthopedic surgeon, the ROC curve and the corresponding AUC were calculated. The AUC was 0.63 for FractureGPT and 0.94 for the orthopedic surgeon, a significant difference when compared using the DeLong test (p<0.0000001).
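SciPy and scikit-learn do not ship a DeLong test, so the sketch below hand-rolls the standard midrank ("fast DeLong") formulation for two correlated AUCs measured on the same cases; the function name and the example data are hypothetical placeholders.

```python
import numpy as np
from scipy.stats import norm, rankdata

def delong_test(y_true, pred_a, pred_b):
    """Two-sided DeLong test for the difference of two correlated AUCs
    measured on the same cases (midrank formulation)."""
    y_true = np.asarray(y_true, bool)
    m, n = int(y_true.sum()), int((~y_true).sum())
    aucs, v10s, v01s = [], [], []
    for scores in (pred_a, pred_b):
        scores = np.asarray(scores, float)
        pos, neg = scores[y_true], scores[~y_true]
        all_ranks = rankdata(np.concatenate([pos, neg]))  # midranks handle ties
        v10 = (all_ranks[:m] - rankdata(pos)) / n         # one component per positive
        v01 = 1.0 - (all_ranks[m:] - rankdata(neg)) / m   # one component per negative
        aucs.append(v10.mean())                           # Mann-Whitney AUC
        v10s.append(v10)
        v01s.append(v01)
    # Sample covariances of the structural components across cases.
    s10, s01 = np.cov(v10s[0], v10s[1]), np.cov(v01s[0], v01s[1])
    var = (s10[0, 0] + s10[1, 1] - 2 * s10[0, 1]) / m \
        + (s01[0, 0] + s01[1, 1] - 2 * s01[0, 1]) / n
    z = (aucs[0] - aucs[1]) / np.sqrt(var)
    return aucs, 2 * norm.sf(abs(z))  # AUCs and two-sided p-value

# Hypothetical example: 30 fractures, 50 controls, two binary readers.
rng = np.random.default_rng(2)
y_true = np.concatenate([np.ones(30, int), np.zeros(50, int)])
reader_a = (rng.random(80) < 0.5).astype(int)                       # noisy reader
reader_b = (y_true.astype(bool) ^ (rng.random(80) < 0.05)).astype(int)  # near-perfect reader
(auc_a, auc_b), p = delong_test(y_true, reader_a, reader_b)
```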
Discussion and conclusions: While ChatGPT-4 demonstrates strong anatomical recognition, it falls significantly short of a trained orthopedic surgeon in fracture detection accuracy and sensitivity. Although promising as a quick screening tool, ChatGPT-4 cannot replace trained medical expertise. Nevertheless, integrating AI into fracture detection holds significant potential to enhance diagnostic workflows and enable rapid preliminary assessments.



