German Congress of Orthopaedics and Traumatology (DKOU 2025)
28-31 October 2025
Berlin
Deutscher Kongress für Orthopädie und Unfallchirurgie 2025 (DKOU 2025)
Assessing the potential and capability of different Large-Language Models (LLMs) in spine surgery for patients and medical experts
Objectives and questions: Since their release, large language models (LLMs) such as ChatGPT-4 Pro, Copilot Pro 16.03, Gemini Advanced 1.5 Pro, and Claude 3.5 Sonnet have demonstrated the ability to pass medical licensing exams and generate differential diagnoses. However, their accountability, capability, and trustworthiness in medical applications remain subjects of debate. This study aimed to evaluate how effectively different LLMs answer common medical questions related to spinal surgery and provide recommendations for further diagnostics and treatment.
Material and methods: The study was conducted at a certified Level 1 spine center. A set of ten fictional medical questions and realistic clinical case scenarios was developed. The selected LLMs were prompted to respond to these cases from two perspectives: that of an expert in spine surgery and that of a patient. Responses were assessed for clinical completeness, correctness, and adaptability by two board-certified orthopedic surgeons using a five-point grading scale. Statistical analysis was performed to compare the results.
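The abstract does not report which statistical test was used to compare the models. As a rough illustration only, a non-parametric comparison of 1-5 ratings across the four models could look like the following sketch; the model labels, the rating values, and the choice of a Kruskal-Wallis test are assumptions for illustration, not details from the study.

```python
# Illustrative sketch only: the abstract does not name the test actually used.
# Compares hypothetical 1-5 ratings for one outcome measure (e.g. correctness)
# across the four evaluated LLMs with a Kruskal-Wallis test.
from scipy import stats

ratings = {  # placeholder values, not the study data
    "ChatGPT-4 Pro":           [5, 4, 5, 4, 5, 4, 5, 5, 4, 5],
    "Copilot Pro":             [4, 4, 5, 4, 4, 5, 4, 4, 5, 4],
    "Gemini Advanced 1.5 Pro": [4, 5, 4, 4, 5, 4, 4, 5, 4, 4],
    "Claude 3.5 Sonnet":       [5, 5, 4, 5, 4, 5, 5, 4, 5, 4],
}

h_stat, p_value = stats.kruskal(*ratings.values())
print(f"Kruskal-Wallis H = {h_stat:.2f}, p = {p_value:.3f}")
# p > 0.05 would be consistent with no significant difference between models.
```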
Results: The LLMs achieved high scores (>4) in most evaluated categories. Correctness and completeness were rated slightly higher than adaptability across all models. ChatGPT and Claude provided more detailed responses, using more sentences than Gemini and Copilot. No significant differences were found between the LLMs (p > 0.05) across the three outcome measures, nor between responses tailored for medical experts and those tailored for patients (p > 0.05). However, responses for patients used significantly shorter words on average, as measured by syllables per word.
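The readability measure "syllables per word" can be approximated with a simple vowel-group heuristic. The sketch below shows how expert- and patient-oriented responses might be compared on this measure; the placeholder texts, the heuristic syllable counter, and the Mann-Whitney U test are assumptions for illustration and are not reported in the abstract.

```python
# Illustrative sketch only: heuristic syllables-per-word comparison between
# expert- and patient-oriented responses (texts and test choice are assumed).
import re
from scipy import stats

def count_syllables(word: str) -> int:
    # Rough heuristic: one syllable per group of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def syllables_per_word(text: str) -> float:
    words = re.findall(r"[A-Za-z']+", text)
    return sum(count_syllables(w) for w in words) / max(1, len(words))

# Placeholder responses, not the study data.
expert_texts = [
    "Lumbar decompression is indicated for persistent radiculopathy.",
    "MRI is recommended to evaluate foraminal stenosis before instrumentation.",
    "Conservative management precedes operative intervention in most cases.",
]
patient_texts = [
    "Surgery can help when leg pain from a pinched nerve does not go away.",
    "A scan of your back shows where the nerve is being squeezed.",
    "Most people try rest, exercise, and pain relief before an operation.",
]

expert_scores = [syllables_per_word(t) for t in expert_texts]
patient_scores = [syllables_per_word(t) for t in patient_texts]

u_stat, p_value = stats.mannwhitneyu(expert_scores, patient_scores)
print(f"Mann-Whitney U = {u_stat:.1f}, p = {p_value:.3f}")
```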
Discussion and conclusions: Current LLMs provide reliable and well-structured information on spine surgery for both medical professionals and patients. As the technology advances, the role of LLMs in supporting medical staff and educating patients is expected to expand further.



