Deep learning bone tumor entity classification model collapses under real-world distribution shifts

Meeting Abstract 25dkou245 · DOI: 10.3205/25dkou245 · URN: urn:nbn:de:0183-25dkou2451

Anna Curto Vilalta: Technical University of Munich, Munich, Germany; Klinikum rechts der Isar, Munich, Germany

Ines del Val Guardiola: Technical University of Munich, Munich, Germany; Klinikum rechts der Isar, Munich, Germany

Rüdiger von Eisenhart-Rothe: Klinikum rechts der Isar, Munich, Germany

Sarah Consalvo: Klinikum rechts der Isar, Munich, Germany

Daniel Rueckert, Jendrik Hardes, Florian Hinterwimmer: Technical University of Munich, Munich, Germany; Klinikum rechts der Isar, Munich, Germany

Published by German Medical Science GMS Publishing House, Düsseldorf.

Deutscher Kongress für Orthopädie und Unfallchirurgie (DKOU 2025), Berlin, 28–31 October 2025. Congress of the Deutsche Gesellschaft für Orthopädie und Unfallchirurgie, Deutsche Gesellschaft für Orthopädie und Orthopädische Chirurgie, Deutsche Gesellschaft für Unfallchirurgie, and Berufsverband für Orthopädie und Unfallchirurgie. Session: Basic research | Osteoarthritis 2. Abstract AB35-4055. Published 31 October 2025; language: English.

This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License.

Objectives and questions: Artificial intelligence (AI) models have shown considerable potential for classifying bone tumors. However, their clinical adoption remains limited because they generalize poorly across healthcare centers. This study assesses the impact of training an AI model on single-center data and evaluates its performance on datasets with distribution shifts caused by differences in imaging centers, scanners, and acquisition conditions.

Material and methods: This retrospective study included x-rays from 635 patients diagnosed with enchondroma or atypical cartilaginous tumor (ACT). We fine-tuned a pre-trained Vision Transformer to classify the bone tumor entity. A weighted cross-entropy loss was used to counteract bias towards the majority class (enchondroma). To assess model robustness, we simulated real-world distribution shifts by applying three perturbation scenarios to the test set: (1) sensor noise, modeled by adding Gaussian noise; (2) reduced image quality, simulated by image blurring; and (3) variations in acquisition settings, mimicked by modifying brightness and contrast. Model performance on the test set was evaluated with classification metrics, including accuracy, sensitivity, and specificity.
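The abstract does not state how the class weights of the loss were chosen. The following is a minimal sketch, assuming inverse-frequency weights (a common choice, not confirmed by the study) and labels coded 0 = enchondroma, 1 = ACT; the helper names are hypothetical. The loss is divided by the summed weights, which matches the weighted-mean reduction of PyTorch's `torch.nn.CrossEntropyLoss(weight=...)`, but for readability it takes softmax probabilities rather than logits.

```python
import math


def inverse_frequency_weights(labels, num_classes=2):
    """Hypothetical helper: weight class c by N / (num_classes * count_c).

    Assumes every class occurs at least once in `labels`.
    """
    counts = [0] * num_classes
    for y in labels:
        counts[y] += 1
    n = len(labels)
    return [n / (num_classes * c) for c in counts]


def weighted_cross_entropy(probs, labels, weights):
    """Weighted cross-entropy over softmax probabilities.

    Each sample contributes -w[y] * log p(y); the sum is divided by the
    total weight of the samples (weighted mean).
    """
    total, norm = 0.0, 0.0
    for p, y in zip(probs, labels):
        w = weights[y]
        total += -w * math.log(p[y])
        norm += w
    return total / norm


# Toy example: 0 = enchondroma (majority), 1 = ACT (minority).
labels = [0, 0, 0, 1]
weights = inverse_frequency_weights(labels)  # ACT weighted 3x enchondroma
probs = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4]]
print(weighted_cross_entropy(probs, labels, weights))
print(weighted_cross_entropy(probs, labels, [1.0, 1.0]))  # unweighted
```

In this toy example the poorly predicted ACT sample dominates the weighted loss, so the optimizer is penalized more for missing the minority class than it would be with uniform weights.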
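The three perturbation scenarios applied to the test set can be sketched as follows, assuming grayscale x-rays stored as NumPy arrays scaled to [0, 1]. The function names and the noise level, blur width, brightness offset, and contrast factor are illustrative assumptions; the abstract does not report the values used. The blur is a separable Gaussian filter written with `np.convolve` to avoid extra dependencies.

```python
import numpy as np


def add_gaussian_noise(img, sigma=0.05, rng=None):
    """Scenario 1 (sensor noise): add zero-mean Gaussian noise, clip to [0, 1]."""
    rng = np.random.default_rng() if rng is None else rng
    return np.clip(img + rng.normal(0.0, sigma, size=img.shape), 0.0, 1.0)


def gaussian_kernel_1d(sigma):
    """Normalized 1-D Gaussian kernel truncated at about 3 sigma."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1, dtype=float)
    k = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()


def gaussian_blur(img, sigma=1.5):
    """Scenario 2 (reduced image quality): separable Gaussian blur."""
    k = gaussian_kernel_1d(sigma)
    r = len(k) // 2
    padded = np.pad(img, r, mode="edge")  # replicate border pixels
    # Filter every row, then every column; 'valid' trims the padding off again.
    rows = np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 0, rows)


def adjust_brightness_contrast(img, brightness=0.1, contrast=1.2):
    """Scenario 3 (acquisition settings): stretch intensities around the
    image mean by `contrast`, shift them by `brightness`, clip to [0, 1]."""
    mean = img.mean()
    return np.clip((img - mean) * contrast + mean + brightness, 0.0, 1.0)


# Usage on a random stand-in for a normalized grayscale radiograph.
rng = np.random.default_rng(0)
img = rng.random((64, 64))
perturbed = {
    "noise": add_gaussian_noise(img, sigma=0.05, rng=rng),
    "blur": gaussian_blur(img, sigma=1.5),
    "brightness_contrast": adjust_brightness_contrast(img, brightness=0.1, contrast=1.2),
}
```

In this sketch each function is applied to the test images only, leaving the trained model untouched, so any drop in performance is attributable to the perturbation rather than to retraining.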
For the sensitivity and specificity calculations, enchondroma was treated as the negative class and ACT as the positive class.

Results: As shown in Table 1, the model achieved an overall accuracy of 77%, with a sensitivity of 45.5% and a specificity of 89.3% on the test set. The low sensitivity reflects the model's difficulty in detecting the minority class (ACT), which we attribute to the class imbalance. On the modified test set simulating real-world distribution shifts, the model's ability to balance performance across classes collapsed: sensitivity for ACT dropped to 0%, while specificity (correct classification of enchondroma) reached 100%, meaning that the model classified every case as enchondroma.

Discussion and conclusions: Our results highlight the challenges of AI bias in bone tumor classification: models trained on single-center data fail even under very small distribution shifts (Figure 1). Their strong reliance on dataset-specific features raises concerns about their reliability in broader clinical settings. To improve robustness and generalizability, multi-center data sharing is essential for developing accurate AI-based diagnostic tools.
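With ACT as the positive class, sensitivity is TP / (TP + FN) and specificity is TN / (TN + FP). The sketch below uses made-up labels and hypothetical function names (it does not reproduce the study's data) to show why a model that predicts enchondroma for every case scores 0% sensitivity and 100% specificity, while its accuracy simply equals the share of enchondroma cases.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, TN, FP, FN), treating `positive` as the positive class."""
    tp = tn = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == positive:
            if p == positive:
                tp += 1
            else:
                fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    return tp, tn, fp, fn


def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, sensitivity (recall of the positive class) and
    specificity (recall of the negative class)."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred, positive)
    nan = float("nan")  # undefined when a class is absent
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn) if tp + fn else nan,
        "specificity": tn / (tn + fp) if tn + fp else nan,
    }


# Made-up labels: 0 = enchondroma (negative), 1 = ACT (positive).
y_true = [0] * 8 + [1] * 2
collapsed = [0] * len(y_true)  # a model that always predicts enchondroma
print(classification_metrics(y_true, collapsed))
# prints {'accuracy': 0.8, 'sensitivity': 0.0, 'specificity': 1.0}
```

The collapsed predictor keeps a seemingly respectable accuracy in this toy setting, so sensitivity and specificity expose a loss of ACT detection that accuracy alone can hide.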

[Figure 1]