German Congress of Orthopaedics and Traumatology (DKOU 2025)
Deep learning bone tumor entity classification model collapses under real-world distribution shifts
Klinikum rechts der Isar, Munich, Germany
Objectives and questions: Artificial intelligence (AI) models have demonstrated significant potential in classifying bone tumors. However, their clinical adoption remains limited due to poor generalizability across different healthcare centers. This study aims to assess the impact of training AI models on single-center data and evaluate their performance on datasets with distribution shifts caused by variations in imaging centers, scanners, and acquisition conditions.
Material and methods: This retrospective study included X-rays from 635 patients diagnosed with Enchondroma or Atypical Cartilaginous Tumor (ACT). We fine-tuned a pre-trained Vision Transformer to classify the two bone tumor entities. A weighted cross-entropy loss function was applied to counteract bias towards the majority class (Enchondroma).
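As an illustration of how a weighted cross-entropy loss counteracts class imbalance, the following is a minimal pure-Python sketch. The abstract does not state how the class weights were chosen, so the inverse-frequency scheme, the function names, and the toy labels and probabilities are assumptions made for this example only.

```python
import math
from collections import Counter

def inverse_frequency_weights(labels):
    """Assumed scheme: weight = n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * counts[c]) for c in counts}

def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true class), normalized by the summed weights."""
    total = sum(weights[y] * -math.log(p[y]) for p, y in zip(probs, labels))
    return total / sum(weights[y] for y in labels)

# Hypothetical toy labels: 0 = Enchondroma (majority), 1 = ACT (minority)
labels = [0, 0, 0, 1]
weights = inverse_frequency_weights(labels)

# Every sample gets the same moderate confidence in its true class
probs = [[0.6, 0.4], [0.6, 0.4], [0.6, 0.4], [0.4, 0.6]]
loss = weighted_cross_entropy(probs, labels, weights)

# Confidently missing the single ACT case ...
miss_act = weighted_cross_entropy(
    [[0.6, 0.4], [0.6, 0.4], [0.6, 0.4], [0.9, 0.1]], labels, weights)
# ... costs more than confidently missing one Enchondroma case
miss_ench = weighted_cross_entropy(
    [[0.1, 0.9], [0.6, 0.4], [0.6, 0.4], [0.4, 0.6]], labels, weights)
```

With these weights, an error on the minority class contributes three times as much to the loss as the same error on the majority class, which is the intended counterweight to the imbalance.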
To assess model robustness, we simulated real-world distribution shifts. We applied three perturbation scenarios to the test set: (1) sensor noise, modeled by adding Gaussian noise; (2) reduced image quality, simulated via image blurring; and (3) variations in acquisition settings, mimicked by modifying brightness and contrast. Model performance was evaluated on the test dataset using classification metrics, including accuracy, sensitivity, and specificity. For sensitivity and specificity calculations, we considered Enchondroma as the negative class and ACT as the positive class.
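The three perturbation scenarios can be sketched as follows. This is an illustrative pure-Python version operating on a grayscale image stored as nested lists with intensities in [0, 1]. The noise level, the 3x3 mean filter, and the brightness and contrast values are hypothetical, since the abstract does not report the parameters that were used.

```python
import random

def clip01(v):
    """Keep an intensity inside the valid [0, 1] range."""
    return min(1.0, max(0.0, v))

def add_gaussian_noise(img, sigma=0.05, seed=0):
    """Scenario 1: additive zero-mean Gaussian noise (sigma is an assumption)."""
    rng = random.Random(seed)
    return [[clip01(p + rng.gauss(0.0, sigma)) for p in row] for row in img]

def box_blur(img):
    """Scenario 2: 3x3 mean filter; border pixels average existing neighbors."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            vals = [img[j][i]
                    for j in range(max(0, y - 1), min(h, y + 2))
                    for i in range(max(0, x - 1), min(w, x + 2))]
            row.append(sum(vals) / len(vals))
        out.append(row)
    return out

def adjust_brightness_contrast(img, brightness=0.1, contrast=1.2):
    """Scenario 3: scale around mid-gray (0.5), then shift by `brightness`."""
    return [[clip01(contrast * (p - 0.5) + 0.5 + brightness) for p in row]
            for row in img]

# Hypothetical 3x3 test image (a simple intensity gradient)
image = [[0.1, 0.2, 0.3],
         [0.4, 0.5, 0.6],
         [0.7, 0.8, 0.9]]

shifted = {
    "noise": add_gaussian_noise(image),
    "blur": box_blur(image),
    "brightness_contrast": adjust_brightness_contrast(image),
}
```

Each perturbed copy would then be passed through the unchanged model, so that any drop in the metrics can be attributed to the shift rather than to retraining.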
Results: As shown in Table 1 [Tab. 1], the model achieved an overall accuracy of 77%, with a sensitivity of 45.5% and a specificity of 89.3% on the original test set. The low sensitivity shows that the model struggled to detect the minority class (ACT), which we attribute to the class imbalance.
Table 1: Model performance on the original and modified test datasets.
When the model was evaluated on the modified test set, which simulates real-world distribution shifts, its ability to detect both classes collapsed. Under these conditions, sensitivity for ACT dropped to 0%, while specificity (correct classification of Enchondroma) reached 100%, indicating that the model classified all cases as Enchondroma.
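The pattern of 0% sensitivity together with 100% specificity is exactly what a classifier that always predicts the negative class produces. The short sketch below illustrates this with hypothetical labels (not the study data), using ACT as the positive class and Enchondroma as the negative class, as defined in the methods.

```python
def sensitivity_specificity(y_true, y_pred, positive=1):
    """Return (sensitivity, specificity) for a binary classification task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical labels, not the study data: 1 = ACT, 0 = Enchondroma
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]

# A collapsed model predicts Enchondroma for every case
y_collapsed = [0] * len(y_true)

sens, spec = sensitivity_specificity(y_true, y_collapsed)
accuracy = sum(t == p for t, p in zip(y_true, y_collapsed)) / len(y_true)
# sens = 0.0, spec = 1.0, accuracy = 0.7
```

In this toy example accuracy still equals the share of the majority class, which is why accuracy alone can hide such a collapse on imbalanced data.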
Discussion and conclusions: Our results highlight the challenges of AI bias in bone tumor classification, with models trained on single-center data failing even under very small distribution shifts (Figure 1 [Fig. 1]). The strong reliance on dataset-specific features raises concerns about the reliability of such models in broader clinical settings. To improve robustness and generalizability, multi-center data sharing is essential for developing accurate AI-based diagnostic tools.




