70. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V.
A systematic review of statistical model selection techniques used to predict COVID-19 health outcomes
2Medical University of Vienna, Center for Medical Data Science, Institute of Clinical Biometrics, Vienna, Austria
3Medical University of Vienna, Department of Medicine III, Division of Nephrology and Dialysis, Vienna, Austria
4Malawi Liverpool Wellcome Research Programme, Blantyre, Blantyre, Malawi
5Department of Clinical Sciences, Liverpool School of Tropical Medicine, Liverpool, United Kingdom
Text
Background: Following the outbreak of the COVID-19 pandemic, the scientific community rapidly developed models to predict health outcomes. Topic Group 2 (TG2, ‘selection of variables and functional forms in multivariable analysis’) of the Strengthening Analytical Thinking for Observational Studies (STRATOS) initiative has initiated a review of the variable and functional form selection techniques used in these publications. The review builds on existing work by Wynants et al. [1] but focuses on selection approaches. TG2 members hypothesised that, during the health crisis, researchers relied on methods that they considered trustworthy and robust. These models therefore provide a valuable opportunity to examine which methods are currently used to select variables and functional forms.
Method: A detailed study protocol was developed, including information on the objectives, eligibility criteria, the procedure for identifying a paper's primary model, and an overview of the data extraction process. On this basis, a structured questionnaire was designed to collect detailed information about the modelling strategy, and its suitability was confirmed in a pilot study. Both documents were approved by TG2 members and registered at the Open Science Framework (https://osf.io/2afuz/) prior to the review process. Articles that had been systematically identified by Wynants et al. [1] were re-reviewed by 20 independent statistical reviewers. Two reviewers extracted data from each article and resolved discrepancies by consensus. The main interest was in the modelling steps involved in selecting variables, interactions, and functional forms, but descriptive statistics, reported model features, and implementation details were also extracted.
Results: Data were extracted from 181 regression-based prediction models covering linear, logistic, and time-to-event models. Considerable variability in model selection approaches was observed, with researchers often combining multiple statistical methods. Unidimensional approaches were frequently combined with multidimensional techniques without a clear rationale. Variable selection during multivariable outcome modelling was commonly performed using p-values, backward elimination, or the Least Absolute Shrinkage and Selection Operator (LASSO). Interaction effects and non-linear relationships were rarely considered. Where non-linear relationships were modelled, splines or multivariable fractional polynomials were used. Confidence intervals for model coefficients were given in many papers, but the additional uncertainty introduced by model selection was generally ignored. Information on descriptive statistics was generally adequate, while the information on statistical modelling required to replicate the results was regularly insufficient. Analysis code that could clarify these aspects was almost never provided. Notably, the limited existing best practice recommendations for modelling were rarely referenced.
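As an illustration of the selection uncertainty mentioned above, the following minimal sketch refits a LASSO on bootstrap resamples and reports how often each candidate variable is selected. It is not taken from the reviewed papers or the review protocol: the simulated data, the penalty value `lam`, and the helper names (`lasso_cd`, `centre`, `inclusion_frequencies`) are assumptions made purely for illustration, and a real analysis would typically use an established library and tune the penalty.

```python
import random


def soft_threshold(z, lam):
    """Soft-thresholding operator used in the LASSO coordinate update."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def centre(X, y):
    """Centre every predictor column and the outcome (so no intercept is needed)."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - means[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    return Xc, yc


def lasso_cd(X, y, lam, n_iter=50):
    """Coordinate descent for (1/2n)*||y - Xb||^2 + lam*||b||_1 on centred data."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    resid = list(y)  # residuals y - Xb, with b = 0 at the start
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Covariance of column j with the partial residual (excluding j)
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * X[i][j]
                beta[j] = new
    return beta


def inclusion_frequencies(X, y, lam, n_boot=40, seed=1):
    """Share of bootstrap resamples in which each variable has a non-zero coefficient."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    counts = [0] * p
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        Xb, yb = centre([X[i] for i in idx], [y[i] for i in idx])
        beta = lasso_cd(Xb, yb, lam)
        for j in range(p):
            if beta[j] != 0.0:
                counts[j] += 1
    return [c / n_boot for c in counts]


# Simulated example (assumption): one strong, one weak, and three noise predictors.
data_rng = random.Random(0)
n_obs = 150
X = [[data_rng.gauss(0, 1) for _ in range(5)] for _ in range(n_obs)]
y = [1.0 * row[0] + 0.3 * row[1] + data_rng.gauss(0, 1) for row in X]

freqs = inclusion_frequencies(X, y, lam=0.15)
for j, f in enumerate(freqs):
    print(f"x{j}: selected in {f:.0%} of bootstrap resamples")
```

Inclusion frequencies well below 100% indicate that a variable's selection depends on the particular sample, which is the kind of instability that confidence intervals from a single selected model do not reflect.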
Discussion: The review shows a broad reliance on ad hoc modelling strategies that combine relatively simple, commonly used techniques with sub-optimal properties. This underlines the need to raise awareness of recommended modelling strategies and to increase training through tutorials and examples. To improve reproducibility, a stronger emphasis on sharing analysis code could be beneficial. Journals could play a crucial role in these aspects by promoting adherence to reporting guidelines.
In addition to a comprehensive summary of model selection techniques used in practice, the talk will cover examples of hard-to-comprehend descriptions of statistical methods that illustrate our experiences.
The authors declare that they have no competing interests.
The authors declare that an ethics committee vote is not required.



