
70. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V.

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V. (GMDS)
07.-11.09.2025
Jena


Meeting Abstract

Real-world benchmarking of statistical software for feature selection in longitudinal biomedical data

Alexander Gieswinkel 1,2,3
Gregor Buch 1,3
Vincent ten Cate 1,3,4
Gökhan Gül 1,4
Lisa Hartung 2
Philipp Wild 1,3,4,5
1Preventive Cardiology and Preventive Medicine, Department of Cardiology, University Medical Center of the Johannes Gutenberg University Mainz, Mainz, Germany
2Institute of Mathematics, Johannes Gutenberg University Mainz, Mainz, Germany
3German Center for Cardiovascular Research (DZHK), partner site Rhine Main, Mainz, Germany
4Clinical Epidemiology and Systems Medicine, Center for Thrombosis and Hemostasis, University Medical Center of the Johannes Gutenberg University Mainz, Mainz, Germany
5Institute of Molecular Biology (IMB), Mainz, Germany


Introduction: Recent advances in biochemical technology have made high-dimensional omics data increasingly available at multiple time points in prospective cohort studies. Supervised feature selection is often required in these settings to overcome dimensionality problems and to achieve biomedical interpretability. To date, an overview of existing methods for this setting is lacking, which motivates a systematic review and evaluation of the available software.

Methods: A systematic search of statistical software was conducted to identify appropriate methods. The Comprehensive R Archive Network (CRAN) was searched via the R package packagefinder using a query containing relevant keywords [1]. Eligible software was identified by manually screening the package descriptions and by computational testing with a fixed application example. The inter- and intra-class correlation structure of longitudinal proteomic data was analysed to generate synthetic data for a simulation study designed according to the ADEMP framework, so that the findings would transfer to real-world cohort data [2], [3]. Feature selection performance of the identified methods was evaluated in real-world scenarios with varying sample sizes, total numbers of predictors and of true positives, numbers of time points and signal-to-noise ratios. For a fair comparison, only frequentist implementations [4] with their given default settings were included in the Monte Carlo simulations. The estimated true positive rate (eTPR) and the estimated false discovery rate (eFDR) were chosen as the targeted performance measures.
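As an illustration of the two performance measures, the following minimal Python sketch averages the true positive rate and the false discovery rate over Monte Carlo replicates. It is not the authors' implementation: the function names `selection_rates` and `estimate_rates`, the toy data, and the convention that the FDR is 0 when no feature is selected are all assumptions made for this example.

```python
from statistics import mean


def selection_rates(selected, truth):
    """Return (TPR, FDR) for one replicate, given collections of feature indices."""
    selected, truth = set(selected), set(truth)
    if not truth:
        raise ValueError("truth must contain at least one true predictor")
    true_pos = len(selected & truth)
    tpr = true_pos / len(truth)
    # Assumed convention: FDR is 0 when nothing is selected.
    fdr = (len(selected) - true_pos) / len(selected) if selected else 0.0
    return tpr, fdr


def estimate_rates(replicates, truth):
    """Average TPR and FDR over Monte Carlo replicates, giving (eTPR, eFDR)."""
    rates = [selection_rates(sel, truth) for sel in replicates]
    etpr = mean(r[0] for r in rates)
    efdr = mean(r[1] for r in rates)
    return etpr, efdr


# Hypothetical toy example: 4 true predictors, 3 simulated replicates.
truth = {0, 1, 2, 3}
replicates = [
    {0, 1, 2, 3},  # all true predictors found, no false selections
    {0, 1, 2, 7},  # three true predictors, one false selection
    {0, 1},        # two true predictors, no false selections
]
etpr, efdr = estimate_rates(replicates, truth)
print(f"eTPR = {etpr:.3f}, eFDR = {efdr:.3f}")
```

Averaging per-replicate rates is one common way to estimate these quantities; other conventions for empty selections would change the eFDR slightly.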

Results: Of 21,528 accessible packages on CRAN (status: June 2024), the search query extracted 324 packages with matching keywords in their descriptions. Screening of the descriptions identified 45 packages, which were then tested in R. Packages were excluded because they targeted inappropriate settings (N=11), lacked variable selection (N=5), were not applicable to the predefined test data (N=4) or for other reasons (N=11). Of the remaining 14 methods, six were based on mixed-effects models (buildmer, rpql, splmm, alqrfe, plsmmLasso, glmmLasso), five on generalized estimating equations (sgee, LassoGEE, geeVerse, PGEE, pgee.mixed), two on Bayesian frameworks (sparsereg, spikeSlabGAM) and one on time series modelling (midasml). All implementations were able to process continuous outcomes, while only four supported binary outcomes. A total of N=8 frequentist methods with 'ready-to-use' default settings were considered in the simulation study.

The packages buildmer and plsmmLasso consistently achieved an eTPR above 80% while keeping the eFDR below 20% across the signal-to-noise settings considered. All other methods performed worse when both performance metrics were considered jointly. splmm achieved a similar eFDR but a lower eTPR, whereas geeVerse showed the opposite pattern.

Discussion: The majority of the available statistical software is based on frequentist techniques, while Bayesian procedures represent a minority. Alternative concepts such as tree-based methods are notably absent. There was no evidence that modern selection techniques such as regularized regression (plsmmLasso) are superior to traditional approaches such as stepwise regression (buildmer) for feature selection in longitudinal data.

Future analyses will include Bayesian methods in the simulation study and provide a more comprehensive examination of the results.

Conclusion: A variety of statistical software is available for supervised feature selection in longitudinal biomedical data. Among these, methods based on mixed-effects models appear to outperform generalized estimating equations.

The authors declare that they have no competing interests.

The authors declare that an ethics committee vote is not required.

The contribution has already been presented as a poster at the 46th Annual Conference of the International Society for Clinical Biostatistics (ISCB): Systematic review and real life-oriented evaluation on methods for feature selection in longitudinal biomedical data


References

[1] Buch G, Wild PS. Investigating selection strategies for identifying biometrical techniques: a case study on group variable selection methods in R. In: 68. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS). Heilbronn, 17.-21.09.2023. Düsseldorf: German Medical Science GMS Publishing House; 2023. DocAbstr. 303. DOI: 10.3205/23gmds081
[2] Morris TP, White IR, Crowther MJ. Using simulation studies to evaluate statistical methods. Stat Med. 2019;38(11):2074-102.
[3] Schulz A, Zoller D, Nickels S, Beutel ME, Blettner M, Wild PS, et al. Simulation of complex data structures for planning of studies with focus on biomarker comparison. BMC Med Res Methodol. 2017;17(1):90.
[4] Fitzmaurice GM, Laird NM, Ware JH. Applied longitudinal analysis. Second Edition. Hoboken, NJ: Wiley-Interscience; 2011.