
70. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V.

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V. (GMDS)
07.-11.09.2025
Jena


Meeting Abstract

Enhancing Transparency in Research: Integrating Initial Data Analysis into Statistical Analysis Plans

Carsten Oliver Schmidt 1
Georg Heinze 2
Lara Lusa 3
Marianne Huebner 4
1 Universität Greifswald, Greifswald, Germany
2 Medical University of Vienna, Vienna, Austria
3 Faculty of Medicine, Institute of Biostatistics and Medical Informatics, University of Ljubljana, Ljubljana, Slovenia
4 Department of Statistics and Probability, Michigan State University, East Lansing, United States

Text

Introduction: Statistical Analysis Plans (SAPs) are fundamental to ensuring transparency, reproducibility, and methodological rigour in statistical research. However, the critical phase of Initial Data Analysis (IDA), in which the suitability of the data for subsequent analyses is assessed, remains insufficiently represented in SAPs and is often poorly reported in scientific publications [1]. IDA is complex: it encompasses detailed data quality assessments, evaluation of missing data, exploration of univariate and multivariate properties, and many preprocessing steps [2], [3], [4]. There is therefore a strong need to formalise its planning and documentation within SAPs.

Methods: To address this gap, the TG3 “Initial Data Analysis” working group within the STRATOS (STRengthening Analytical Thinking for Observational Studies) initiative developed an extension to the conventional SAP: the Statistical Analysis Plan for Initial Data Analysis (SAPI). A Delphi-based consensus process involving experienced statisticians from all STRATOS topic groups was conducted over several rounds to identify, refine, and integrate key components of IDA into a structured plan aligned with best practices for main data analysis (MDA).

Results: The resulting SAPI is organised into eight comprehensive sections:

  1. Administrative Information, encompassing project documentation, ethical approvals, and team contacts;
  2. Project Background, outlining research aims and target populations;
  3. Observation Units, detailing data sources, sampling, and dataset descriptions;
  4. Variables, specifying variables for the main and initial data analysis;
  5. Methods for Main Data Analysis (MDA), covering the description of observation units, model specifications, assumptions, sample size requirements, and planned sensitivity analyses;
  6. Methods for Initial Data Analysis (IDA), providing detailed guidance on data preparation, assessment of unit and item missingness, and univariable and multivariable data descriptions;
  7. Evaluation and Updates, addressing procedures for revising the SAPI following IDA; and
  8. Supplementary Information, including key references and provisions for the sustainable handling of analysis outputs.
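
As an illustration of the missingness assessment named in section 6, the following is a minimal sketch and not part of the SAPI itself. It assumes that a unit counts as unit-missing when all study variables are absent and that item missingness is computed among the remaining units; the toy records, variable names, and function names are hypothetical.

```python
# Hypothetical toy data: None marks a missing value.
records = [
    {"id": 1, "age": 54, "sbp": 132.0, "smoker": True},
    {"id": 2, "age": None, "sbp": 118.0, "smoker": False},
    {"id": 3, "age": None, "sbp": None, "smoker": None},
    {"id": 4, "age": 61, "sbp": None, "smoker": False},
]
variables = ["age", "sbp", "smoker"]


def is_unit_missing(record, variables):
    """Assumed definition: a unit is missing if it has no value for any study variable."""
    return all(record[v] is None for v in variables)


def missingness_summary(records, variables):
    """Return (unit missingness rate, per-variable item missingness rates).

    Item missingness is computed among units that are not unit-missing.
    """
    if not records:
        return 0.0, {v: 0.0 for v in variables}
    responders = [r for r in records if not is_unit_missing(r, variables)]
    unit_rate = 1 - len(responders) / len(records)
    if not responders:
        return unit_rate, {v: 0.0 for v in variables}
    item_rates = {
        v: sum(r[v] is None for r in responders) / len(responders)
        for v in variables
    }
    return unit_rate, item_rates


unit_rate, item_rates = missingness_summary(records, variables)
print(f"unit missingness: {unit_rate:.2f}")
for name, rate in item_rates.items():
    print(f"item missingness for {name}: {rate:.2f}")
```

In a real plan, the definitions of unit and item missingness and the thresholds that trigger action would be fixed in advance in the IDA section, not chosen after inspecting the data.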

Within the SAPI structure, the MDA is specified prior to the IDA, as the former defines the scope and nature of required data checks.
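
The section order can be captured in a simple machine-checkable form. The sketch below is only an assumption about how a team might validate a draft plan; the section titles are taken from the list above, while the function names and the validation logic are hypothetical and not part of the SAPI.

```python
# Section titles as listed in the abstract, in their stated order.
SAPI_SECTIONS = (
    "Administrative Information",
    "Project Background",
    "Observation Units",
    "Variables",
    "Methods for Main Data Analysis (MDA)",
    "Methods for Initial Data Analysis (IDA)",
    "Evaluation and Updates",
    "Supplementary Information",
)


def missing_sections(draft_sections):
    """Return the SAPI sections that a draft plan does not yet contain."""
    return [s for s in SAPI_SECTIONS if s not in draft_sections]


def mda_precedes_ida(draft_sections):
    """Check that the MDA section appears before the IDA section in a draft."""
    mda = "Methods for Main Data Analysis (MDA)"
    ida = "Methods for Initial Data Analysis (IDA)"
    if mda not in draft_sections or ida not in draft_sections:
        return False
    return draft_sections.index(mda) < draft_sections.index(ida)


# Hypothetical draft that lists IDA but has not yet specified the MDA.
draft = ["Administrative Information", "Variables",
         "Methods for Initial Data Analysis (IDA)"]
print(missing_sections(draft))
print(mda_precedes_ida(draft))
```

Such a check only verifies that the sections are present and ordered; whether the IDA checks actually match the needs of the MDA remains a matter of statistical judgement.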

Discussion: Formalising IDA within a dedicated SAPI enhances transparency, strengthens reproducibility, and promotes a more structured approach to data assessment prior to main analyses. Planning and reporting IDA activities with rigour comparable to that of the MDA ensures that critical preparatory steps are visible and reproducible. The SAPI is currently being tested in diverse studies to identify where improvements are needed.

Conclusion: The SAPI framework addresses a significant deficiency in traditional SAP practices by systematically integrating Initial Data Analysis. It explicitly adopts a much broader perspective on the entire data lifecycle than a conventional SAP. Through structured planning and transparent reporting, it contributes to improving the quality, reproducibility, and credibility of empirical research, aligning with wider scientific efforts to foster rigorous and transparent methodology.

The authors declare that they have no competing interests.

The authors declare that an ethics committee vote is not required.


References

[1] Huebner M, Vach W, le Cessie S, Schmidt CO, Lusa L; Topic Group “Initial Data Analysis” of the STRATOS Initiative (STRengthening Analytical Thinking for Observational Studies). Hidden analyses: a review of reporting practice and recommendations for more transparent reporting of initial data analyses. BMC Med Res Methodol. 2020;20(1):61.
[2] Heinze G, Baillie M, Lusa L, Sauerbrei W, Schmidt CO, Harrell FE, et al. Regression without regrets - initial data analysis is a prerequisite for multivariable regression. BMC Med Res Methodol. 2024;24(1):178.
[3] Lusa L, Proust-Lima C, Schmidt CO, Lee KJ, le Cessie S, Baillie M, et al. Initial data analysis for longitudinal studies to build a solid foundation for reproducible analysis. PLoS ONE. 2024;19(5):e0295726.
[4] Schmidt CO, Struckmann S, Enzenbach C, Reineke A, Stausberg J, Damerow S, et al. Facilitating harmonized data quality assessments. A data quality framework for observational health research data collections with software implementations in R. BMC Med Res Methodol. 2021;21(1):63.