70. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V.

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V. (GMDS)
07.-11.09.2025
Jena


Meeting Abstract

OverlapES: An R Package, and Accompanying R Shiny Application for Identifying and Quantifying Sample Overlap in Evidence Synthesis

Zhentian Zhang 1
Tim Friede 1
Tim Mathes 1
1Department of Medical Statistics, University Medical Center Göttingen, Göttingen, Germany

Text

Introduction: In medical research, evidence synthesis often requires combining findings from multiple observational studies. A common challenge in this process is the potential overlap of samples across studies, especially when studies draw on existing databases such as registries [1]. Such overlaps can bias meta-analysis results and undermine the credibility of their conclusions. Addressing sample overlap is therefore crucial for improving the validity of synthesized evidence.

State of the art: Current methods for handling sample overlap are primarily ad hoc solutions that rely on access to individual-level data or unique identifiers, which are frequently unavailable due to privacy concerns or data regulation policies [2]. Other approaches correct meta-analysis results only in very specific settings [3] or assume that the overlapping parts are known [4], which severely limits their applicability. Currently, there are no practical tools for estimating sample overlap when only aggregate data are available.

Concept: To narrow this gap, we developed overlapES, an R package accompanied by an R Shiny web application. These tools implement a novel method grounded in set theory that infers sample overlap from the ranges of commonly reported study-sample characteristics, such as the location and time of data generation and patient characteristics. This approach enables the identification of potential overlaps without requiring individual-level data.
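As a rough illustration of the range-based idea, the following Python sketch flags two studies as potentially overlapping only if their reported ranges intersect on every characteristic. This is a simplified, hypothetical example and not the overlapES method or API: the `Study` class, its fields, the example values, and the yes/no decision rule are all assumptions made for illustration, whereas the package is described as quantifying risks of overlap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Study:
    # Hypothetical aggregate description of a study sample.
    name: str
    regions: frozenset   # regions or registries the sample was drawn from
    period: tuple        # (first year, last year) of data generation
    age_range: tuple     # (minimum age, maximum age) of participants


def intervals_intersect(a, b):
    """Return True if the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


def may_overlap(s, t):
    """Two samples can share participants only if the ranges of ALL
    characteristics intersect; one disjoint characteristic rules overlap out."""
    return (
        bool(s.regions & t.regions)
        and intervals_intersect(s.period, t.period)
        and intervals_intersect(s.age_range, t.age_range)
    )


# Invented example studies (not taken from the abstract).
a = Study("A", frozenset({"DE"}), (2010, 2015), (18, 65))
b = Study("B", frozenset({"DE", "FR"}), (2014, 2018), (40, 80))
c = Study("C", frozenset({"DE"}), (2016, 2019), (18, 65))

print(may_overlap(a, b))  # periods, ages and regions all intersect
print(may_overlap(a, c))  # data generation periods are disjoint
```

Treating each study as a set defined by its ranges means that a single disjoint characteristic is enough to exclude overlap, which is why only aggregate information is needed.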

Implementation: The R package overlapES provides functions for calculating the risks of overlap, visualizing these risks, and finding the overlap-free set of studies with the largest sample size. The R Shiny web application includes similar functions and additionally provides an intuitive interface, allowing users to apply the method without extensive programming knowledge. We designed both tools to improve the accessibility of the method and to promote broader adoption in the research community.
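The task of finding the overlap-free set of studies with the largest sample size can be sketched as follows. This Python example is an assumption-laden illustration, not the algorithm implemented in overlapES: the function name `largest_overlap_free_set`, the input format, and the exhaustive search are hypothetical choices, and the study names and sample sizes are invented.

```python
from itertools import combinations


def largest_overlap_free_set(sizes, conflicts):
    """Exhaustively search for the subset of studies with the largest total
    sample size in which no two studies are flagged as overlapping.

    sizes:     dict mapping study name -> sample size
    conflicts: iterable of (name, name) pairs flagged as potentially overlapping
    """
    conflict_set = {frozenset(pair) for pair in conflicts}
    names = sorted(sizes)
    best, best_total = (), 0
    for k in range(1, len(names) + 1):
        for subset in combinations(names, k):
            # Skip subsets that contain at least one flagged pair.
            if any(frozenset(p) in conflict_set for p in combinations(subset, 2)):
                continue
            total = sum(sizes[s] for s in subset)
            if total > best_total:
                best, best_total = subset, total
    return set(best), best_total


# Invented example: four studies and three flagged pairs.
sizes = {"A": 1200, "B": 800, "C": 500, "D": 1000}
conflicts = [("A", "B"), ("B", "C"), ("A", "D")]

selected, total = largest_overlap_free_set(sizes, conflicts)
print(sorted(selected), total)  # ['B', 'D'] 1800
```

The exhaustive search checks every subset and is therefore only practical for a small number of studies; it is used here because it is easy to verify, not because the package is known to work this way.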

Lessons learned: Applying overlapES to practical examples proved useful for detecting potential sample overlaps. The tools make it easy to apply a standardized solution for describing and estimating the extent of overlap between studies, and they provide an intuitive way to address it. Further development will focus on improving the robustness and flexibility of the algorithms, incorporating additional functions, and expanding applicability to other research domains.

The authors declare that they have no competing interests.

The authors declare that an ethics committee vote is not required.


References

[1] Mathes T, Jacobs A, Pieper D. Systematic reviews and meta-analyses that include registry-based studies: methodological challenges and areas for future research. Journal of Clinical Epidemiology. 2023;156:119-122. DOI: 10.1016/j.jclinepi.2023.02.014
[2] Hussein H, Siddiqi K, Hossain FN, Sheikh A. Double-counting of populations in evidence synthesis in public health: a call for awareness and future methodological development. BMC Public Health. 2022;22:1827. DOI: 10.1186/s12889-022-14213-6
[3] Jin Q, Shi G. Meta-analysis of SNP-environment interaction with overlapping data. Frontiers in Genetics. 2020;10:1400. DOI: 10.3389/fgene.2019.01400
[4] Lin DY, Sullivan PF. Meta-analysis of genome-wide association studies with overlapping subjects. American Journal of Human Genetics. 2009;85(6):862-872. DOI: 10.1016/j.ajhg.2009.11.001