70. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V.
Impact of Analytical Decisions and Extreme Data on the Results of scRNA-seq Analyses
Introduction: Bioinformatics analysis of scRNA-seq experiments consists of many steps, some for data processing and others that generate results for biological interpretation. For each step, several methods and software tools are available, and the characteristics of the individual tools are often not fully understood. Analysts therefore sometimes apply multiple approaches, obtain different results as a consequence, and may ultimately draw different biological conclusions. Furthermore, extreme data points (i.e. individual cells in the border area of their cluster) can affect the results as well. While current tools for scRNA-seq analysis only provide means to judge uncertainty in individual steps, there is no holistic approach to compare the robustness of results across different analysis pipelines or with respect to cells of unclear cluster membership.
Methods: We have developed a new analytical approach that helps analysts study the impact of analytical decisions (i.e. the choice of methods) and of extreme data points on the results of scRNA-seq analyses. Extreme data points can be individual cells that lie in the border area of their cell cluster and may not perfectly represent the particular cell type. As part of this approach, we present scTrimClust, a new computational method that enhances the reliability of scRNA-seq analyses by identifying cells with extreme expression profiles [1]. The method uses an alpha-concave hull boundary technique to distinguish well-defined core cells from those at cluster edges. A hull is built around each 2-dimensional cell cluster, and a cell's distance to the hull border indicates how well it fits the cluster. Researchers can then analyze the stability of findings (e.g. selected marker genes or differentially expressed genes) in the presence or absence of such ‘outlying’ cells.
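The hull-and-distance idea described above can be illustrated with a minimal Python sketch. This is not the scTrimClust implementation: for brevity it substitutes a plain convex hull for the alpha-concave hull, and the function names (`convex_hull`, `border_distances`, `trim_cluster`) and the `trim_fraction` parameter are hypothetical choices made for this example only.

```python
import math

def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order.
    NOTE: a convex hull is a simplified stand-in for the alpha-concave hull."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def dist_to_segment(p, a, b):
    """Euclidean distance from point p to the line segment a-b."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    seg_len2 = dx * dx + dy * dy
    if seg_len2 == 0:
        return math.hypot(p[0] - a[0], p[1] - a[1])
    t = ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / seg_len2
    t = max(0.0, min(1.0, t))  # clamp the projection onto the segment
    return math.hypot(p[0] - (a[0] + t * dx), p[1] - (a[1] + t * dy))

def border_distances(points):
    """Distance of every cell (2-D embedding coordinates) to the hull border."""
    hull = convex_hull(points)
    edges = list(zip(hull, hull[1:] + hull[:1]))
    return [min(dist_to_segment(p, a, b) for a, b in edges) for p in points]

def trim_cluster(points, trim_fraction=0.2):
    """Split one cluster into core cells and edge cells.
    The trim_fraction of cells closest to the border is labeled 'edge'."""
    d = border_distances(points)
    n_trim = int(round(trim_fraction * len(points)))
    order = sorted(range(len(points)), key=lambda i: d[i])
    edge_idx = set(order[:n_trim])
    core = [p for i, p in enumerate(points) if i not in edge_idx]
    edge = [p for i, p in enumerate(points) if i in edge_idx]
    return core, edge

# Toy cluster: four corner cells and four interior cells (made-up coordinates).
cluster = [(0, 0), (4, 0), (4, 4), (0, 4), (2, 2), (1, 2), (3, 2), (2, 1)]
core, edge = trim_cluster(cluster, trim_fraction=0.5)
```

In this toy cluster the four corner cells lie on the hull and are labeled as edge cells, while the interior cells form the core. A concave hull would follow non-convex cluster shapes more closely, which is the motivation the abstract gives for using it.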
Results: We validated scTrimClust on two publicly available datasets (Peripheral Blood Mononuclear Cells (PBMCs) and a COVID-19 dataset), demonstrating its effectiveness in evaluating cluster robustness. Our analysis shows that while some marker genes remain stable regardless of extreme cells, others are highly sensitive to the removal of these cells, highlighting the importance of careful data handling. Additionally, scTrimClust enables systematic comparisons of clustering and normalization parameters as well as other methodological choices.
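One simple way to quantify how sensitive a marker gene list is to the removal of edge cells is to compare the marker sets obtained with and without those cells. The sketch below is an illustration under stated assumptions, not the evaluation used by the authors: the Jaccard index is chosen here as one possible stability measure, the function names are hypothetical, and the gene names are placeholders rather than real results.

```python
def jaccard(a, b):
    """Jaccard index of two collections, treated as sets (1.0 if both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def compare_markers(markers_all, markers_core):
    """Compare marker genes found with all cells vs. with core cells only."""
    all_set, core_set = set(markers_all), set(markers_core)
    return {
        "stable": sorted(all_set & core_set),                 # found in both runs
        "lost_after_trimming": sorted(all_set - core_set),    # only with edge cells
        "gained_after_trimming": sorted(core_set - all_set),  # only without them
        "jaccard": jaccard(all_set, core_set),
    }

# Placeholder gene names, for illustration only.
markers_all = ["geneA", "geneB", "geneC", "geneD"]
markers_core = ["geneA", "geneB", "geneE"]
report = compare_markers(markers_all, markers_core)
```

A Jaccard index close to 1 would indicate that the marker list is largely unaffected by trimming, while a low value flags a cluster whose markers depend on its border cells. The same comparison could be repeated across clustering or normalization settings.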
Conclusion: Our approach offers a valuable strategy for refining single-cell analyses and assessing the robustness of their results. By using concave hulls, it overcomes the limitation of most outlier detection methods, which rely on convex distributions. The functionality of scTrimClust is integrated into the R package ‘RepeatedHighDim’. Simulation studies are planned to further evaluate the method under controlled settings.
The authors declare that they have no competing interests.
The authors declare that an ethics committee vote is not required.



