70. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V.
Oblique splits in artificial representative trees for random forests
Introduction: Artificial representative trees (ARTs) are an interpretation method for random forests in which a single binary decision tree is generated as a surrogate model to be interpreted instead of the whole ensemble [1]. While random forests, as ensembles of binary decision trees, can approximate linear relationships quite well, ARTs must approximate them with numerous axis-parallel binary splits, leading to excessive complexity and difficulty in interpretation; for example, a decision boundary such as x1 + x2 > 0 can only be approximated by a staircase of many axis-parallel splits, but is captured exactly by a single oblique split. We therefore propose the alternative method of oblique ARTs (oARTs), which use oblique splits (i.e., splits on linear combinations of predictor variables) to approximate linear relationships more accurately, following the concept of oblique random forests [2].
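As a brief, self-contained illustration of this motivation (not part of the abstract itself), the following R snippet shows that an axis-parallel CART tree needs several splits to approximate the oblique boundary x1 + x2 > 0, whereas a single split on the synthetic variable z = x1 + x2 suffices:

```r
# Illustrative sketch (our own toy example, not from the abstract):
# axis-parallel vs. oblique splits for a diagonal decision boundary.
set.seed(1)
n  <- 1000
x1 <- runif(n, -1, 1)
x2 <- runif(n, -1, 1)
y  <- factor(x1 + x2 > 0)
dat <- data.frame(x1, x2, y)

library(rpart)

# Axis-parallel tree: approximates the diagonal boundary with a staircase.
fit_axis <- rpart(y ~ x1 + x2, data = dat)
sum(fit_axis$frame$var != "<leaf>")  # number of internal splits (> 1)

# With the synthetic variable z = x1 + x2, one split suffices.
dat$z <- x1 + x2
fit_obl <- rpart(y ~ z, data = dat)
sum(fit_obl$frame$var != "<leaf>")   # typically 1
```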
Recent work has suggested the use of tree-based surrogates such as most representative trees (MRTs) [3], [4] or small ensembles of trees [5], [6]. Moreover, oblique surrogate trees, which use linear combinations of variables at their splits, have shown superior fidelity [7]. With oARTs, we build on these advances by combining variables based on linear discriminant analysis (LDA) and constructing a single interpretable oblique tree.
Methods: In oARTs, we first use LDA to identify and prioritize linear combinations of variables that are important for the prediction task. Subsequently, we extend the training data set with synthetic variables derived from the LDA results. Finally, an ART is generated on the extended data as previously described [1].
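The augmentation step can be sketched in a few lines of R. This is a minimal, illustrative sketch under our own assumptions (toy data, MASS::lda() and ranger); the actual ART generation is performed by the timbR package, whose function interface is not reproduced here:

```r
library(MASS)    # lda()
library(ranger)  # random forests

# Toy training data (assumption for illustration only).
set.seed(42)
train <- data.frame(x1 = rnorm(200), x2 = rnorm(200), x3 = rnorm(200))
train$y <- factor(train$x1 + 0.5 * train$x2 > 0)

# Step 1: fit an LDA to obtain linear combinations (discriminant
# directions) of the predictors that separate the outcome classes.
lda_fit <- lda(y ~ ., data = train)

# Step 2: project the predictors onto the discriminant directions and
# append the resulting synthetic variables to the training data.
ld_scores <- predict(lda_fit, train)$x   # columns LD1, LD2, ...
train_ext <- cbind(train, ld_scores)

# Step 3: train the forest on the original data; an ART is then
# generated on the extended data via timbR (exact call omitted here,
# see the package documentation).
rf <- ranger(y ~ ., data = train, num.trees = 500)
```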
In an extensive simulation study, we generated separate training and test data sets for three scenarios covering linear and non-linear relationships between predictors and outcome. We compared the new method of oARTs with classical ARTs and MRTs. Our main performance measures were the prediction accuracy on new data (accuracy), the agreement of the surrogate's predictions with the original forest's predictions (fidelity), the fractions of included effect and noise variables (coverage), run time, and the size of the resulting models.
Results: With regard to fidelity, classical ARTs perform best in settings where only a few effect variables influence the outcome; in settings with more effect variables, oARTs achieve better fidelity. Concerning accuracy, oARTs perform best among the compared surrogates and come very close to, and in some cases even exceed, the performance of the original random forest. Because oARTs can include multiple variables in each split, they also show the highest fraction of included effect variables while keeping the fraction of included noise variables to a minimum. Finally, they lead to the smallest models, with a run time similar to that of classical ARTs.
Discussion: Our new method of oARTs is superior to ARTs and MRTs on data sets with linear relationships and yields comparable results in the absence of any linear relationships. An implementation of oARTs is available in our R package timbR (https://github.com/imbs-hl/timbR).
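For completeness, a minimal installation sketch (assuming the remotes package is available; the oART functions themselves are documented in the package):

```r
# Install the development version of timbR from GitHub.
remotes::install_github("imbs-hl/timbR")
library(timbR)
```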
The authors declare that they have no competing interests.
The authors declare that an ethics committee vote is not required.
Parts of this work have been presented before at the 7th Joint Statistical Meeting of the Deutsche Arbeitsgemeinschaft Statistik (DAGStat 2025) in Berlin.
References
[1] Laabs BH, Kronziel LL, König IR, Szymczak S. Construction of Artificial Most Representative Trees by Minimizing Tree-Based Distance Measures. In: Longo L, Lapuschkin S, Seifert C, editors. Explainable Artificial Intelligence. Cham: Springer Nature Switzerland; 2024. p. 290–310. (Communications in Computer and Information Science; 2154). DOI: 10.1007/978-3-031-63797-1_15
[2] Menze BH, Kelm BM, Splitthoff DN, Koethe U, Hamprecht FA. On Oblique Random Forests. In: Gunopulos D, Hofmann T, Malerba D, Vazirgiannis M, editors. Machine Learning and Knowledge Discovery in Databases. Berlin, Heidelberg: Springer Berlin Heidelberg; 2011. p. 453–69. (Lecture Notes in Computer Science; 6912). DOI: 10.1007/978-3-642-23783-6_29
[3] Banerjee M, Ding Y, Noone AM. Identifying representative trees from ensembles. Stat Med. 2012;31(15):1601–16.
[4] Laabs BH, Westenberger A, König IR. Identification of representative trees in random forests based on a new tree-based distance measure. Adv Data Anal Classif. 2023. DOI: 10.1007/s11634-023-00537-7
[5] Szepannek G, von Holt BH. Can’t see the forest for the trees: Analyzing groves to explain random forests. Behaviormetrika. 2023. DOI: 10.1007/s41237-023-00205-2
[6] Sies A, Van Mechelen I. C443: a Methodology to See a Forest for the Trees. J Classif. 2020;37(3):730–53. DOI: 10.1007/s00357-019-09350-4
[7] Li H, Xu J, Armstrong WW. LHT: Statistically-Driven Oblique Decision Trees for Interpretable Classification [Preprint]. arXiv. 2025. DOI: 10.48550/ARXIV.2505.04139



