<?xml version="1.0" encoding="iso-8859-1" standalone="no"?>
<!DOCTYPE GmsArticle SYSTEM "http://www.egms.de/dtd/2.0.34/GmsArticle.dtd">
<GmsArticle xmlns:xlink="http://www.w3.org/1999/xlink">
  <MetaData>
    <Identifier>25gmds068</Identifier>
    <IdentifierDoi>10.3205/25gmds068</IdentifierDoi>
    <IdentifierUrn>urn:nbn:de:0183-25gmds0689</IdentifierUrn>
    <ArticleType>Meeting Abstract</ArticleType>
    <TitleGroup>
      <Title language="en">Oblique splits in artificial representative trees for random forests</Title>
    </TitleGroup>
    <CreatorList>
      <Creator>
        <PersonNames>
          <Lastname>Laabs</Lastname>
          <LastnameHeading>Laabs</LastnameHeading>
          <Firstname>Bj&#246;rn-Hergen</Firstname>
          <Initials>BH</Initials>
        </PersonNames>
        <Address>
          <Affiliation>Institute for Medical Biometry and Statistics, University of L&#252;beck, L&#252;beck, Germany</Affiliation>
        </Address>
        <Creatorrole corresponding="no" presenting="no">author</Creatorrole>
      </Creator>
      <Creator>
        <PersonNames>
          <Lastname>Bakare</Lastname>
          <LastnameHeading>Bakare</LastnameHeading>
          <Firstname>Janet</Firstname>
          <Initials>J</Initials>
        </PersonNames>
        <Address>
          <Affiliation>Institute for Medical Biometry and Statistics, University of L&#252;beck, L&#252;beck, Germany</Affiliation>
        </Address>
        <Creatorrole corresponding="no" presenting="no">author</Creatorrole>
      </Creator>
      <Creator>
        <PersonNames>
          <Lastname>Szymczak</Lastname>
          <LastnameHeading>Szymczak</LastnameHeading>
          <Firstname>Silke</Firstname>
          <Initials>S</Initials>
        </PersonNames>
        <Address>
          <Affiliation>Institute for Medical Biometry and Statistics, University of L&#252;beck, L&#252;beck, Germany</Affiliation>
        </Address>
        <Creatorrole corresponding="no" presenting="no">author</Creatorrole>
      </Creator>
    </CreatorList>
    <PublisherList>
      <Publisher>
        <Corporation>
          <Corporatename>German Medical Science GMS Publishing House</Corporatename>
        </Corporation>
        <Address>D&#252;sseldorf</Address>
      </Publisher>
    </PublisherList>
    <SubjectGroup>
      <SubjectheadingDDB>610</SubjectheadingDDB>
      <Keyword language="en">random forest</Keyword>
      <Keyword language="en">artificial representative trees</Keyword>
      <Keyword language="en">oblique random forest</Keyword>
      <Keyword language="en">explainable artificial intelligence</Keyword>
      <Keyword language="en">most representative trees</Keyword>
    </SubjectGroup>
    <DatePublishedList>
      <DatePublished>20251103</DatePublished>
    </DatePublishedList>
    <Language>engl</Language>
    <License license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/4.0/">
      <AltText language="en">This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License.</AltText>
      <AltText language="de">Dieser Artikel ist ein Open-Access-Artikel und steht unter den Lizenzbedingungen der Creative Commons Attribution 4.0 License (Namensnennung).</AltText>
    </License>
    <SourceGroup>
      <Meeting>
        <MeetingId>M0631</MeetingId>
        <MeetingSequence>068</MeetingSequence>
        <MeetingCorporation>Deutsche Gesellschaft f&#252;r Medizinische Informatik, Biometrie und Epidemiologie</MeetingCorporation>
        <MeetingName>70. Jahrestagung der Deutschen Gesellschaft f&#252;r Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS)</MeetingName>
        <MeetingTitle></MeetingTitle>
        <MeetingSession>V: Machine learning and AI applications 1</MeetingSession>
        <MeetingCity>Jena</MeetingCity>
        <MeetingDate>
          <DateFrom>20250907</DateFrom>
          <DateTo>20250911</DateTo>
        </MeetingDate>
      </Meeting>
    </SourceGroup>
    <ArticleNo>Abstr. 352</ArticleNo>
  </MetaData>
  <OrigData>
    <TextBlock name="Text" linked="yes">
      <MainHeadline>Text</MainHeadline><Pgraph><Mark1>Introduction:</Mark1> Artificial representative trees (ARTs) are an interpretation method for random forests in which a single binary decision tree is generated as a surrogate model to be interpreted instead of the whole ensemble <TextLink reference="1"></TextLink>. While random forests, as ensembles of binary decision trees, can approximate linear relationships quite well, ARTs face challenges because they require numerous binary splits, leading to excessive complexity and difficulty in interpretation. We therefore propose the alternative method of oblique ARTs (oARTs), which uses oblique splits (i.e., linear combinations of predictor variables) to approximate linear relationships more accurately, following the concept of oblique random forests <TextLink reference="2"></TextLink>.</Pgraph><Pgraph>Recent works have suggested the use of tree-based surrogates such as most representative trees (MRTs) <TextLink reference="3"></TextLink>, <TextLink reference="4"></TextLink> or small ensembles of trees <TextLink reference="5"></TextLink>, <TextLink reference="6"></TextLink>. Moreover, oblique surrogate trees have shown superior fidelity by using linear combinations at splits <TextLink reference="7"></TextLink>. With oARTs, we build on these advances by combining variables based on linear discriminant analysis (LDA) and constructing an interpretable oblique tree.</Pgraph><Pgraph><Mark1>Methods:</Mark1> In oARTs, we first use LDA to identify and prioritize linear combinations of variables that are important for the prediction task. Subsequently, we extend the training data set for the generation of an artificial representative tree by synthetic variables based on the LDA results. Finally, an ART is generated as previously described.</Pgraph><Pgraph>In an extensive simulation study, we generated separate training and testing data sets for three different scenarios covering linear and non-linear relationships with the outcome. 
We compared the new method of oARTs with classical ARTs and MRTs. Our main performance measures were the prediction accuracy on new data (accuracy), the similarity of predictions to those of the original forest (fidelity), the fraction of included effect and noise variables (coverage), run time, and the size of the resulting models. </Pgraph><Pgraph><Mark1>Results:</Mark1> With regard to fidelity, classical ARTs perform best in settings where only a few effect variables influence the outcome. In settings with more effect variables, oARTs achieve better fidelity. Concerning accuracy, oARTs perform best, coming very close to and in some cases even exceeding the performance of the original random forest. Given that oARTs can include multiple variables in each split, they also show the highest fraction of included effect variables while keeping the fraction of included noise variables to a minimum. Finally, they lead to the smallest models, with run times similar to those of classical ARTs.</Pgraph><Pgraph><Mark1>Discussion:</Mark1> Our new method of oARTs is superior to ARTs and MRTs in data sets with linear relationships and leads to comparable results in the absence of any linear dependencies between predictor variables. An implementation of oARTs is available in our R package timbR (<Hyperlink href="https:&#47;&#47;github.com&#47;imbs-hl&#47;timbR">https:&#47;&#47;github.com&#47;imbs-hl&#47;timbR</Hyperlink>).</Pgraph><Pgraph>The authors declare that they have no competing interests.</Pgraph><Pgraph>The authors declare that an ethics committee vote is not required.</Pgraph><Pgraph>Parts of this work have been presented before at the 7th Joint Statistical Meeting of the Deutsche Arbeitsgemeinschaft Statistik (DAGStat 2025) in Berlin.</Pgraph></TextBlock>
    <References linked="yes">
      <Reference refNo="1">
        <RefAuthor>Laabs BH</RefAuthor>
        <RefAuthor>Kronziel LL</RefAuthor>
        <RefAuthor>K&#246;nig IR</RefAuthor>
        <RefAuthor>Szymczak S</RefAuthor>
        <RefTitle>Construction of Artificial Most Representative Trees by Minimizing Tree-Based Distance Measures</RefTitle>
        <RefYear>2024</RefYear>
        <RefJournal>Explainable Artificial Intelligence</RefJournal>
        <RefPage>290&#8211;310</RefPage>
        <RefTotal>Laabs BH, Kronziel LL, K&#246;nig IR, Szymczak S. Construction of Artificial Most Representative Trees by Minimizing Tree-Based Distance Measures. In: Longo L, Lapuschkin S, Seifert C, editors. Explainable Artificial Intelligence. Cham: Springer Nature Switzerland; 2024. p. 290&#8211;310. (Communications in Computer and Information Science; 2154). DOI: 10.1007&#47;978-3-031-63797-1&#95;15</RefTotal>
        <RefLink>http:&#47;&#47;dx.doi.org&#47;10.1007&#47;978-3-031-63797-1&#95;15</RefLink>
      </Reference>
      <Reference refNo="2">
        <RefAuthor>Menze BH</RefAuthor>
        <RefAuthor>Kelm BM</RefAuthor>
        <RefAuthor>Splitthoff DN</RefAuthor>
        <RefAuthor>Koethe U</RefAuthor>
        <RefAuthor>Hamprecht FA</RefAuthor>
        <RefTitle>On Oblique Random Forests</RefTitle>
        <RefYear>2011</RefYear>
        <RefJournal>Machine Learning and Knowledge Discovery in Databases</RefJournal>
        <RefPage>453&#8211;69</RefPage>
        <RefTotal>Menze BH, Kelm BM, Splitthoff DN, Koethe U, Hamprecht FA. On Oblique Random Forests. In: Gunopulos D, Hofmann T, Malerba D, Vazirgiannis M, editors. Machine Learning and Knowledge Discovery in Databases. Berlin, Heidelberg: Springer Berlin Heidelberg; 2011 p. 453&#8211;69. (Lecture Notes in Computer Science; 6912). DOI: 10.1007&#47;978-3-642-23783-6&#95;29</RefTotal>
        <RefLink>http:&#47;&#47;dx.doi.org&#47;10.1007&#47;978-3-642-23783-6&#95;29</RefLink>
      </Reference>
      <Reference refNo="3">
        <RefAuthor>Banerjee M</RefAuthor>
        <RefAuthor>Ding Y</RefAuthor>
        <RefAuthor>Noone AM</RefAuthor>
        <RefTitle>Identifying representative trees from ensembles</RefTitle>
        <RefYear>2012</RefYear>
        <RefJournal>Statistics in Medicine</RefJournal>
        <RefPage>1601&#8211;16</RefPage>
        <RefTotal>Banerjee M, Ding Y, Noone AM. Identifying representative trees from ensembles. Statistics in Medicine. 2012 Jul 10;31(15):1601&#8211;16.</RefTotal>
      </Reference>
      <Reference refNo="4">
        <RefAuthor>Laabs BH</RefAuthor>
        <RefAuthor>Westenberger A</RefAuthor>
        <RefAuthor>K&#246;nig IR</RefAuthor>
        <RefTitle>Identification of representative trees in random forests based on a new tree-based distance measure</RefTitle>
        <RefYear>2023</RefYear>
        <RefJournal>Adv Data Anal Classif</RefJournal>
        <RefPage></RefPage>
        <RefTotal>Laabs BH, Westenberger A, K&#246;nig IR. Identification of representative trees in random forests based on a new tree-based distance measure. Adv Data Anal Classif. 2023. DOI: 10.1007&#47;s11634-023-00537-7</RefTotal>
        <RefLink>https:&#47;&#47;doi.org&#47;10.1007&#47;s11634-023-00537-7</RefLink>
      </Reference>
      <Reference refNo="5">
        <RefAuthor>Szepannek G</RefAuthor>
        <RefAuthor>von Holt BH</RefAuthor>
        <RefTitle>Can&#8217;t see the forest for the trees: Analyzing groves to explain random forests</RefTitle>
        <RefYear>2023</RefYear>
        <RefJournal>Behaviormetrika</RefJournal>
        <RefPage></RefPage>
        <RefTotal>Szepannek G, von Holt BH. Can&#8217;t see the forest for the trees: Analyzing groves to explain random forests. Behaviormetrika. 2023. DOI: 10.1007&#47;s41237-023-00205-2</RefTotal>
        <RefLink>http:&#47;&#47;dx.doi.org&#47;10.1007&#47;s41237-023-00205-2</RefLink>
      </Reference>
      <Reference refNo="6">
        <RefAuthor>Sies A</RefAuthor>
        <RefAuthor>Van Mechelen I</RefAuthor>
        <RefTitle>C443: a Methodology to See a Forest for the Trees</RefTitle>
        <RefYear>2020</RefYear>
        <RefJournal>J Classif</RefJournal>
        <RefPage>730-753</RefPage>
        <RefTotal>Sies A, Van Mechelen I. C443: a Methodology to See a Forest for the Trees. J Classif. 2020;37(3):730-753. DOI: 10.1007&#47;s00357-019-09350-4</RefTotal>
        <RefLink>http:&#47;&#47;dx.doi.org&#47;10.1007&#47;s00357-019-09350-4</RefLink>
      </Reference>
      <Reference refNo="7">
        <RefAuthor>Li H</RefAuthor>
        <RefAuthor>Xu J</RefAuthor>
        <RefAuthor>Armstrong WW</RefAuthor>
        <RefTitle>LHT: Statistically-Driven Oblique Decision Trees for Interpretable Classification &#91;Preprint&#93;</RefTitle>
        <RefYear>2025</RefYear>
        <RefJournal>arXiv</RefJournal>
        <RefPage></RefPage>
        <RefTotal>Li H, Xu J, Armstrong WW. LHT: Statistically-Driven Oblique Decision Trees for Interpretable Classification &#91;Preprint&#93;. arXiv. 2025. DOI: 10.48550&#47;ARXIV.2505.04139</RefTotal>
        <RefLink>http:&#47;&#47;dx.doi.org&#47;10.48550&#47;ARXIV.2505.04139</RefLink>
      </Reference>
    </References>
    <Media>
      <Tables>
        <NoOfTables>0</NoOfTables>
      </Tables>
      <Figures>
        <NoOfPictures>0</NoOfPictures>
      </Figures>
      <InlineFigures>
        <NoOfPictures>0</NoOfPictures>
      </InlineFigures>
      <Attachments>
        <NoOfAttachments>0</NoOfAttachments>
      </Attachments>
    </Media>
  </OrigData>
</GmsArticle>