Accurate and interpretable regression trees using oracle coaching

2014 IEEE Symposium on Computational Intelligence and Data Mining (CIDM) | Pub Date: 2014-12-01 | DOI: 10.1109/CIDM.2014.7008667

U. Johansson, Cecilia Sönströd, Rikard König


Citations: 6

Abstract

In many real-world scenarios, predictive models need to be interpretable, which rules out many machine learning techniques known to produce very accurate models, e.g., neural networks, support vector machines and all ensemble schemes. Tree models or rule sets are most often used instead, typically resulting in significantly lower predictive performance. The overall purpose of oracle coaching is to reduce this accuracy vs. comprehensibility trade-off by producing interpretable models optimized for the specific production set at hand. The method requires production set inputs to be available when the predictive model is generated, a requirement fulfilled in most, but not all, predictive modeling scenarios. In oracle coaching, a highly accurate but opaque model is first induced from the training data. This model (“the oracle”) is then used to label both the training instances and the production instances. Finally, interpretable models are trained using different combinations of the resulting data sets. In this paper, oracle coaching produces regression trees, using neural networks and random forests as oracles. Experiments on 32 publicly available data sets show that oracle coaching leads to significantly improved predictive performance compared to standard induction. It is also shown that a highly accurate opaque model can be used successfully as a pre-processing step to reduce the noise typically present in data, even when production inputs are not available. In fact, just augmenting or replacing the training data with another copy of the training set, but with the predictions from the opaque model as targets, produced significantly more accurate and/or more compact regression trees.
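
The three steps described in the abstract (induce an opaque oracle, relabel instances with it, train an interpretable tree on a combination of the resulting sets) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the oracle here is a small bagged ensemble of deep trees standing in for the paper's neural networks and random forests, the tree learner is a bare-bones variance-reduction splitter, and the names fit_tree, fit_oracle, coaching_set and the mode labels "replace", "augment" and "coach" are hypothetical. The synthetic data is made up for the demo and says nothing about the paper's 32 data sets.

```python
import math
import random

def fit_tree(X, y, max_depth=3, min_leaf=5):
    """Tiny regression tree: a leaf is a float, a node is (feature, threshold, left, right)."""
    n = len(y)
    if max_depth == 0 or n < 2 * min_leaf:
        return sum(y) / n
    total, total2 = sum(y), sum(v * v for v in y)
    best = None  # (sum of squared errors, feature index, threshold)
    for j in range(len(X[0])):
        order = sorted(range(n), key=lambda i: X[i][j])
        s = s2 = 0.0
        for k in range(1, n):  # first k sorted points go left
            v = y[order[k - 1]]
            s, s2 = s + v, s2 + v * v
            lo, hi = X[order[k - 1]][j], X[order[k]][j]
            if k < min_leaf or n - k < min_leaf or lo == hi:
                continue
            cost = (s2 - s * s / k) + (total2 - s2 - (total - s) ** 2 / (n - k))
            if best is None or cost < best[0]:
                best = (cost, j, lo)
    if best is None:
        return total / n
    _, j, t = best
    left = [i for i in range(n) if X[i][j] <= t]
    right = [i for i in range(n) if X[i][j] > t]
    return (j, t,
            fit_tree([X[i] for i in left], [y[i] for i in left], max_depth - 1, min_leaf),
            fit_tree([X[i] for i in right], [y[i] for i in right], max_depth - 1, min_leaf))

def predict(tree, x):
    """Walk from the root to a leaf and return the leaf mean."""
    while isinstance(tree, tuple):
        j, t, left, right = tree
        tree = left if x[j] <= t else right
    return tree

def fit_oracle(X, y, n_trees=25, seed=0):
    """ASSUMPTION: bagged deep trees as a stand-in for the opaque oracle."""
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(y)) for _ in y]  # bootstrap sample
        trees.append(fit_tree([X[i] for i in idx], [y[i] for i in idx],
                              max_depth=8, min_leaf=2))
    return lambda x: sum(predict(t, x) for t in trees) / len(trees)

def coaching_set(X_train, y_train, X_prod, oracle, mode="coach"):
    """Build one of the data-set combinations (mode names are hypothetical)."""
    oracle_train = [oracle(x) for x in X_train]
    if mode == "replace":   # training inputs, oracle predictions as targets
        return X_train, oracle_train
    if mode == "augment":   # original training data plus an oracle-labelled copy
        return X_train + X_train, y_train + oracle_train
    if mode == "coach":     # oracle-labelled training and production instances
        return X_train + X_prod, oracle_train + [oracle(x) for x in X_prod]
    raise ValueError(f"unknown mode: {mode}")

# Synthetic demo data (made up for illustration only).
rng = random.Random(1)
def make(n):
    X = [[rng.uniform(-3, 3), rng.uniform(-3, 3)] for _ in range(n)]
    y = [math.sin(a) + 0.5 * b + rng.gauss(0, 0.3) for a, b in X]
    return X, y

X_train, y_train = make(150)
X_prod, _ = make(60)  # production targets are unknown at modelling time

oracle = fit_oracle(X_train, y_train)
X_c, y_c = coaching_set(X_train, y_train, X_prod, oracle, mode="coach")
tree = fit_tree(X_c, y_c, max_depth=3)  # the interpretable model
preds = [predict(tree, x) for x in X_prod]
```

The "replace" and "augment" modes need no production inputs, which corresponds to the pre-processing use mentioned at the end of the abstract; only "coach" relies on the production set being available when the tree is built.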
