An Online Tree-Based Approach for Mining Non-Stationary High-Speed Data Streams

Agustín Alejandro Ortiz Díaz, Isvani Inocencio Frías Blanco, L. M. Mariño, F. Baldo
{"title":"一种基于在线树的非平稳高速数据流挖掘方法","authors":"Agustín Alejandro Ortiz Díaz, Isvani Inocencio Frías Blanco, L. M. Mariño, F. Baldo","doi":"10.22456/2175-2745.90822","DOIUrl":null,"url":null,"abstract":"This paper presents a new learning algorithm for inducing decision trees from data streams. In these domains, large amounts of data are constantly arriving over time, possibly at high speed. The proposed algorithm uses a top-down induction method for building trees, splitting leaf nodes recursively, until none of them can be expanded. The new algorithm combines two split methods in the tree induction. The first method is able to guarantee, with statistical significance, that each split chosen would be the same as that chosen using infinite examples. By doing so, it aims at ensuring that the tree induced online is close to the optimal model. However, this split method often needs too many examples to make a decision about the best split, which delays the accuracy improvement of the online predictive learning model. Therefore, the second method is used to split nodes more quickly, speeding up the tree growth. The second split method is based on the observation that larger trees are able to store more information about the training examples and to represent more complex concepts. The first split method is also used to correct splits previously suggested by the second one, when it has sufficient evidence. Finally, an additional procedure rebuilds the tree model according to the suggestions made with an adequate level of statistical significance. The proposed algorithm is empirically compared with several well-known induction algorithms for learning decision trees from data streams. In the tests it is possible to observe that the proposed algorithm is more competitive in terms of accuracy and model size using various synthetic and real world datasets.","PeriodicalId":82472,"journal":{"name":"Research initiative, treatment action : RITA","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2020-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"An Online Tree-Based Approach for Mining Non-Stationary High-Speed Data Streams\",\"authors\":\"Agustín Alejandro Ortiz Díaz, Isvani Inocencio Frías Blanco, L. M. Mariño, F. Baldo\",\"doi\":\"10.22456/2175-2745.90822\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This paper presents a new learning algorithm for inducing decision trees from data streams. In these domains, large amounts of data are constantly arriving over time, possibly at high speed. The proposed algorithm uses a top-down induction method for building trees, splitting leaf nodes recursively, until none of them can be expanded. The new algorithm combines two split methods in the tree induction. The first method is able to guarantee, with statistical significance, that each split chosen would be the same as that chosen using infinite examples. By doing so, it aims at ensuring that the tree induced online is close to the optimal model. However, this split method often needs too many examples to make a decision about the best split, which delays the accuracy improvement of the online predictive learning model. Therefore, the second method is used to split nodes more quickly, speeding up the tree growth. The second split method is based on the observation that larger trees are able to store more information about the training examples and to represent more complex concepts. 
The first split method is also used to correct splits previously suggested by the second one, when it has sufficient evidence. Finally, an additional procedure rebuilds the tree model according to the suggestions made with an adequate level of statistical significance. The proposed algorithm is empirically compared with several well-known induction algorithms for learning decision trees from data streams. In the tests it is possible to observe that the proposed algorithm is more competitive in terms of accuracy and model size using various synthetic and real world datasets.\",\"PeriodicalId\":82472,\"journal\":{\"name\":\"Research initiative, treatment action : RITA\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-01-15\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Research initiative, treatment action : RITA\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.22456/2175-2745.90822\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Research initiative, treatment action : RITA","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.22456/2175-2745.90822","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0

Abstract

This paper presents a new learning algorithm for inducing decision trees from data streams. In these domains, large amounts of data arrive continuously over time, possibly at high speed. The proposed algorithm builds trees by top-down induction, splitting leaf nodes recursively until none of them can be expanded further. The new algorithm combines two split methods during tree induction. The first method guarantees, with statistical significance, that each chosen split is the same one that would be chosen with infinitely many examples; this aims to ensure that the tree induced online is close to the optimal model. However, this split method often needs too many examples to decide on the best split, which delays the accuracy improvement of the online predictive model. The second method is therefore used to split nodes more quickly, speeding up tree growth; it is based on the observation that larger trees can store more information about the training examples and represent more complex concepts. The first split method is also used to correct splits previously suggested by the second one, once it has gathered sufficient evidence. Finally, an additional procedure rebuilds the tree model according to the suggestions made with an adequate level of statistical significance. The proposed algorithm is empirically compared with several well-known algorithms for learning decision trees from data streams. On various synthetic and real-world datasets, the proposed algorithm is competitive in terms of accuracy and model size.
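
To make the two-stage split idea concrete, the sketch below combines a conservative, statistically grounded split test with a faster, more permissive one. It is not the authors' implementation: the Hoeffding-bound test stands in for the first (statistically significant) criterion, and a simple minimum-examples-plus-margin rule stands in for the faster second criterion. The function names, thresholds, and the confirmed/tentative labels are illustrative assumptions.

    # Illustrative sketch (not the paper's implementation) of combining a
    # conservative Hoeffding-bound split test with a faster, more permissive one.
    # All names and thresholds below are assumptions made for illustration.
    import math

    def hoeffding_bound(value_range, confidence, n):
        # Epsilon such that, with probability 1 - confidence, the observed mean of a
        # variable with the given range over n samples is within epsilon of its true mean.
        return math.sqrt(value_range ** 2 * math.log(1.0 / confidence) / (2.0 * n))

    def split_decision(best_gain, second_gain, n,
                       gain_range=1.0, confidence=1e-6,
                       fast_min_examples=200, fast_margin=0.05):
        # best_gain / second_gain are the split-evaluation scores (e.g. information
        # gain) of the two best candidate attributes at a leaf that has seen n examples.
        eps = hoeffding_bound(gain_range, confidence, n)

        # Conservative test: the gap is large enough that, with high probability,
        # the best attribute seen so far is the one infinitely many examples would pick.
        if best_gain - second_gain > eps:
            return "confirmed"

        # Fast test: split earlier on weaker evidence so the tree keeps growing;
        # such splits stay tentative and can later be corrected (or the subtree
        # rebuilt) once the conservative test has accumulated enough evidence.
        if n >= fast_min_examples and best_gain - second_gain > fast_margin:
            return "tentative"

        return None

    # After 300 examples, gains 0.30 vs 0.24 are not separated by the Hoeffding
    # bound (about 0.15), but the fast rule accepts the split tentatively.
    print(split_decision(0.30, 0.24, 300))   # -> "tentative"
    print(split_decision(0.30, 0.10, 300))   # -> "confirmed"

In this reading, "confirmed" splits correspond to decisions the first method can defend statistically, while "tentative" splits trade some guarantee for faster tree growth and remain subject to later correction, matching the correction and rebuilding steps described in the abstract.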