Agustín Alejandro Ortiz Díaz, Isvani Inocencio Frías Blanco, L. M. Mariño, F. Baldo
{"title":"一种基于在线树的非平稳高速数据流挖掘方法","authors":"Agustín Alejandro Ortiz Díaz, Isvani Inocencio Frías Blanco, L. M. Mariño, F. Baldo","doi":"10.22456/2175-2745.90822","DOIUrl":null,"url":null,"abstract":"This paper presents a new learning algorithm for inducing decision trees from data streams. In these domains, large amounts of data are constantly arriving over time, possibly at high speed. The proposed algorithm uses a top-down induction method for building trees, splitting leaf nodes recursively, until none of them can be expanded. The new algorithm combines two split methods in the tree induction. The first method is able to guarantee, with statistical significance, that each split chosen would be the same as that chosen using infinite examples. By doing so, it aims at ensuring that the tree induced online is close to the optimal model. However, this split method often needs too many examples to make a decision about the best split, which delays the accuracy improvement of the online predictive learning model. Therefore, the second method is used to split nodes more quickly, speeding up the tree growth. The second split method is based on the observation that larger trees are able to store more information about the training examples and to represent more complex concepts. The first split method is also used to correct splits previously suggested by the second one, when it has sufficient evidence. Finally, an additional procedure rebuilds the tree model according to the suggestions made with an adequate level of statistical significance. The proposed algorithm is empirically compared with several well-known induction algorithms for learning decision trees from data streams. In the tests it is possible to observe that the proposed algorithm is more competitive in terms of accuracy and model size using various synthetic and real world datasets.","PeriodicalId":82472,"journal":{"name":"Research initiative, treatment action : RITA","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2020-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"An Online Tree-Based Approach for Mining Non-Stationary High-Speed Data Streams\",\"authors\":\"Agustín Alejandro Ortiz Díaz, Isvani Inocencio Frías Blanco, L. M. Mariño, F. Baldo\",\"doi\":\"10.22456/2175-2745.90822\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This paper presents a new learning algorithm for inducing decision trees from data streams. In these domains, large amounts of data are constantly arriving over time, possibly at high speed. The proposed algorithm uses a top-down induction method for building trees, splitting leaf nodes recursively, until none of them can be expanded. The new algorithm combines two split methods in the tree induction. The first method is able to guarantee, with statistical significance, that each split chosen would be the same as that chosen using infinite examples. By doing so, it aims at ensuring that the tree induced online is close to the optimal model. However, this split method often needs too many examples to make a decision about the best split, which delays the accuracy improvement of the online predictive learning model. Therefore, the second method is used to split nodes more quickly, speeding up the tree growth. The second split method is based on the observation that larger trees are able to store more information about the training examples and to represent more complex concepts. The first split method is also used to correct splits previously suggested by the second one, when it has sufficient evidence. Finally, an additional procedure rebuilds the tree model according to the suggestions made with an adequate level of statistical significance. The proposed algorithm is empirically compared with several well-known induction algorithms for learning decision trees from data streams. In the tests it is possible to observe that the proposed algorithm is more competitive in terms of accuracy and model size using various synthetic and real world datasets.\",\"PeriodicalId\":82472,\"journal\":{\"name\":\"Research initiative, treatment action : RITA\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-01-15\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Research initiative, treatment action : RITA\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.22456/2175-2745.90822\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Research initiative, treatment action : RITA","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.22456/2175-2745.90822","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
An Online Tree-Based Approach for Mining Non-Stationary High-Speed Data Streams
This paper presents a new learning algorithm for inducing decision trees from data streams. In these domains, large amounts of data are constantly arriving over time, possibly at high speed. The proposed algorithm uses a top-down induction method for building trees, splitting leaf nodes recursively, until none of them can be expanded. The new algorithm combines two split methods in the tree induction. The first method is able to guarantee, with statistical significance, that each split chosen would be the same as that chosen using infinite examples. By doing so, it aims at ensuring that the tree induced online is close to the optimal model. However, this split method often needs too many examples to make a decision about the best split, which delays the accuracy improvement of the online predictive learning model. Therefore, the second method is used to split nodes more quickly, speeding up the tree growth. The second split method is based on the observation that larger trees are able to store more information about the training examples and to represent more complex concepts. The first split method is also used to correct splits previously suggested by the second one, when it has sufficient evidence. Finally, an additional procedure rebuilds the tree model according to the suggestions made with an adequate level of statistical significance. The proposed algorithm is empirically compared with several well-known induction algorithms for learning decision trees from data streams. In the tests it is possible to observe that the proposed algorithm is more competitive in terms of accuracy and model size using various synthetic and real world datasets.