{"title":"特征离散化对网络用户识别随机森林分类器精度的影响","authors":"A. A. Vorobeva","doi":"10.23919/FRUCT.2017.8071354","DOIUrl":null,"url":null,"abstract":"Web user identification based on linguistic or stylometric features helps to solve several tasks in computer forensics and cybersecurity, and can be used to prevent and investigate high-tech crimes and crimes where computer is used as a tool. In this paper we present research results on influence of features discretization on accuracy of Random Forest classifier. To evaluate the influence were carried out series of experiments on text corpus, contains Russian online texts of different genres and topics. Was used data sets with various level of class imbalance and amount of training texts per user. The experiments showed that the discretization of features improves the accuracy of identification for all data sets. We obtained positive results for extremely low amount of online messages per one user, and for maximum imbalance level.","PeriodicalId":114353,"journal":{"name":"2017 20th Conference of Open Innovations Association (FRUCT)","volume":"56 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-04-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"9","resultStr":"{\"title\":\"Influence of features discretization on accuracy of random forest classifier for web user identification\",\"authors\":\"A. A. Vorobeva\",\"doi\":\"10.23919/FRUCT.2017.8071354\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Web user identification based on linguistic or stylometric features helps to solve several tasks in computer forensics and cybersecurity, and can be used to prevent and investigate high-tech crimes and crimes where computer is used as a tool. In this paper we present research results on influence of features discretization on accuracy of Random Forest classifier. To evaluate the influence were carried out series of experiments on text corpus, contains Russian online texts of different genres and topics. Was used data sets with various level of class imbalance and amount of training texts per user. The experiments showed that the discretization of features improves the accuracy of identification for all data sets. We obtained positive results for extremely low amount of online messages per one user, and for maximum imbalance level.\",\"PeriodicalId\":114353,\"journal\":{\"name\":\"2017 20th Conference of Open Innovations Association (FRUCT)\",\"volume\":\"56 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2017-04-10\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"9\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2017 20th Conference of Open Innovations Association (FRUCT)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.23919/FRUCT.2017.8071354\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 20th Conference of Open Innovations Association (FRUCT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.23919/FRUCT.2017.8071354","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Influence of features discretization on accuracy of random forest classifier for web user identification
Web user identification based on linguistic or stylometric features helps to solve several tasks in computer forensics and cybersecurity, and can be used to prevent and investigate high-tech crimes and crimes where computer is used as a tool. In this paper we present research results on influence of features discretization on accuracy of Random Forest classifier. To evaluate the influence were carried out series of experiments on text corpus, contains Russian online texts of different genres and topics. Was used data sets with various level of class imbalance and amount of training texts per user. The experiments showed that the discretization of features improves the accuracy of identification for all data sets. We obtained positive results for extremely low amount of online messages per one user, and for maximum imbalance level.