{"title":"Influence of features discretization on accuracy of random forest classifier for web user identification","authors":"A. A. Vorobeva","doi":"10.23919/FRUCT.2017.8071354","DOIUrl":null,"url":null,"abstract":"Web user identification based on linguistic or stylometric features helps to solve several tasks in computer forensics and cybersecurity, and can be used to prevent and investigate high-tech crimes and crimes where computer is used as a tool. In this paper we present research results on influence of features discretization on accuracy of Random Forest classifier. To evaluate the influence were carried out series of experiments on text corpus, contains Russian online texts of different genres and topics. Was used data sets with various level of class imbalance and amount of training texts per user. The experiments showed that the discretization of features improves the accuracy of identification for all data sets. We obtained positive results for extremely low amount of online messages per one user, and for maximum imbalance level.","PeriodicalId":114353,"journal":{"name":"2017 20th Conference of Open Innovations Association (FRUCT)","volume":"56 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-04-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"9","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 20th Conference of Open Innovations Association (FRUCT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.23919/FRUCT.2017.8071354","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 9
Abstract
Web user identification based on linguistic or stylometric features helps to solve several tasks in computer forensics and cybersecurity, and can be used to prevent and investigate high-tech crimes and crimes where computer is used as a tool. In this paper we present research results on influence of features discretization on accuracy of Random Forest classifier. To evaluate the influence were carried out series of experiments on text corpus, contains Russian online texts of different genres and topics. Was used data sets with various level of class imbalance and amount of training texts per user. The experiments showed that the discretization of features improves the accuracy of identification for all data sets. We obtained positive results for extremely low amount of online messages per one user, and for maximum imbalance level.