A. Kurtukova, A. Romanov, A. Fedotova, A. Shelupanov
Proceedings of Tomsk State University of Control Systems and Radioelectronics. DOI: 10.21293/1818-0442-2021-25-1-79-85 (https://doi.org/10.21293/1818-0442-2021-25-1-79-85)
Application of machine learning methods and feature selection based on a genetic algorithm in solving the problem of determining the authorship of a Russian-language text for cybersecurity
The article examines approaches to determining the author of a natural-language text, along with their advantages and disadvantages. Authors are identified using classical machine learning algorithms and neural network architectures (including fastText, CNN, LSTM and their hybrids, and BERT). Model performance is evaluated on a dataset of social media texts. A separate experiment addresses feature selection using a genetic algorithm. An SVM trained on the selected set of 400 features achieves an accuracy increase of up to 10% for all considered numbers of authors. Neural networks achieve a classification accuracy of 96%, but in some cases their training time exceeds that of SVM and other classical machine learning methods. The average accuracy was 66% for SVM combined with the genetic algorithm, 73% for deep neural networks, and 68% for fastText.
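
The abstract does not describe how the genetic algorithm is configured, so the following is only a minimal sketch of genetic-algorithm feature selection in general, not the authors' implementation. Each individual is a boolean mask over the features, and the fitness function scores a mask. In a setting like the one described, the fitness would presumably be the accuracy of a classifier such as an SVM trained on the selected features. Here a toy stand-in fitness is used so that the example runs with the Python standard library alone. The names `ga_select`, `toy_fitness` and `INFORMATIVE`, the operators (tournament selection, one-point crossover, bit-flip mutation, elitism) and all parameter values are assumptions made for illustration.

```python
import random


def ga_select(n_features, fitness, pop_size=30, generations=40,
              mutation_rate=0.02, seed=0):
    """Search for a feature mask that maximizes `fitness`.

    Hypothetical helper, not taken from the paper. Requires n_features >= 2.
    Returns (best_mask, best_score) over all evaluated individuals.
    """
    rng = random.Random(seed)
    # Each individual is a boolean mask: True keeps the feature.
    population = [[rng.random() < 0.5 for _ in range(n_features)]
                  for _ in range(pop_size)]
    best_mask, best_score = None, float("-inf")

    for generation in range(generations):
        scores = [fitness(mask) for mask in population]
        for mask, score in zip(population, scores):
            if score > best_score:
                best_mask, best_score = mask[:], score
        if generation == generations - 1:
            break  # no need to breed after the final evaluation

        def tournament():
            # Pick two distinct individuals and keep the fitter one.
            a, b = rng.sample(range(pop_size), 2)
            return population[a] if scores[a] >= scores[b] else population[b]

        next_population = [best_mask[:]]  # elitism: carry the best forward
        while len(next_population) < pop_size:
            parent1, parent2 = tournament(), tournament()
            cut = rng.randrange(1, n_features)  # one-point crossover
            child = parent1[:cut] + parent2[cut:]
            # Bit-flip mutation: toggle each gene with small probability.
            child = [(not gene) if rng.random() < mutation_rate else gene
                     for gene in child]
            next_population.append(child)
        population = next_population

    return best_mask, best_score


# Toy stand-in for classifier accuracy (an assumption for this sketch):
# reward a hidden set of "informative" features, penalize subset size.
INFORMATIVE = {0, 3, 7, 12}


def toy_fitness(mask):
    hits = sum(1 for i in INFORMATIVE if mask[i])
    return hits - 0.1 * sum(mask)


mask, score = ga_select(20, toy_fitness)
selected = [i for i, keep in enumerate(mask) if keep]
print("selected features:", selected, "fitness:", score)
```

To adapt the sketch to a real classifier, `toy_fitness` would be replaced by a function that trains and validates the classifier on the columns where the mask is true. Because every individual is evaluated in every generation, the cost of that function dominates the run time.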