K-NN'S NEAREST NEIGHBORS METHOD FOR CLASSIFYING TEXT DOCUMENTS BY THEIR TOPICS

IF 0.2 Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE
N. I. Boyko, V. Yu. Mykhailyshyn
{"title":"根据主题对文本文档进行分类的K-nn最近邻方法","authors":"N. I. Boyko, V. Yu. Mykhailyshyn","doi":"10.15588/1607-3274-2023-3-9","DOIUrl":null,"url":null,"abstract":"Context. Optimization of the method of nearest neighbors k-NN for the classification of text documents by their topics and experimentally solving the problem based on the method.
 Objective. The study aims to study the method of nearest neighbors k-NN for classifying text documents by their topics. The task of the study is to classify text documents by their topics based on a dataset for the optimal time and with high accuracy.
 Method. The k-nearest neighbors (k-NN) method is a metric algorithm for automatic object classification or regression. The k-NN algorithm stores all existing data and categorizes the new point based on the distance between the new point and all points in the training set. For this, a certain distance metric, such as Euclidean distance, is used. In the learning process, k-NN stores all the data from the training set, so it belongs to the “lazy” algorithms since learning takes place at the time of classification. The algorithm makes no assumptions about the distribution of data and it is nonparametric. The task of the k-NN algorithm is to assign a certain category to the test document x based on the categories k of the nearest neighbors from the training dataset. The similarity between the test document x and each of the closest neighbors is scored by the category to which the neighbor belongs. If several of k’s closest neighbors belong to the same category, then the similarity score of that category for the test document x is calculated as the sum of the category scores for each of these closest neighbors. After that, the categories are ranked by score, and the test document is assigned to the category with the highest score.
 Results. The k-NN method for classifying text documents has been successfully implemented. Experiments have been conducted with various methods that affect the efficiency of k-NN, such as the choice of algorithm and metrics. The results of the experiments showed that the use of certain methods can improve the accuracy of classification and the efficiency of the model.
 Conclusions. Displaying the results on different metrics and algorithms showed that choosing a particular algorithm and metric can have a significant impact on the accuracy of predictions. The application of the ball tree algorithm, as well as the use of different metrics, such as Manhattan or Euclidean distance, can lead to improved results. Using clustering before applying k-NN has been shown to have a positive effect on results and allows for better grouping of data and reduces the impact of noise or misclassified points, which leads to improved accuracy and class distribution.","PeriodicalId":43783,"journal":{"name":"Radio Electronics Computer Science Control","volume":"56 1","pages":"0"},"PeriodicalIF":0.2000,"publicationDate":"2023-10-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"K-NN’S NEAREST NEIGHBORS METHOD FOR CLASSIFYING TEXT DOCUMENTS BY THEIR TOPICS\",\"authors\":\"N. I. Boyko, V. Yu. Mykhailyshyn\",\"doi\":\"10.15588/1607-3274-2023-3-9\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Context. Optimization of the method of nearest neighbors k-NN for the classification of text documents by their topics and experimentally solving the problem based on the method.
 Objective. The study aims to study the method of nearest neighbors k-NN for classifying text documents by their topics. The task of the study is to classify text documents by their topics based on a dataset for the optimal time and with high accuracy.
 Method. The k-nearest neighbors (k-NN) method is a metric algorithm for automatic object classification or regression. The k-NN algorithm stores all existing data and categorizes the new point based on the distance between the new point and all points in the training set. For this, a certain distance metric, such as Euclidean distance, is used. In the learning process, k-NN stores all the data from the training set, so it belongs to the “lazy” algorithms since learning takes place at the time of classification. The algorithm makes no assumptions about the distribution of data and it is nonparametric. The task of the k-NN algorithm is to assign a certain category to the test document x based on the categories k of the nearest neighbors from the training dataset. The similarity between the test document x and each of the closest neighbors is scored by the category to which the neighbor belongs. If several of k’s closest neighbors belong to the same category, then the similarity score of that category for the test document x is calculated as the sum of the category scores for each of these closest neighbors. After that, the categories are ranked by score, and the test document is assigned to the category with the highest score.
 Results. The k-NN method for classifying text documents has been successfully implemented. Experiments have been conducted with various methods that affect the efficiency of k-NN, such as the choice of algorithm and metrics. The results of the experiments showed that the use of certain methods can improve the accuracy of classification and the efficiency of the model.
 Conclusions. Displaying the results on different metrics and algorithms showed that choosing a particular algorithm and metric can have a significant impact on the accuracy of predictions. The application of the ball tree algorithm, as well as the use of different metrics, such as Manhattan or Euclidean distance, can lead to improved results. Using clustering before applying k-NN has been shown to have a positive effect on results and allows for better grouping of data and reduces the impact of noise or misclassified points, which leads to improved accuracy and class distribution.\",\"PeriodicalId\":43783,\"journal\":{\"name\":\"Radio Electronics Computer Science Control\",\"volume\":\"56 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.2000,\"publicationDate\":\"2023-10-13\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Radio Electronics Computer Science Control\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.15588/1607-3274-2023-3-9\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q4\",\"JCRName\":\"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Radio Electronics Computer Science Control","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.15588/1607-3274-2023-3-9","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
Citations: 0

Abstract

Context. Optimization of the k-nearest neighbors (k-NN) method for classifying text documents by their topics, and an experimental solution of the problem based on this method.

Objective. The study investigates the k-NN method for classifying text documents by their topics. Its task is to classify the documents of a dataset by topic in optimal time and with high accuracy.

Method. The k-nearest neighbors (k-NN) method is a metric algorithm for automatic object classification or regression. The algorithm stores all existing data and categorizes a new point based on the distance between that point and all points in the training set, using a chosen distance metric such as Euclidean distance. Because k-NN stores all the training data and defers computation until classification time, it belongs to the "lazy" algorithms; it also makes no assumptions about the distribution of the data and is therefore nonparametric. The task of the k-NN algorithm is to assign a category to a test document x based on the categories of its k nearest neighbors in the training dataset. The similarity between the test document x and each nearest neighbor is scored for the category to which that neighbor belongs. If several of the k nearest neighbors belong to the same category, the similarity score of that category for the test document x is the sum of the scores of those neighbors. The categories are then ranked by score, and the test document is assigned to the category with the highest score.

Results. The k-NN method for classifying text documents has been implemented. Experiments were conducted with the factors that affect the efficiency of k-NN, such as the choice of neighbor-search algorithm and distance metric. The results showed that these choices can improve both the accuracy of classification and the efficiency of the model.

Conclusions. Comparing the results across metrics and algorithms showed that the choice of a particular algorithm and metric can significantly affect prediction accuracy. Applying the ball tree algorithm, and using different metrics such as Manhattan or Euclidean distance, can improve the results. Clustering the data before applying k-NN was shown to have a positive effect: it groups the data better and reduces the impact of noise and misclassified points, which improves accuracy and class distribution.
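The paper itself includes no code, but the voting procedure described in the Method section maps directly onto a standard k-NN pipeline. The sketch below is a hypothetical illustration, not the authors' implementation: scikit-learn, TF-IDF features, the 20 Newsgroups corpus as a stand-in dataset, and k = 5 are all assumptions of this example.

```python
# Minimal k-NN text-classification sketch of the procedure the Method
# section describes. Assumptions (not stated in the paper): scikit-learn,
# TF-IDF features, the 20 Newsgroups corpus, k = 5.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")

# Represent each document as a sparse TF-IDF vector.
vectorizer = TfidfVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(train.data)
X_test = vectorizer.transform(test.data)

# "Lazy" learner: fit() only stores the training vectors, and all
# distance computation happens at prediction time. weights="distance"
# scores each neighbor by closeness and sums the scores per category,
# roughly the category-score summation the abstract describes.
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean",
                           weights="distance")
knn.fit(X_train, train.target)

pred = knn.predict(X_test)
print(f"accuracy: {accuracy_score(test.target, pred):.3f}")
```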
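The Conclusions compare neighbor-search algorithms (the ball tree) and distance metrics (Manhattan vs. Euclidean). One way such a comparison might look, reusing the variables from the previous sketch, is shown below; reducing the TF-IDF vectors with TruncatedSVD is an added assumption of this example, since scikit-learn's ball tree cannot index sparse matrices.

```python
# Sketch of the algorithm/metric comparison the Conclusions describe.
# TruncatedSVD densifies the TF-IDF vectors because algorithm="ball_tree"
# requires dense input; 100 components is an arbitrary illustrative
# choice. Reuses train/test and X_train/X_test from the sketch above.
from sklearn.decomposition import TruncatedSVD
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

svd = TruncatedSVD(n_components=100, random_state=0)
Z_train = svd.fit_transform(X_train)
Z_test = svd.transform(X_test)

for metric in ("euclidean", "manhattan"):
    knn = KNeighborsClassifier(n_neighbors=5,
                               algorithm="ball_tree",
                               metric=metric)
    knn.fit(Z_train, train.target)
    acc = accuracy_score(test.target, knn.predict(Z_test))
    print(f"ball_tree + {metric}: accuracy {acc:.3f}")
```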
Source journal
Radio Electronics Computer Science Control (COMPUTER SCIENCE, HARDWARE & ARCHITECTURE)
Self-citation rate: 20.00%
Articles per year: 66
Review time: 12 weeks