Nick Odegov, Matin Hadzhyiev, Liudmyla Bukata, Liudmyla Glazunova, Marina Kochetkova
{"title":"大数据分类算法的仿真比较","authors":"Nick Odegov, Matin Hadzhyiev, Liudmyla Bukata, Liudmyla Glazunova, Marina Kochetkova","doi":"10.36994/2788-5518-2023-01-05-15","DOIUrl":null,"url":null,"abstract":"With the development of information transmission and storage technologies, the volumes of data that require processing and analysis are growing rapidly. Therefore, the task of developing algorithms for solving various artificial intelligence problems for Big Data volumes is urgent. In our works, this informal term \"Big Data\" refers to situations when known processing algorithms do not allow solving a problem in a practically acceptable time. With regard to classification tasks, such conditions are possible when the first place is not even high reliability (that is, the minimum number of errors), but productivity (classification speed). The well-known method of nearest neighbors is one of the most productive. However, the indicator of the order of growth (the number of typical operations) for it is K x M x N, where K is the number of nearest neighbors, M is the number of classes, N is the typical number of class elements. Along with this, we propose to consider algorithms based on the principles of M-means, where classes are replaced by only a small number of their characteristics. Among such algorithms, the article considers: the algorithm of class centers and the algorithm of adaptive rules. The order of growth for these algorithms is only M according to the number of classes. The comparative analysis of these algorithms is performed by the method of simulation modeling. Simulation models are implemented by the Adaptive Metrics program, developed at the Department of Software Engineering at DUITZ. In this program, the classification problem is solved using the example of the dichotomy problem for classes A and B. The program has the possibility of very flexible setting of models. Problems can be solved in 1-dimensional, 2-dimensional,..., 6-dimensional spaces. 
The distribution of factor values for classes A and B can have quite different statistical characteristics - from uniform and triangular distribution functions to functions approaching a normal distribution. The graphical interface of the program allows you to dynamically observe the solution of the classification problem in one-dimensional, two-dimensional and 6-dimensional projections. As a result of multiple runs of the program, it was established that the algorithms of the nearest neighbors slightly outperform the algorithms of class centers and adaptive rules according to the criterion of reliability, and also comply with the principle of compactness (concentration of the largest number of erroneous solutions in the hypercube of errors). Algorithms based on M-means principles significantly outperform this algorithm in terms of performance. Also, the algorithm of adaptive rules best corresponds to the principle of equality of classes and is the most productive of the considered ones.","PeriodicalId":165726,"journal":{"name":"Інфокомунікаційні та комп’ютерні технології","volume":"27 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"COMPARISON OF BIG DATA CLASSIFICATION ALGORITHMS BY SIMULATION METHODS\",\"authors\":\"Nick Odegov, Matin Hadzhyiev, Liudmyla Bukata, Liudmyla Glazunova, Marina Kochetkova\",\"doi\":\"10.36994/2788-5518-2023-01-05-15\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"With the development of information transmission and storage technologies, the volumes of data that require processing and analysis are growing rapidly. Therefore, the task of developing algorithms for solving various artificial intelligence problems for Big Data volumes is urgent. 
In our works, this informal term \\\"Big Data\\\" refers to situations when known processing algorithms do not allow solving a problem in a practically acceptable time. With regard to classification tasks, such conditions are possible when the first place is not even high reliability (that is, the minimum number of errors), but productivity (classification speed). The well-known method of nearest neighbors is one of the most productive. However, the indicator of the order of growth (the number of typical operations) for it is K x M x N, where K is the number of nearest neighbors, M is the number of classes, N is the typical number of class elements. Along with this, we propose to consider algorithms based on the principles of M-means, where classes are replaced by only a small number of their characteristics. Among such algorithms, the article considers: the algorithm of class centers and the algorithm of adaptive rules. The order of growth for these algorithms is only M according to the number of classes. The comparative analysis of these algorithms is performed by the method of simulation modeling. Simulation models are implemented by the Adaptive Metrics program, developed at the Department of Software Engineering at DUITZ. In this program, the classification problem is solved using the example of the dichotomy problem for classes A and B. The program has the possibility of very flexible setting of models. Problems can be solved in 1-dimensional, 2-dimensional,..., 6-dimensional spaces. The distribution of factor values for classes A and B can have quite different statistical characteristics - from uniform and triangular distribution functions to functions approaching a normal distribution. The graphical interface of the program allows you to dynamically observe the solution of the classification problem in one-dimensional, two-dimensional and 6-dimensional projections. 
As a result of multiple runs of the program, it was established that the algorithms of the nearest neighbors slightly outperform the algorithms of class centers and adaptive rules according to the criterion of reliability, and also comply with the principle of compactness (concentration of the largest number of erroneous solutions in the hypercube of errors). Algorithms based on M-means principles significantly outperform this algorithm in terms of performance. Also, the algorithm of adaptive rules best corresponds to the principle of equality of classes and is the most productive of the considered ones.\",\"PeriodicalId\":165726,\"journal\":{\"name\":\"Інфокомунікаційні та комп’ютерні технології\",\"volume\":\"27 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-01-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Інфокомунікаційні та комп’ютерні технології\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.36994/2788-5518-2023-01-05-15\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Інфокомунікаційні та комп’ютерні технології","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.36994/2788-5518-2023-01-05-15","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
摘要
随着信息传输和存储技术的发展,需要处理和分析的数据量迅速增长。因此,开发算法来解决大数据量下的各种人工智能问题是迫在眉睫的任务。在我们的工作中,这个非正式术语“大数据”是指已知的处理算法不允许在实际可接受的时间内解决问题的情况。对于分类任务,当首要考虑的不是高可靠性(即最少的错误数),而是生产率(分类速度)时,就有可能出现这种情况。众所周知的最近邻法是最有成效的方法之一。然而,它的增长顺序(典型操作次数)的指标是K x M x N,其中K是最近邻居的数量,M是类的数量,N是类元素的典型数量。与此同时,我们建议考虑基于M-means原则的算法,其中类仅被其少量特征所取代。在这些算法中,本文研究了类中心算法和自适应规则算法。根据类的数量,这些算法的增长阶数仅为M。通过仿真建模的方法对这些算法进行了对比分析。仿真模型由DUITZ软件工程系开发的自适应度量程序实现。在本程序中,以A类和b类的二分问题为例来解决分类问题,该程序具有非常灵活设置模型的可能性。问题可以在一维、二维、……, 6维空间。A类和B类的因子值分布可能具有完全不同的统计特征——从均匀和三角形分布函数到接近正态分布的函数。该程序的图形界面允许您在一维,二维和6维投影中动态观察分类问题的解决方案。通过对程序的多次运行,确定了最近邻算法在可靠性准则上略优于类中心算法和自适应规则算法,并且符合紧凑性原则(错误超立方中最大错误解的集中)。基于M-means原理的算法在性能上明显优于该算法。自适应规则算法最符合类的平等原则,是所有算法中效率最高的。
COMPARISON OF BIG DATA CLASSIFICATION ALGORITHMS BY SIMULATION METHODS
With the development of information transmission and storage technologies, the volumes of data that require processing and analysis are growing rapidly. Developing algorithms that solve artificial-intelligence problems at Big Data scale is therefore an urgent task. In our work, the informal term "Big Data" refers to situations in which known processing algorithms cannot solve a problem in a practically acceptable time. For classification tasks, such conditions arise when the primary requirement is not high reliability (a minimum number of errors) but throughput (classification speed). The well-known nearest-neighbors method is one of the most reliable. However, its order of growth (the number of typical operations per query) is K × M × N, where K is the number of nearest neighbors, M is the number of classes, and N is the typical number of elements per class. As an alternative, we propose algorithms based on M-means principles, in which each class is replaced by a small number of its characteristics. Among such algorithms, the article considers the class-centers algorithm and the adaptive-rules algorithm; their order of growth is only M, the number of classes. The algorithms are compared by simulation modeling. The simulation models are implemented in the Adaptive Metrics program, developed at the Department of Software Engineering at DUITZ. The program solves the classification problem on the example of the dichotomy problem for classes A and B, allows very flexible configuration of the models, and supports problems in 1-dimensional through 6-dimensional spaces. The distributions of factor values for classes A and B can have quite different statistical characteristics, from uniform and triangular distribution functions to functions approaching a normal distribution.
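The Adaptive Metrics program itself is not shown in the abstract, so the following is only a minimal sketch of the cost contrast it describes: a class-centers (nearest-centroid) rule that computes M distances per query versus a brute-force k-NN rule that computes a distance to every stored element (on the order of K × M × N operations). All names and the toy data are illustrative assumptions, not the authors' implementation.

```python
import math

def centroid(points):
    """Mean of a list of equal-length tuples (the class 'center')."""
    dims = len(points[0])
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(dims))

def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify_by_centers(x, centers):
    """Class-centers rule: only M distance computations per query."""
    return min(centers, key=lambda label: dist(x, centers[label]))

def classify_knn(x, classes, k=3):
    """Brute-force k-NN: a distance to every stored element per query."""
    labeled = [(dist(x, p), label)
               for label, pts in classes.items() for p in pts]
    labeled.sort()
    top = [label for _, label in labeled[:k]]
    return max(set(top), key=top.count)

# Two-class (dichotomy) example, echoing the article's classes A and B
classes = {
    "A": [(0.0, 0.0), (0.2, 0.1), (-0.1, 0.2)],
    "B": [(1.0, 1.0), (0.9, 1.2), (1.1, 0.8)],
}
centers = {label: centroid(pts) for label, pts in classes.items()}
print(classify_by_centers((0.1, 0.1), centers))  # A
print(classify_knn((0.95, 1.0), classes))        # B
```

The point of the sketch is only the per-query cost: the centers rule touches M summaries regardless of class size, while k-NN must scan every stored element.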
The program's graphical interface lets the user dynamically observe the solution of the classification problem in one-, two-, and six-dimensional projections. Multiple runs of the program established that the nearest-neighbors algorithm slightly outperforms the class-centers and adaptive-rules algorithms on the reliability criterion, and that it complies with the compactness principle (the concentration of the largest number of erroneous decisions in the error hypercube). The algorithms based on M-means principles, however, significantly outperform it in speed. Of the algorithms considered, the adaptive-rules algorithm best satisfies the principle of class equality and is the most productive.
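The reliability comparison above comes from repeated runs of the authors' simulator, which is not reproduced here. As a hedged illustration of the same kind of experiment, the sketch below runs a small Monte Carlo estimate of the error rate of the class-centers rule on synthetic two-class data whose factors follow a triangular distribution (one of the distribution types the abstract mentions). All function names and parameter values are assumptions for illustration only.

```python
import random

random.seed(1)

def sample(center, spread):
    """Draw one point whose factors are triangular-distributed around a center."""
    return tuple(random.triangular(c - spread, c + spread, c) for c in center)

def error_rate(center_a, center_b, spread, trials=10_000):
    """Fraction of simulated points the class-centers rule misclassifies."""
    errors = 0
    for _ in range(trials):
        true_label, c = random.choice([("A", center_a), ("B", center_b)])
        x = sample(c, spread)
        da = sum((xi - ci) ** 2 for xi, ci in zip(x, center_a))
        db = sum((xi - ci) ** 2 for xi, ci in zip(x, center_b))
        guess = "A" if da <= db else "B"
        errors += guess != true_label
    return errors / trials

# Overlapping classes yield a nonzero error rate; well-separated classes
# (centers farther apart than the triangular spread allows) yield exactly 0.0.
print(error_rate((0, 0), (1, 1), spread=1.0))
print(error_rate((0, 0), (5, 5), spread=1.0))  # 0.0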