INFORMATION TECHNOLOGY FOR STATISTICAL CLUSTER ANALYSIS OF INFORMATION IN COMPLEX NETWORKS

Computer systems and information technologies. Pub Date: 2022-12-29. DOI: 10.31891/csit-2022-4-7

Oksana Kyrychenko



Abstract

An information technology has been developed for collecting, processing, and storing large volumes of data from the web space. It is used to study the statistical characteristics of various segments of the web space and their cluster structure.

A purpose-built crawler collects information on a given segment of the web space. For each studied zone, the following statistical characteristics are computed: node degree, clustering coefficient, and the probability distributions of nodes by incoming and outgoing links. Directed and undirected graphs of the web pages in the studied zones are constructed. Combining the dependencies calculated for the incoming and outgoing subnetworks yields the statistical characteristics of the undirected graphs of the web pages in the zones under investigation.

For cluster analysis, the optimal number of clusters and the cluster centers are found in two ways: with the well-known k-core decomposition algorithm and with a new method developed by the author. The new algorithm is based on the distribution of the eigenvalues of the stochastic matrix that describes the Markov transitions in the system. The cluster structure of the various segments of the web space is then studied with the Power Iteration Clustering algorithm.

The advantage of the developed information technology is that it makes it possible to work with large data sets collected on the Internet, to study their structure and statistical characteristics, and to perform clustering.
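
The statistical characteristics and the k-core decomposition named above can be illustrated with a minimal sketch. It is not the author's software: the toy edge list and all function names (`degree_distribution`, `undirected_adjacency`, `clustering_coefficient`, `core_numbers`) are assumptions introduced here for illustration, and a real crawl would supply the hyperlink pairs.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical toy crawl result: directed hyperlinks (source page, target page).
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a"), ("c", "d"), ("d", "e")]


def degree_distribution(edges, direction):
    """Return {k: fraction of nodes whose in- or out-degree equals k}."""
    nodes = {n for e in edges for n in e}
    position = 1 if direction == "in" else 0
    degree = Counter(e[position] for e in edges)
    counts = Counter(degree.get(n, 0) for n in nodes)
    return {k: c / len(nodes) for k, c in sorted(counts.items())}


def undirected_adjacency(edges):
    """Collapse the directed links into an undirected adjacency map."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-links
            adj[u].add(v)
            adj[v].add(u)
    return dict(adj)


def clustering_coefficient(adj, node):
    """Share of a node's neighbour pairs that are themselves linked."""
    nbrs = adj[node]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
    return 2 * links / (k * (k - 1))


def core_numbers(adj):
    """k-core decomposition by repeatedly peeling a minimum-degree node."""
    degree = {n: len(nbrs) for n, nbrs in adj.items()}
    remaining = set(adj)
    core, k = {}, 0
    while remaining:
        node = min(remaining, key=lambda n: (degree[n], n))
        k = max(k, degree[node])  # core index never decreases while peeling
        core[node] = k
        remaining.remove(node)
        for nbr in adj[node]:
            if nbr in remaining:
                degree[nbr] -= 1
    return core


adj = undirected_adjacency(edges)
print(degree_distribution(edges, "in"))
print(degree_distribution(edges, "out"))
print({n: clustering_coefficient(adj, n) for n in sorted(adj)})
print(core_numbers(adj))
```

In this toy graph the triangle a, b, c forms the 2-core, while d and e belong only to the 1-core. The peeling loop is quadratic in the number of nodes, which is acceptable for a sketch but not for large crawls.
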
To carry out the clustering and find the optimal number of clusters and the centroids, a new algorithm is proposed. The results indicate that the algorithm determines the optimal number of clusters with high accuracy.
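
The abstract does not give the details of the proposed algorithm, so the sketch below does not reproduce it. It shows only the general idea of reading a cluster count from the eigenvalues of a Markov transition matrix, using the common eigengap heuristic from spectral clustering as a stand-in. The toy graph, the function names, and the choice of a Jacobi eigenvalue solver are all assumptions made for illustration.

```python
import math

# Hypothetical toy graph: two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]


def normalized_adjacency(edges):
    """Build S = D^-1/2 A D^-1/2 for an undirected, unweighted edge list.

    S is symmetric and similar to the row-stochastic random-walk matrix
    P = D^-1 A, so both matrices have the same (real) eigenvalues.
    """
    nodes = sorted({n for e in edges for n in e})
    index = {n: i for i, n in enumerate(nodes)}
    size = len(nodes)
    a = [[0.0] * size for _ in range(size)]
    for u, v in edges:
        a[index[u]][index[v]] = a[index[v]][index[u]] = 1.0
    degree = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(size)]
            for i in range(size)]


def symmetric_eigenvalues(matrix, sweeps=50, tol=1e-20):
    """Eigenvalues of a symmetric matrix via cyclic Jacobi rotations."""
    a = [row[:] for row in matrix]
    n = len(a)
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                diff = a[q][q] - a[p][p]
                # Rotation angle that zeroes a[p][q]; |theta| <= pi/4.
                theta = (math.pi / 4 if diff == 0.0
                         else 0.5 * math.atan(2 * a[p][q] / diff))
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # columns p and q: A <- A J
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # rows p and q: A <- J^T A
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return sorted((a[i][i] for i in range(n)), reverse=True)


def eigengap_cluster_count(eigenvalues):
    """Return k where the gap between the k-th and (k+1)-th eigenvalue is widest."""
    gaps = [eigenvalues[i] - eigenvalues[i + 1]
            for i in range(len(eigenvalues) - 1)]
    return gaps.index(max(gaps)) + 1


eigenvalues = symmetric_eigenvalues(normalized_adjacency(edges))
print(eigenvalues)
print(eigengap_cluster_count(eigenvalues))
```

For this graph the two largest eigenvalues are 1 and (1 + √73)/12 ≈ 0.795, followed by a wide gap down to −1/6, so the heuristic reports two clusters, matching the two triangles. On real web graphs the gap is rarely this clear, and the estimate should be treated as a starting point.
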

