Gaussian Naive Bayesian Data Classification Model Based on Clustering Algorithm

Zeng-jun Bi, Yao-quan Han, Cai-quan Huang, Min Wang
{"title":"基于聚类算法的高斯朴素贝叶斯数据分类模型","authors":"Zeng-jun Bi, Yao-quan Han, Cai-quan Huang, Min Wang","doi":"10.2991/masta-19.2019.67","DOIUrl":null,"url":null,"abstract":"A gaussian naive bayesian data classification model based on clustering algorithm was proposed for fast recognition and classification of unknown continuous data containing a large number of non-priori knowledge. Firstly, the unknown data were extracted from the representative samples according to the information entropy measure for clustering to generate class labels. Then, the mapping relationship between data and class labels was established by using the gaussian naive bayes algorithm, and the classification model was obtained through training. Simulation results show that this unsupervised analysis process has a good classification effect on new data. Introduction Classification is an important part of data mining. By learning training data, the mapping relationship between training data and predefined classes can be established[1]. In order to make the traditional classification algorithm classify data well without predetermined classification for learning semi-supervised or even unsupervised methods are used to improve the classification algorithm[2]. Literature [3] uses semi-supervised naive bayes classification algorithm to establish initial classification for a small number of data sets with class labels, and continuously updates the data with high classification accuracy to the training set when predicting and classifying the data without labels, so as to realize semi-supervised learning of data classification. However, this algorithm fails to fundamentally realize the unsupervised generation of class labels of data to be classified, and prior knowledge still plays a crucial role in the training of classification algorithm. Clustering is an unsupervised process in which the most similar objects are divided into a class based on the objects found in the data and their relationships[4,5]; Literature [6] applies unsupervised clustering to text clustering and constructs an automatic text classification model based on vector space model. However, the model is not suitable for the classification of continuous variables. Therefore, this paper combines the clustering algorithm with the gaussian naive bayes classification algorithm, and proposes an unsupervised classification model suitable for continuous variable data. In this method, small representative samples are extracted from large samples by information entropy theory, and prediction classes of observation data are generated by clustering algorithm as predefined target classes of classification algorithm, so that data are classified and prediction models are established without prior knowledge. Simulation results show that this model is efficient in classifying and processing new data, and only a small part of sample extraction is needed to train the classification model of the whole data, which greatly saves computing resources and time. Selection of Clustering Algorithm Classical clustering algorithm can be divided into hierarchical clustering algorithm, divide-based clustering algorithm and density-based clustering algorithm. The corresponding representative classical algorithms are k-means, condensed hierarchical clustering algorithm and DBSCAN. 
Clustering performance measurement measures the performance of clustering algorithms under different environments according to the accuracy, consistency and other indicators of various clustering algorithms for sample division. ARI index is used to measure the consistency between the data label calculated by the clustering algorithm and the original label. The expression is: International Conference on Modeling, Analysis, Simulation Technologies and Applications (MASTA 2019) Copyright © 2019, the Authors. Published by Atlantis Press. This is an open access article under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/). Advances in Intelligent Systems Research, volume 168","PeriodicalId":103896,"journal":{"name":"Proceedings of the 2019 International Conference on Modeling, Analysis, Simulation Technologies and Applications (MASTA 2019)","volume":"57 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"14","resultStr":"{\"title\":\"Gaussian Naive Bayesian Data Classification Model Based on Clustering Algorithm\",\"authors\":\"Zeng-jun Bi, Yao-quan Han, Cai-quan Huang, Min Wang\",\"doi\":\"10.2991/masta-19.2019.67\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"A gaussian naive bayesian data classification model based on clustering algorithm was proposed for fast recognition and classification of unknown continuous data containing a large number of non-priori knowledge. Firstly, the unknown data were extracted from the representative samples according to the information entropy measure for clustering to generate class labels. Then, the mapping relationship between data and class labels was established by using the gaussian naive bayes algorithm, and the classification model was obtained through training. Simulation results show that this unsupervised analysis process has a good classification effect on new data. Introduction Classification is an important part of data mining. By learning training data, the mapping relationship between training data and predefined classes can be established[1]. In order to make the traditional classification algorithm classify data well without predetermined classification for learning semi-supervised or even unsupervised methods are used to improve the classification algorithm[2]. Literature [3] uses semi-supervised naive bayes classification algorithm to establish initial classification for a small number of data sets with class labels, and continuously updates the data with high classification accuracy to the training set when predicting and classifying the data without labels, so as to realize semi-supervised learning of data classification. However, this algorithm fails to fundamentally realize the unsupervised generation of class labels of data to be classified, and prior knowledge still plays a crucial role in the training of classification algorithm. Clustering is an unsupervised process in which the most similar objects are divided into a class based on the objects found in the data and their relationships[4,5]; Literature [6] applies unsupervised clustering to text clustering and constructs an automatic text classification model based on vector space model. However, the model is not suitable for the classification of continuous variables. 
Therefore, this paper combines the clustering algorithm with the gaussian naive bayes classification algorithm, and proposes an unsupervised classification model suitable for continuous variable data. In this method, small representative samples are extracted from large samples by information entropy theory, and prediction classes of observation data are generated by clustering algorithm as predefined target classes of classification algorithm, so that data are classified and prediction models are established without prior knowledge. Simulation results show that this model is efficient in classifying and processing new data, and only a small part of sample extraction is needed to train the classification model of the whole data, which greatly saves computing resources and time. Selection of Clustering Algorithm Classical clustering algorithm can be divided into hierarchical clustering algorithm, divide-based clustering algorithm and density-based clustering algorithm. The corresponding representative classical algorithms are k-means, condensed hierarchical clustering algorithm and DBSCAN. Clustering performance measurement measures the performance of clustering algorithms under different environments according to the accuracy, consistency and other indicators of various clustering algorithms for sample division. ARI index is used to measure the consistency between the data label calculated by the clustering algorithm and the original label. The expression is: International Conference on Modeling, Analysis, Simulation Technologies and Applications (MASTA 2019) Copyright © 2019, the Authors. Published by Atlantis Press. This is an open access article under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/). Advances in Intelligent Systems Research, volume 168\",\"PeriodicalId\":103896,\"journal\":{\"name\":\"Proceedings of the 2019 International Conference on Modeling, Analysis, Simulation Technologies and Applications (MASTA 2019)\",\"volume\":\"57 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"1900-01-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"14\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 2019 International Conference on Modeling, Analysis, Simulation Technologies and Applications (MASTA 2019)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.2991/masta-19.2019.67\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2019 International Conference on Modeling, Analysis, Simulation Technologies and Applications (MASTA 2019)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.2991/masta-19.2019.67","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 14

Abstract

A Gaussian naive Bayes data classification model based on a clustering algorithm is proposed for the fast recognition and classification of unknown continuous data for which little prior knowledge is available. First, representative samples are extracted from the unknown data according to an information entropy measure and clustered to generate class labels. Then, the mapping relationship between the data and the class labels is established with the Gaussian naive Bayes algorithm, and the classification model is obtained through training. Simulation results show that this unsupervised analysis process classifies new data well.

Introduction

Classification is an important part of data mining: by learning from training data, a mapping relationship between the training data and predefined classes can be established [1]. To let traditional classification algorithms classify data well when no predefined classes are available for learning, semi-supervised or even unsupervised methods are used to improve them [2]. Reference [3] uses a semi-supervised naive Bayes classification algorithm to build an initial classifier from a small number of labeled data sets and, when predicting and classifying unlabeled data, continually adds the data classified with high accuracy to the training set, thereby realizing semi-supervised learning for data classification. However, this approach does not fundamentally achieve unsupervised generation of class labels for the data to be classified, and prior knowledge still plays a crucial role in training the classifier. Clustering, in contrast, is an unsupervised process in which the most similar objects are grouped into one class based on the objects found in the data and the relationships among them [4,5]. Reference [6] applies unsupervised clustering to text clustering and constructs an automatic text classification model based on the vector space model, but that model is not suitable for classifying continuous variables.

Therefore, this paper combines a clustering algorithm with the Gaussian naive Bayes classification algorithm and proposes an unsupervised classification model suitable for continuous-variable data. In this method, a small representative sample is extracted from the large sample set using information entropy theory, and the predicted classes that the clustering algorithm assigns to the observed data serve as the predefined target classes of the classification algorithm, so that the data are classified and a prediction model is built without prior knowledge. Simulation results show that the model classifies and processes new data efficiently, and that only a small fraction of the samples needs to be extracted to train a classification model for the whole data set, which greatly saves computing resources and time.
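To make the training pipeline concrete, the sketch below shows one way such a model could be assembled in Python with scikit-learn, under stated assumptions: k-means stands in for the clustering step, GaussianNB for the classifier, and the paper's information-entropy-based sample extraction is approximated by a plain random subsample, since the exact extraction procedure is not reproduced here. All function and variable names are illustrative.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.naive_bayes import GaussianNB

def build_model(unlabeled_data, n_classes, subsample_size=500, seed=0):
    """Minimal sketch: cluster a small subsample to create pseudo-labels,
    then train a Gaussian naive Bayes classifier on them."""
    rng = np.random.default_rng(seed)
    # 1. Extract a small subsample from the large unlabeled set (the paper
    #    selects this subsample with an information entropy measure; a random
    #    draw is used here only as a placeholder).
    size = min(subsample_size, len(unlabeled_data))
    sample = unlabeled_data[rng.choice(len(unlabeled_data), size=size, replace=False)]

    # 2. Cluster the subsample; the cluster assignments act as class labels.
    pseudo_labels = KMeans(n_clusters=n_classes, n_init=10, random_state=seed).fit_predict(sample)

    # 3. Train the Gaussian naive Bayes classifier on the pseudo-labeled subsample.
    return GaussianNB().fit(sample, pseudo_labels)

# Usage: classify new continuous observations without any prior labels.
X = np.random.default_rng(1).normal(size=(10000, 4))  # placeholder continuous data
model = build_model(X, n_classes=3)
predictions = model.predict(X[:10])

Training on only the subsample is what yields the claimed savings in computing resources: the classifier generalizes the cluster structure learned from a small extract to the rest of the data set.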
Selection of Clustering Algorithm

Classical clustering algorithms can be divided into hierarchical, partition-based, and density-based algorithms, whose representative classical members are, respectively, agglomerative hierarchical clustering, k-means, and DBSCAN. Clustering performance measures assess how well different clustering algorithms partition the samples in different settings, according to indicators such as accuracy and consistency. The adjusted Rand index (ARI) is used here to measure the consistency between the labels computed by the clustering algorithm and the original labels; its expression is given below.
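In the standard formulation (presumably the one intended by the authors), the ARI is computed from the contingency table between the two labelings, where n_{ij} counts the samples assigned to cluster i by the algorithm and to class j by the reference labeling, with row sums a_i, column sums b_j, and n samples in total:

\[
\mathrm{ARI} = \frac{\sum_{ij}\binom{n_{ij}}{2} - \Bigl[\sum_i \binom{a_i}{2}\sum_j \binom{b_j}{2}\Bigr]\Big/\binom{n}{2}}
{\tfrac{1}{2}\Bigl[\sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2}\Bigr] - \Bigl[\sum_i \binom{a_i}{2}\sum_j \binom{b_j}{2}\Bigr]\Big/\binom{n}{2}}
\]

ARI equals 1 when the two partitions agree perfectly and is close to 0 for random labelings; in practice it can be computed directly with sklearn.metrics.adjusted_rand_score.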