Gaussian Naive Bayesian Data Classification Model Based on Clustering Algorithm
Authors: Zeng-jun Bi, Yao-quan Han, Cai-quan Huang, Min Wang
Published in: Proceedings of the 2019 International Conference on Modeling, Analysis, Simulation Technologies and Applications (MASTA 2019)
DOI: https://doi.org/10.2991/masta-19.2019.67
Citations: 14
A Gaussian naive Bayes data classification model based on a clustering algorithm is proposed for fast recognition and classification of unknown continuous data for which little or no prior knowledge is available. First, representative samples are extracted from the unknown data according to an information entropy measure and clustered to generate class labels. Then the mapping between the data and the class labels is established with the Gaussian naive Bayes algorithm, and the classification model is obtained through training. Simulation results show that this unsupervised analysis process classifies new data well.

Introduction
Classification is an important part of data mining. By learning from training data, a mapping between the training data and predefined classes can be established [1]. To let traditional classification algorithms perform well when no predefined classes are available for learning, semi-supervised or even unsupervised methods are used to improve them [2]. Reference [3] uses a semi-supervised naive Bayes classification algorithm to build an initial classifier from a small data set with class labels and, while predicting and classifying the unlabeled data, continuously adds the data with high classification accuracy to the training set, thereby achieving semi-supervised learning for data classification. However, this algorithm does not fundamentally achieve unsupervised generation of class labels for the data to be classified, and prior knowledge still plays a crucial role in training the classifier.
Clustering is an unsupervised process that groups the most similar objects into a class based on the objects found in the data and their relationships [4,5]. Reference [6] applies unsupervised clustering to text and constructs an automatic text classification model based on the vector space model.
However, that model is not suitable for classifying continuous variables. This paper therefore combines a clustering algorithm with the Gaussian naive Bayes classification algorithm and proposes an unsupervised classification model suitable for continuous-variable data. In this method, a small representative sample is extracted from the large sample using information entropy theory, and a clustering algorithm generates predicted classes for the observed data, which serve as the predefined target classes of the classification algorithm, so that data are classified and a prediction model is built without prior knowledge. Simulation results show that the model classifies and processes new data efficiently, and that only a small extracted sample is needed to train a classification model for the whole data set, which greatly saves computing resources and time.

Selection of Clustering Algorithm
Classical clustering algorithms can be divided into partition-based, hierarchical, and density-based algorithms. The corresponding representative algorithms are k-means, agglomerative hierarchical clustering, and DBSCAN. Clustering performance measures evaluate how clustering algorithms perform in different settings according to indicators such as the accuracy and consistency of their partition of the samples. The adjusted Rand index (ARI) measures the consistency between the labels computed by a clustering algorithm and the original labels. The expression is:

International Conference on Modeling, Analysis, Simulation Technologies and Applications (MASTA 2019)
Copyright © 2019, the Authors. Published by Atlantis Press. This is an open access article under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).
Advances in Intelligent Systems Research, volume 168
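
The ARI expression announced in the Selection of Clustering Algorithm section is not reproduced in this excerpt. For orientation, the commonly used definition of the adjusted Rand index is sketched below. The symbols are assumptions introduced here and may differ from the authors' notation: RI is the Rand index, n is the number of samples, n_{ij} is the number of samples that fall in original class i and computed cluster j, and a_i and b_j are the corresponding row and column totals.

```latex
\mathrm{ARI}
  = \frac{\mathrm{RI} - \mathbb{E}[\mathrm{RI}]}{\max(\mathrm{RI}) - \mathbb{E}[\mathrm{RI}]}
  = \frac{\displaystyle \sum_{i,j} \binom{n_{ij}}{2}
          - \left[ \sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2} \right] \Big/ \binom{n}{2}}
         {\displaystyle \frac{1}{2} \left[ \sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2} \right]
          - \left[ \sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2} \right] \Big/ \binom{n}{2}}
```

Under this definition, ARI equals 1 when the two labelings describe the same partition and is close to 0 when the agreement is no better than chance.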
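
As an illustration of how the consistency between computed cluster labels and original labels can be scored, the following minimal sketch computes the standard adjusted Rand index from pair counts using only the Python standard library. The function name `adjusted_rand_index` is hypothetical, and the pair-counting form is assumed to be the one intended by the text, since the paper's own expression is not available in this excerpt.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Standard pair-counting ARI between two labelings of the same samples."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to compare; treated as perfect agreement

    # Contingency counts n_ij and the marginals a_i (rows) and b_j (columns).
    contingency = Counter(zip(labels_true, labels_pred))
    row_totals = Counter(labels_true)
    col_totals = Counter(labels_pred)

    sum_cells = sum(comb(c, 2) for c in contingency.values())
    sum_rows = sum(comb(c, 2) for c in row_totals.values())
    sum_cols = sum(comb(c, 2) for c in col_totals.values())

    expected = sum_rows * sum_cols / comb(n, 2)   # agreement expected by chance
    maximum = (sum_rows + sum_cols) / 2           # largest attainable agreement
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings are a single cluster
    return (sum_cells - expected) / (maximum - expected)


# Same partition with the label names swapped: the index is 1.0.
same_partition = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# Partially matching partitions give a value between 0 and 1.
partial_match = adjusted_rand_index([0, 0, 1, 2], [0, 0, 1, 1])
```

Because the index depends only on which samples are grouped together, it is unaffected by how the clusters happen to be numbered, which matters when cluster labels come from an unsupervised algorithm.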
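
The overall method outlined in the Introduction (extract a small sample, cluster it to obtain class labels, then train a Gaussian naive Bayes classifier on those labels) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the entropy-based sample selection is not specified in this excerpt, so a uniform random sample is used as a stand-in; k-means is chosen as the clustering step with a deterministic farthest-point seeding; and all names (`farthest_point_init`, `kmeans`, `GaussianNaiveBayes`) and the synthetic two-blob data are hypothetical.

```python
import math
import random
from statistics import fmean, pvariance


def farthest_point_init(points, k):
    """Deterministic seeding: start at points[0], then add the point farthest from the chosen centers."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers


def kmeans(points, k, iters=100):
    """Plain Lloyd iterations; returns one cluster label per point."""
    centers = farthest_point_init(points, k)
    labels = []
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # Keep the old center if a cluster ends up with no members.
            updated.append(tuple(fmean(col) for col in zip(*members)) if members else centers[c])
        if updated == centers:
            break
        centers = updated
    return labels


class GaussianNaiveBayes:
    """Per-class, per-feature Gaussian likelihoods combined with class priors."""

    def fit(self, X, y):
        self.stats = {}
        for c in set(y):
            rows = [x for x, lab in zip(X, y) if lab == c]
            means = [fmean(col) for col in zip(*rows)]
            # A small constant keeps the variance strictly positive.
            variances = [pvariance(col) + 1e-9 for col in zip(*rows)]
            self.stats[c] = (math.log(len(rows) / len(X)), means, variances)
        return self

    def predict(self, x):
        def log_posterior(c):
            log_prior, means, variances = self.stats[c]
            return log_prior + sum(
                -0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
                for xi, m, v in zip(x, means, variances)
            )
        return max(self.stats, key=log_posterior)


# Synthetic continuous data: two well-separated blobs (assumed for illustration only).
rng = random.Random(0)
blob_a = [(rng.gauss(0.0, 0.5), rng.gauss(0.0, 0.5)) for _ in range(200)]
blob_b = [(rng.gauss(10.0, 0.5), rng.gauss(10.0, 0.5)) for _ in range(200)]
data = blob_a + blob_b

sample = rng.sample(data, 40)            # random stand-in for the entropy-based selection
sample_labels = kmeans(sample, k=2)      # cluster labels act as the class labels
model = GaussianNaiveBayes().fit(sample, sample_labels)
predictions = [model.predict(x) for x in data]  # classify the whole data set
```

Only the 40 sampled points are clustered and used for training, while all 400 points are classified afterwards, which mirrors the claim that a small extracted sample suffices to train a model for the whole data set. The cluster labels are arbitrary integers, so the classifier's output identifies groups rather than named classes.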