Comparative Analysis of Improved Dirichlet Process Mixture Model

Impact Factor: 0.8 (JCR Q3, Multidisciplinary Sciences)
Lili Wu, P. Fam, Majid Khan Majahar Ali, Ying Tian, Mohd. Tahir Ismail, Siti Zulaikha Mohd Jamaludin
Journal: Malaysian Journal of Fundamental and Applied Sciences
Publication date: 2023-12-04
DOI: https://doi.org/10.11113/mjfas.v19n6.3062

Abstract

Advances in information technology mean that large volumes of data are generated every day in fields such as engineering, healthcare, finance, anomaly detection, image recognition, and artificial intelligence. Analyzing and classifying such data accurately is a challenge. Traditional clustering methods require the number of clusters to be specified in advance and are mostly distance-based, so they cannot effectively account for correlations between the different indicators of high-dimensional, multi-source data. Moreover, the number of clusters cannot adjust automatically when new data arrive. To improve cluster analysis of high-dimensional, multi-source data in a big data environment, this study uses non-parametric mixture models based on distribution clustering, which do not require the number of clusters to be specified and can update automatically with the data. By combining Principal Component Analysis (PCA) and t-Distributed Stochastic Neighbour Embedding (t-SNE) with the non-parametric Bayesian Dirichlet Process Mixture Model (DPMM), a Bayesian non-parametric PCA model (PCA-DPMM) and a Bayesian non-parametric t-SNE model (TSNE-DPMM) are proposed. Sampling uses the Chinese restaurant process of the DPMM, with a finite normal mixture distribution introduced. The clustering results on the iris dataset are compared and analyzed. DPMM and TSNE-DPMM both reach an accuracy of 0.97, while PCA-DPMM achieves a maximum accuracy of only 0.94. Across different numbers of iterations, the accuracy of TSNE-DPMM ranges from 0.92 to 0.97, that of DPMM from 0.66 to 0.97, and that of PCA-DPMM from 0.73 to 0.94. The proposed TSNE-DPMM therefore maintains accuracy while giving more stable clustering results. Future research can explore improving the model by incorporating deep learning algorithms, among other approaches, to further enhance its performance. Applying the TSNE-DPMM model to data analysis in other fields is another direction for future work. Through these efforts, the challenges of analyzing high-dimensional, multi-source data in a big data environment can be better addressed and valuable information extracted from the data.
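To illustrate why a Dirichlet process prior does not need the number of clusters to be fixed in advance, the following is a minimal sketch of a draw from the Chinese restaurant process prior alone. It is not the authors' sampler: it omits the normal likelihood and the PCA or t-SNE step. The function name `sample_crp`, the concentration value `alpha=1.0`, and the seed are assumptions made for illustration. The 150 customers correspond to the 150 samples in the iris dataset.

```python
import random
from collections import Counter


def sample_crp(n_customers, alpha, rng):
    """Draw one seating arrangement from a Chinese restaurant process prior.

    Hypothetical helper for illustration; alpha > 0 is the concentration
    parameter. Returns per-customer table indices and the table sizes.
    """
    assignments = []  # table index chosen by each customer, in arrival order
    table_sizes = []  # number of customers currently at each table
    for _ in range(n_customers):
        # An existing table k is chosen with weight n_k (its current size);
        # a new table is opened with weight alpha.
        weights = table_sizes + [alpha]
        table = rng.choices(range(len(weights)), weights=weights, k=1)[0]
        if table == len(table_sizes):
            table_sizes.append(1)  # open a new table (a new cluster)
        else:
            table_sizes[table] += 1
        assignments.append(table)
    return assignments, table_sizes


rng = random.Random(0)  # fixed seed (assumed) so the draw is reproducible
assignments, table_sizes = sample_crp(150, alpha=1.0, rng=rng)
print("number of clusters:", len(table_sizes))
print("cluster sizes:", table_sizes)
```

The number of occupied tables is an outcome of the draw, not an input. Larger values of `alpha` tend to open more tables. In a full DPMM, these prior weights would be multiplied by each cluster's likelihood for the data point before sampling.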