基于hadoop的海量短文本聚类算法研究

... International Workshop on Pattern Recognition in NeuroImaging. International Workshop on Pattern Recognition in NeuroImaging Pub Date : 2019-07-31 DOI:10.1117/12.2540380

Qiang Zhao, Yuliang Shi, Zepeng Qing

{"title":"基于hadoop的海量短文本聚类算法研究","authors":"Qiang Zhao, Yuliang Shi, Zepeng Qing","doi":"10.1117/12.2540380","DOIUrl":null,"url":null,"abstract":"Many clustering algorithms work well on small data sets of less than 200 data objects. However, a large database may contain millions of objects, and clustering on such a large data set may lead to biased results. As data volumes and availability continue to grow, so does the need for large dataset analytics. Among the most commonly used clustering algorithms, K-means proved to be one of the most popular choices to provide acceptable results in a reasonable amount of time. In this paper, we present an improved k-means algorithm with better initial centroids. Also, we implement this modified algorithm on Hadoop platform. Experiments show that the improved k-means algorithm converges faster than the classic k-means and the average execution time is reduced compared to the traditional k-means.","PeriodicalId":90079,"journal":{"name":"... International Workshop on Pattern Recognition in NeuroImaging. International Workshop on Pattern Recognition in NeuroImaging","volume":"26 1","pages":"111980A - 111980A-6"},"PeriodicalIF":0.0000,"publicationDate":"2019-07-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":"{\"title\":\"Research on Hadoop-based massive short text clustering algorithm\",\"authors\":\"Qiang Zhao, Yuliang Shi, Zepeng Qing\",\"doi\":\"10.1117/12.2540380\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Many clustering algorithms work well on small data sets of less than 200 data objects. However, a large database may contain millions of objects, and clustering on such a large data set may lead to biased results. As data volumes and availability continue to grow, so does the need for large dataset analytics. Among the most commonly used clustering algorithms, K-means proved to be one of the most popular choices to provide acceptable results in a reasonable amount of time. In this paper, we present an improved k-means algorithm with better initial centroids. Also, we implement this modified algorithm on Hadoop platform. Experiments show that the improved k-means algorithm converges faster than the classic k-means and the average execution time is reduced compared to the traditional k-means.\",\"PeriodicalId\":90079,\"journal\":{\"name\":\"... International Workshop on Pattern Recognition in NeuroImaging. International Workshop on Pattern Recognition in NeuroImaging\",\"volume\":\"26 1\",\"pages\":\"111980A - 111980A-6\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-07-31\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"5\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"... International Workshop on Pattern Recognition in NeuroImaging. International Workshop on Pattern Recognition in NeuroImaging\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1117/12.2540380\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"... International Workshop on Pattern Recognition in NeuroImaging. International Workshop on Pattern Recognition in NeuroImaging","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1117/12.2540380","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 5

摘要

许多聚类算法在少于200个数据对象的小数据集上工作得很好。然而，大型数据库可能包含数百万个对象，对如此大的数据集进行聚类可能会导致有偏差的结果。随着数据量和可用性的不断增长，对大型数据集分析的需求也在不断增长。在最常用的聚类算法中，K-means被证明是在合理的时间内提供可接受结果的最流行的选择之一。本文提出了一种改进的k-means算法，该算法具有更好的初始质心。并在Hadoop平台上实现了改进后的算法。实验表明，改进的k-means算法收敛速度比经典的k-means算法快，平均执行时间比传统的k-means算法缩短。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Research on Hadoop-based massive short text clustering algorithm

Many clustering algorithms work well on small data sets of less than 200 data objects. However, a large database may contain millions of objects, and clustering on such a large data set may lead to biased results. As data volumes and availability continue to grow, so does the need for large dataset analytics. Among the most commonly used clustering algorithms, K-means proved to be one of the most popular choices to provide acceptable results in a reasonable amount of time. In this paper, we present an improved k-means algorithm with better initial centroids. Also, we implement this modified algorithm on Hadoop platform. Experiments show that the improved k-means algorithm converges faster than the classic k-means and the average execution time is reduced compared to the traditional k-means.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

... International Workshop on Pattern Recognition in NeuroImaging. International Workshop on Pattern Recognition in NeuroImaging

自引率

0.00%

发文量