Comparison of clustering techniques for measuring similarity in articles

Usha Rani, Shashank Sahu
Published in: 2017 3rd International Conference on Computational Intelligence & Communication Technology (CICT)
Publication date: 2017-02-01
DOI: 10.1109/CIACT.2017.7977377 (https://doi.org/10.1109/CIACT.2017.7977377)
Citations: 12

Abstract

Clustering groups objects so that the members of each cluster are similar to one another. This paper focuses on two clustering techniques, hierarchical clustering and k-means clustering, and compares several similarity measures to find the best one. The work begins by selecting categories of textual content (articles). For each category, articles are selected from various news channels, and the search words most relevant to that category are identified. These words are given as input to a program that builds a matrix of words, which is then processed in MATLAB using the different similarity measures. The measures are compared using the cophenetic correlation coefficient and the silhouette value to identify the best one. Five categories are analysed, “Business”, “Education”, “Election”, “Entertainment” and “Game”, with 28 news articles selected per category from various news channels. The numbers of words selected for these categories are 35, 49, 25, 30 and 35 respectively. The study concludes that ‘Cityblock’ is the best measure for hierarchical clustering and ‘Correlation’ is the best for k-means clustering, with ‘Cityblock’ in second place for k-means.
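The pipeline described in the abstract can be illustrated with a minimal sketch. The paper's own processing was done in MATLAB and its code is not given here, so the following Python (standard library only) is a stand-in and not the authors' implementation. The toy articles, search words, cluster labels and function names are all invented for illustration, and counting word occurrences is only an assumption about how the matrix of words is built. The sketch covers the cityblock and correlation distances and the silhouette value; the cophenetic correlation coefficient is not covered.

```python
import math

def word_matrix(articles, search_words):
    """Assumed construction: count each search word in each article (rows = articles)."""
    rows = []
    for text in articles:
        tokens = text.lower().split()
        rows.append([tokens.count(w) for w in search_words])
    return rows

def cityblock(u, v):
    """Cityblock (Manhattan) distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))

def correlation(u, v):
    """Correlation distance: 1 minus the Pearson correlation of the two rows."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    du, dv = [a - mu for a in u], [b - mv for b in v]
    norm = math.sqrt(sum(a * a for a in du) * sum(b * b for b in dv))
    if norm == 0:
        raise ValueError("correlation distance is undefined for a constant row")
    return 1.0 - sum(a * b for a, b in zip(du, dv)) / norm

def silhouette_values(rows, labels, dist):
    """Silhouette value s(i) = (b - a) / max(a, b) for every row.

    a = mean distance to the other rows in the same cluster,
    b = smallest mean distance to the rows of any other cluster.
    """
    values = []
    for i, row in enumerate(rows):
        by_cluster = {}
        for j, other in enumerate(rows):
            if j != i:
                by_cluster.setdefault(labels[j], []).append(dist(row, other))
        own = by_cluster.pop(labels[i], [])
        if not own or not by_cluster:
            values.append(0.0)  # convention: singleton cluster or only one cluster
            continue
        a = sum(own) / len(own)
        b = min(sum(d) / len(d) for d in by_cluster.values())
        values.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return values

# Hypothetical toy data: two "Business"-like and two "Education"-like articles.
articles = [
    "stock market profit stock",
    "market profit profit bank",
    "exam school student exam",
    "student school school exam",
]
search_words = ["stock", "market", "profit", "exam", "school", "student"]
labels = [0, 0, 1, 1]  # assumed cluster assignment, e.g. produced by k-means

matrix = word_matrix(articles, search_words)
sil_city = silhouette_values(matrix, labels, cityblock)
sil_corr = silhouette_values(matrix, labels, correlation)
print("mean silhouette (cityblock):  ", sum(sil_city) / len(sil_city))
print("mean silhouette (correlation):", sum(sil_corr) / len(sil_corr))
```

A higher mean silhouette value means the rows sit closer to their own cluster than to the nearest other cluster, so comparing the two printed means is one way to rank distance measures for a fixed clustering. On four toy articles the ranking carries no weight and says nothing about the paper's results.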