Multi-level K-means text clustering technique for topic identification for competitor intelligence

Swapnajit Chakraborti, S. Dey
DOI: 10.1109/RCIS.2016.7549332 (https://doi.org/10.1109/RCIS.2016.7549332)
Published in: 2016 IEEE Tenth International Conference on Research Challenges in Information Science (RCIS)
Publication date: 2016-06-01
Citations: 8

Abstract

The proliferation of the web as an easily accessible information resource has led many corporations to gather competitor intelligence from the internet. While collecting such information from the internet is easy, collating and structuring it for perusal by business decision makers is difficult. Text-clustering-based topic identification techniques are expected to be very useful for this application. Using appropriate clustering techniques, a competitor intelligence corpus gathered from the web can be divided into topical groups, which makes analysis of the information comparatively easier for managers. This paper presents a study of the effectiveness of the standard K-means text clustering algorithm applied at multiple levels, in a top-down, divide-and-conquer fashion, to a competitor intelligence corpus created from publicly available web sources such as news, blogs, and research papers. The paper also demonstrates the capability of the Multi-level K-means (ML-KM) clustering technique to determine the optimal number of clusters as part of the clustering process. The cluster validity metric used to determine cluster quality is explained, along with other user-controlled configuration parameters. It is found empirically that the ML-KM technique also addresses one problem of stand-alone standard K-means (S-KM): its bias towards convex, spherical clusters, which results in bigger clusters subsuming smaller ones. This ability to detect smaller clusters makes ML-KM more suitable than stand-alone S-KM for clustering competitor-intelligence text corpora, where niche, smaller clusters can lead to important findings. Experimental results are presented for both ML-KM and stand-alone S-KM on the competitor intelligence corpus as well as the standard Reuters corpus.
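The top-down, divide-and-conquer idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the cluster validity metric or the configuration parameters, so the `spread` function (mean distance to the centroid), the `max_spread` threshold, and all function names below are hypothetical stand-ins. Toy 2-D points take the place of document vectors such as TF-IDF rows.

```python
import math
import random

def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(points):
    """Component-wise centroid of a non-empty list of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def kmeans(points, k, seed=0, iters=50):
    """Plain Lloyd-style K-means; returns the non-empty clusters."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centers[i]))
            groups[nearest].append(p)
        new_centers = [mean(g) if g else centers[i] for i, g in enumerate(groups)]
        if new_centers == centers:
            break
        centers = new_centers
    return [g for g in groups if g]

def spread(points):
    """Assumed validity metric: mean distance to the centroid (lower is tighter)."""
    c = mean(points)
    return sum(dist(p, c) for p in points) / len(points)

def multi_level_kmeans(points, k=2, max_spread=1.0, depth=0, max_depth=10):
    """Re-cluster any cluster that is still too loose; return the leaf clusters."""
    if len(points) < k or depth >= max_depth or spread(points) <= max_spread:
        return [points]
    parts = kmeans(points, k, seed=depth)
    if len(parts) < 2:  # K-means could not split this cluster further
        return [points]
    leaves = []
    for part in parts:
        leaves.extend(multi_level_kmeans(part, k, max_spread, depth + 1, max_depth))
    return leaves

# Toy data: one larger group and two small "niche" groups.
docs = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2), (0.2, 0.4), (0.4, 0.1),
        (10.0, 0.0), (10.2, 0.1),
        (0.0, 10.0), (0.1, 10.2)]
leaves = multi_level_kmeans(docs, k=2, max_spread=1.0)
for leaf in leaves:
    print(len(leaf), round(spread(leaf), 3))
```

In this sketch the number of leaf clusters is not fixed in advance; it falls out of the recursion and the threshold, which mirrors the abstract's claim that the number of clusters is determined as part of the clustering process. A small group that a single K-means pass merges into a larger neighbour gets another chance to separate at the next level.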