Multi-level K-means text clustering technique for topic identification for competitor intelligence
Authors: Swapnajit Chakraborti, S. Dey
Published in: 2016 IEEE Tenth International Conference on Research Challenges in Information Science (RCIS)
Publication date: 2016-06-01
DOI: 10.1109/RCIS.2016.7549332 (https://doi.org/10.1109/RCIS.2016.7549332)
Citations: 8
Abstract
The proliferation of the web as an easily accessible information resource has led many corporations to gather competitor intelligence from the internet. While collecting such information from the internet is easy, collating and structuring it for business decision makers is a real challenge. Text-clustering-based topic identification techniques are expected to be very useful for such applications. Using appropriate clustering techniques, the competitor intelligence corpus gathered from the web can be divided into topical groups, which makes analysis of this information comparatively easier for managers. This paper presents a study of the effectiveness of the standard K-means text clustering algorithm applied at multiple levels, in a top-down, divide-and-conquer fashion, to a competitor intelligence corpus created from publicly available web sources such as news, blogs, and research papers. The paper also demonstrates the capability of the Multi-level K-means (ML-KM) clustering technique to determine the optimal number of clusters as part of the clustering process. The cluster validity metric used to determine cluster quality is explained, along with other user-controlled configuration parameters. It is found empirically that the ML-KM technique also addresses one problem of stand-alone standard K-means (S-KM): its bias towards convex, spherical clusters, which results in bigger clusters subsuming smaller ones. This ability of ML-KM to detect smaller clusters makes it more suitable than stand-alone S-KM for clustering competitor-intelligence text corpora, where niche, smaller clusters can lead to important findings. Experimental results are presented for both ML-KM and stand-alone S-KM on the competitor intelligence corpus as well as the standard Reuters corpus.
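
The top-down, divide-and-conquer idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state which cluster validity metric or configuration parameters the paper uses, so the `spread` metric (mean distance to the centroid), the `max_spread`, `min_size`, and `max_depth` parameters, and the farthest-point seeding are all assumptions made for illustration. Small 2-D points stand in for document vectors.

```python
import math

def dist(p, q):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def centroid(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))

def spread(points):
    """Assumed validity metric: mean distance of the points to their centroid."""
    c = centroid(points)
    return sum(dist(p, c) for p in points) / len(points)

def kmeans(points, k, iters=20):
    """Plain K-means with deterministic farthest-point seeding (an assumption)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            groups[nearest].append(p)
        groups = [g for g in groups if g]  # drop clusters that received no points
        new_centers = [centroid(g) for g in groups]
        if new_centers == centers:  # assignments have stabilised
            break
        centers = new_centers
    return groups

def ml_kmeans(points, k=2, max_spread=1.0, min_size=2, depth=0, max_depth=5):
    """Re-cluster each cluster until it is tight enough; return the leaf clusters."""
    if len(points) < min_size or depth >= max_depth or spread(points) <= max_spread:
        return [points]
    parts = kmeans(points, k)
    if len(parts) < 2:  # could not be split further
        return [points]
    leaves = []
    for part in parts:
        leaves.extend(ml_kmeans(part, k, max_spread, min_size, depth + 1, max_depth))
    return leaves

# Toy data: two larger groups and one small "niche" group.
docs = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (0.2, 0.8),
        (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5),
        (0, 10), (0.4, 10.2)]

leaves = ml_kmeans(docs)
print(sorted(len(c) for c in leaves))  # [2, 5, 6]
```

In this toy run the number of leaf clusters is not supplied up front; it falls out of the stopping rule. A single flat `kmeans(docs, 2)` pass places the two-point group in the same cluster as a larger one, while the second level separates it. The example only illustrates the mechanism and says nothing about the results reported in the paper.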