An enhanced cosine-based visual technique for the robust tweets data clustering

Int. J. Intell. Comput. Cybern. Pub Date: 2021-02-02 DOI: 10.1108/IJICC-10-2020-0151

K. Narasimhulu, K. MeenaAbarna, B. Sivakumar


Citations: 8


An enhanced cosine-based visual technique for the robust tweets data clustering

Purpose: The paper studies the multiple viewpoints required to obtain more informative similarity features among tweet documents, which helps achieve robust tweet data clustering results.

Design/methodology/approach: Let N be the number of tweet documents used for topic extraction. In the initial pre-processing step, unwanted text, punctuation and other symbols are removed, and tokenization and stemming are performed. A bag-of-features is determined for the tweets, and the tweets are then modelled with it during topic extraction. Approximate topic features are extracted for every tweet document. The resulting set of topic features of the N documents is treated as the multi-viewpoints. The key idea of the proposed work is to use these multi-viewpoints in the similarity feature computation. A figure in the paper illustrates the multi-viewpoint-based cosine similarity computation for five tweet documents (N = 5), with the corresponding documents defined in a projected space with five viewpoints, say v1, v2, v3, v4 and v5. For example, the similarity features between two documents (viewpoints v1 and v2) are computed with respect to the other three viewpoints (v3, v4 and v5), unlike the single viewpoint used in the traditional cosine metric.

Findings: The findings concern healthcare problems in tweets data. Topic models play a crucial role in the classification of health-related tweets by finding topics (or health clusters) instead of computing term frequency-inverse document frequency (TF-IDF) for unlabelled tweets.

Originality/value: Topic models play a crucial role in the classification of health-related tweets by finding topics (or health clusters) instead of computing TF-IDF for unlabelled tweets.
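The abstract does not give the exact formula, so the following is only a minimal sketch of one plausible reading of the multi-viewpoint idea: the similarity of two documents is taken as the average cosine of the angle between them as seen from each of the other documents. The function names, the averaging step and the sample topic-feature vectors are all assumptions made for illustration and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Standard cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def multi_viewpoint_cosine(docs, i, j):
    """Average cosine between docs[i] and docs[j], measured from every other
    document used as a viewpoint (assumed formulation, not the paper's)."""
    if i == j:
        return 1.0
    viewpoints = [h for h in range(len(docs)) if h not in (i, j)]
    if not viewpoints:
        # With only two documents there is no other viewpoint:
        # fall back to the traditional cosine measured from the origin.
        return cosine(docs[i], docs[j])
    total = 0.0
    for h in viewpoints:
        # Re-express both documents relative to viewpoint h.
        di = [a - b for a, b in zip(docs[i], docs[h])]
        dj = [a - b for a, b in zip(docs[j], docs[h])]
        total += cosine(di, dj)
    return total / len(viewpoints)


def similarity_matrix(docs):
    """Pairwise multi-viewpoint similarities for all documents."""
    n = len(docs)
    return [[multi_viewpoint_cosine(docs, i, j) for j in range(n)]
            for i in range(n)]


if __name__ == "__main__":
    # Hypothetical topic-feature vectors for N = 5 tweet documents (v1..v5).
    topic_features = [
        [0.8, 0.1, 0.1],
        [0.7, 0.2, 0.1],
        [0.1, 0.8, 0.1],
        [0.2, 0.7, 0.1],
        [0.1, 0.1, 0.8],
    ]
    for row in similarity_matrix(topic_features):
        print(" ".join(f"{value:6.3f}" for value in row))
```

In this sketch the similarity of v1 and v2 is computed from v3, v4 and v5 in turn, matching the five-document example in the abstract. The traditional cosine metric corresponds to the special case of a single viewpoint placed at the origin.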
