基于密度的Twitter空间文本聚类

Minh D. Nguyen, Won-Yong Shin
{"title":"基于密度的Twitter空间文本聚类","authors":"Minh D. Nguyen, Won-Yong Shin","doi":"10.1145/3110025.3110096","DOIUrl":null,"url":null,"abstract":"Density-based spatial clustering of applications with noise (DBSCAN) is the most commonly used density-based clustering algorithm, where it can discover multiple clusters with arbitrary shapes. DBSCAN works properly when the input data type is homogeneous, but the DBSCAN's approach may not be sufficient when the input dataset has textual heterogeneity (e.g., when we intend to find clusters from geo-tagged posts on social media relevant to a certain point-of-interest (POI)), thus leading to poor performance. In this paper, we present DBSTexC, a new density-based clustering algorithm using spatio--textual information on Twitter. We first define POI-relevant and POI-irrelevant tweets as the records that contain and do not contain a POI name or its coherent variations, respectively. By taking into account the fractions of POI-relevant and POI-irrelevant tweets, our DBSTexC algorithm shows a much higher clustering quality than the DBSCAN case in terms of the F1 score and its variants. DBSTexC can be thought of as a generalized version of DBSCAN due to the findings that it performs identically as DBSCAN when the inputs are homogeneous and far outperforms DBSCAN when the heterogeneous input data type is given.","PeriodicalId":399660,"journal":{"name":"Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2017","volume":"8 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-07-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"12","resultStr":"{\"title\":\"DBSTexC: Density-Based Spatio-Textual Clustering on Twitter\",\"authors\":\"Minh D. Nguyen, Won-Yong Shin\",\"doi\":\"10.1145/3110025.3110096\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Density-based spatial clustering of applications with noise (DBSCAN) is the most commonly used density-based clustering algorithm, where it can discover multiple clusters with arbitrary shapes. DBSCAN works properly when the input data type is homogeneous, but the DBSCAN's approach may not be sufficient when the input dataset has textual heterogeneity (e.g., when we intend to find clusters from geo-tagged posts on social media relevant to a certain point-of-interest (POI)), thus leading to poor performance. In this paper, we present DBSTexC, a new density-based clustering algorithm using spatio--textual information on Twitter. We first define POI-relevant and POI-irrelevant tweets as the records that contain and do not contain a POI name or its coherent variations, respectively. By taking into account the fractions of POI-relevant and POI-irrelevant tweets, our DBSTexC algorithm shows a much higher clustering quality than the DBSCAN case in terms of the F1 score and its variants. DBSTexC can be thought of as a generalized version of DBSCAN due to the findings that it performs identically as DBSCAN when the inputs are homogeneous and far outperforms DBSCAN when the heterogeneous input data type is given.\",\"PeriodicalId\":399660,\"journal\":{\"name\":\"Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2017\",\"volume\":\"8 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2017-07-31\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"12\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2017\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3110025.3110096\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2017","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3110025.3110096","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 12

摘要

基于密度的噪声空间聚类(DBSCAN)是最常用的基于密度的聚类算法,它可以发现具有任意形状的多个聚类。当输入数据类型同质时,DBSCAN可以正常工作,但是当输入数据集具有文本异质性时(例如,当我们打算从社交媒体上与某个兴趣点(POI)相关的地理标记帖子中查找集群时),DBSCAN的方法可能不够用,从而导致性能不佳。在本文中,我们提出了DBSTexC,一种新的基于密度的聚类算法,使用Twitter上的空间文本信息。我们首先将与POI相关和与POI无关的tweet分别定义为包含和不包含POI名称或其连贯变体的记录。通过考虑与poi相关和与poi无关的tweet的比例,我们的DBSTexC算法在F1分数及其变体方面显示出比DBSCAN情况更高的聚类质量。DBSTexC可以被认为是DBSCAN的一个通用版本,因为我们发现,当输入是同构的时,它的执行与DBSCAN相同,而当给定异构输入数据类型时,它的执行远远优于DBSCAN。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
DBSTexC: Density-Based Spatio-Textual Clustering on Twitter
Density-based spatial clustering of applications with noise (DBSCAN) is the most commonly used density-based clustering algorithm, where it can discover multiple clusters with arbitrary shapes. DBSCAN works properly when the input data type is homogeneous, but the DBSCAN's approach may not be sufficient when the input dataset has textual heterogeneity (e.g., when we intend to find clusters from geo-tagged posts on social media relevant to a certain point-of-interest (POI)), thus leading to poor performance. In this paper, we present DBSTexC, a new density-based clustering algorithm using spatio--textual information on Twitter. We first define POI-relevant and POI-irrelevant tweets as the records that contain and do not contain a POI name or its coherent variations, respectively. By taking into account the fractions of POI-relevant and POI-irrelevant tweets, our DBSTexC algorithm shows a much higher clustering quality than the DBSCAN case in terms of the F1 score and its variants. DBSTexC can be thought of as a generalized version of DBSCAN due to the findings that it performs identically as DBSCAN when the inputs are homogeneous and far outperforms DBSCAN when the heterogeneous input data type is given.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信