Corpus-based Topic Derivation and Timestamp-based Popular Hashtag Prediction in Twitter

IF 1.1 · Zone 4 (Computer Science) · JCR Q4: COMPUTER SCIENCE, INFORMATION SYSTEMS

Journal of Information Science and Engineering Pub Date : 2019-05-01 DOI:10.6688/JISE.201905_35(3).0011

Sharath Kumar, Kuochen Wang, Shi-Min Shen

Pages: 675-696 · Publication type: Journal Article

Citations: 5

Abstract

With the use of the Internet, mobile platforms, online commerce, and social media services, the footprints of human behavior can be easily recorded in the digital world, which generates data on an extremely large scale. Twitter, as a big data social network, has become one of the most important sources for capturing up-to-date events happening in the world. Deriving topics from Twitter is important for various applications, such as situation awareness, market analysis, content filtering, and recommendations. However, topic derivation with high purity in Twitter is hard to achieve because tweets are limited to 140 characters, and previous work on topic derivation in Twitter suffers from low purity. In this paper, we propose a corpus-based topic derivation (CTD) approach that combines a Twitter corpus with LF-LDA, a text processing model, to identify topics and clusters of similar hashtags. We use asymmetric topic LF-LDA to obtain better purity of topics. Compared to intJNMF, a representative related work, the purity (F-measure) of our proposed CTD increases from 5.26% (27.81%) to 11.32% (34.28%) for 20 to 100 topics. We also propose a timestamp-based popular hashtag prediction (TPHP) approach by creating trending hashtag lists (THLs), which are lists of hashtags used by many users, and by making use of the timestamps in tweets. We use the edit distance to measure the difference between consecutive THLs. This difference is then used to calculate volatility, which indicates how people react to real-world events. Compared to Hybrid+, a representative related work, the mean average precision (MAP) of our TPHP increases by 19.45% (week-day), 15.08% (week-week), and 16.95% (month-week).
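The THL comparison described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the paper's exact THL construction or volatility formula, so everything below is an assumption made for illustration: the helper names (`trending_list`, `edit_distance`, `volatility`), ranking hashtags by the number of distinct users in a time window, and defining volatility as the edit distance divided by the longer list's length.

```python
from datetime import datetime


def trending_list(tweets, start, end, k=3):
    """Hypothetical THL: top-k hashtags by distinct users with start <= timestamp < end.

    `tweets` is an iterable of (user, timestamp, hashtags) tuples.
    """
    users_per_tag = {}
    for user, ts, tags in tweets:
        if start <= ts < end:
            for tag in set(tags):
                users_per_tag.setdefault(tag.lower(), set()).add(user)
    # Rank by number of distinct users (descending), ties broken alphabetically.
    ranked = sorted(users_per_tag, key=lambda t: (-len(users_per_tag[t]), t))
    return ranked[:k]


def edit_distance(a, b):
    """Levenshtein distance between two sequences, treating each hashtag as one symbol."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute x with y
        prev = curr
    return prev[-1]


def volatility(thl_a, thl_b):
    """Assumed definition: edit distance normalized by the longer list's length."""
    n = max(len(thl_a), len(thl_b))
    return edit_distance(thl_a, thl_b) / n if n else 0.0


# Toy data (invented for illustration, not taken from the paper).
tweets = [
    ("u1", datetime(2019, 5, 1, 10), ["#a", "#b"]),
    ("u2", datetime(2019, 5, 1, 11), ["#a"]),
    ("u3", datetime(2019, 5, 2, 9), ["#c"]),
]
day1 = trending_list(tweets, datetime(2019, 5, 1), datetime(2019, 5, 2))
day2 = trending_list(tweets, datetime(2019, 5, 2), datetime(2019, 5, 3))
print(day1, day2, volatility(day1, day2))
```

Under this assumed normalization, a volatility near 0 means the trending list barely changed between windows, while a value near 1 means it was almost entirely replaced.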


Source journal

Journal of Information Science and Engineering (Engineering & Technology: Computer Science, Information Systems)

CiteScore: 2.00

Self-citation rate: 0.00%

Review time: 8 months

About the journal: The Journal of Information Science and Engineering is dedicated to the dissemination of information on computer science, computer engineering, and computer systems. The journal encourages articles on original research in the areas of computer hardware, software, man-machine interface, theory, and applications, as well as tutorial papers in the above-mentioned areas and state-of-the-art papers on various aspects of computer systems and applications.