Scalable distributed first story detection using storm for twitter data

Mahesh G. Huddar, Manjula M. Ramannavar, N. Sidnal
{"title":"Scalable distributed first story detection using storm for twitter data","authors":"Mahesh G. Huddar, Manjula M. Ramannavar, N. Sidnal","doi":"10.1109/ICAETR.2014.7012915","DOIUrl":null,"url":null,"abstract":"Twitter is an online service that enables users to read and post tweets; thereby providing a wealth of information regarding breaking news stories. The problem of First Story Detection is to identify first stories about different events from streaming documents. The Locality sensitive hashing algorithm is the traditional approach used for First Story Detection. The documents have a high degree of lexical variation which makes First Story Detection a very difficult task. This work uses Twitter as the data source to address the problem of real-time First Story Detection. As twitter data contains a lot of spam, we built a dictionary of words to remove spam from the tweets. Further since the Twitter streaming data rate is high, we cannot use traditional Locality sensitive hashing algorithm to detect the first stories. We modify the Locality sensitive hashing algorithm to overcome this limitation while maintaining reasonable accuracy with improved performance. Also, we use Storm distributed platform, so that the system benefits from the robustness, scalability and efficiency that this framework offers.","PeriodicalId":196504,"journal":{"name":"2014 International Conference on Advances in Engineering & Technology Research (ICAETR - 2014)","volume":"43 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2014 International Conference on Advances in Engineering & Technology Research (ICAETR - 2014)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICAETR.2014.7012915","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 4

Abstract

Twitter is an online service that enables users to read and post tweets; thereby providing a wealth of information regarding breaking news stories. The problem of First Story Detection is to identify first stories about different events from streaming documents. The Locality sensitive hashing algorithm is the traditional approach used for First Story Detection. The documents have a high degree of lexical variation which makes First Story Detection a very difficult task. This work uses Twitter as the data source to address the problem of real-time First Story Detection. As twitter data contains a lot of spam, we built a dictionary of words to remove spam from the tweets. Further since the Twitter streaming data rate is high, we cannot use traditional Locality sensitive hashing algorithm to detect the first stories. We modify the Locality sensitive hashing algorithm to overcome this limitation while maintaining reasonable accuracy with improved performance. Also, we use Storm distributed platform, so that the system benefits from the robustness, scalability and efficiency that this framework offers.
使用storm对twitter数据进行可扩展的分布式第一故事检测
Twitter是一项在线服务,用户可以阅读和发布推文;从而提供关于突发新闻故事的丰富信息。第一故事检测的问题是从流文档中识别关于不同事件的第一故事。局部敏感哈希算法是传统的第一故事检测方法。这些文件有高度的词汇变化,这使得第一故事检测成为一项非常困难的任务。这项工作使用Twitter作为数据源来解决实时第一故事检测的问题。由于twitter数据包含大量垃圾邮件,我们构建了一个单词字典来从tweet中删除垃圾邮件。此外,由于Twitter流数据率很高,我们无法使用传统的Locality敏感散列算法来检测首个故事。我们修改了位置敏感散列算法来克服这一限制,同时在提高性能的同时保持合理的精度。此外,我们使用Storm分布式平台,使系统受益于该框架提供的健壮性,可扩展性和效率。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信