A robust algorithm for determining the newsworthiness of microblogs

2015 Fifteenth International Conference on Advances in ICT for Emerging Regions (ICTer). Pub Date: 2015-08-01. DOI: 10.1109/ICTER.2015.7377679

P. K. K. Madhawa, Ajantha S Atukorale


Citations: 2

Abstract

Microblogging platforms such as Twitter have become a primary medium for people to share their experiences and opinions on a broad range of topics. Because posts on Twitter are publicly viewable by default, Twitter can be used to get up-to-date information on events such as natural disasters, disease outbreaks, or sports events. Building a cohesive summary from tweets about long-running events is a problem of interest to the research community. However, the abundance of tweets expressing user opinions and sentiments towards a topic makes it necessary to extract the newsworthy tweets from a large stream of tweets on a single topic. Most methods for doing so require large hand-labeled corpora to train the model, which is not practical for a rapidly updating medium like Twitter. In this paper we address this problem by introducing a novel heuristic-based annotation scheme that generates the training dataset for the system. A hand-labeled corpus of tweets is used only for benchmarking the objectivity classifier. Our classifier achieved an F1-score of 80% on a manually annotated gold-standard dataset.
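The abstract does not spell out the heuristics or the evaluation code, so the following is only a minimal sketch of the general idea: label tweets automatically with simple rules, then score predictions against a hand-labeled gold set using F1. The cue list, the link-based rule, and the names `SUBJECTIVE_CUES`, `heuristic_label`, and `f1_score` are all assumptions made for illustration and are not taken from the paper.

```python
import re
from typing import Optional, Sequence

# Hypothetical cue words suggesting a subjective (opinion) tweet.
# These are illustrative assumptions, not the heuristics used in the paper.
SUBJECTIVE_CUES = {"i", "my", "me", "love", "hate", "think", "feel", "lol", "omg"}


def heuristic_label(tweet: str) -> Optional[str]:
    """Return 'news', 'opinion', or None when no rule fires.

    Assumed rules: a link without subjective cues suggests news;
    subjective cues without a link suggest opinion.
    """
    tokens = re.findall(r"[a-z']+", tweet.lower())
    has_link = "http://" in tweet or "https://" in tweet
    has_cue = any(tok in SUBJECTIVE_CUES for tok in tokens)
    if has_cue and not has_link:
        return "opinion"
    if has_link and not has_cue:
        return "news"
    return None  # ambiguous tweets are left out of the training set


def f1_score(gold: Sequence[str], predicted: Sequence[str],
             positive: str = "news") -> float:
    """F1 = 2PR / (P + R) for the positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

In a setup like the one the abstract describes, tweets labeled this way would serve as training data, while the hand-labeled corpus would be reserved for computing the F1-score.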


