NIRMAL: Automatic identification of software relevant tweets leveraging language model

2015 IEEE 22nd International Conference on Software Analysis, Evolution, and Reengineering (SANER) Pub Date : 2015-03-01 DOI:10.1109/SANER.2015.7081855

Abhishek Sharma, Yuan Tian, D. Lo

Citations: 32

Abstract

Twitter is one of the most widely used social media platforms today. It enables users to share and view short 140-character messages called “tweets”. About 284 million active users generate close to 500 million tweets per day. Such rapid generation of user-generated content at this scale results in information overload. Users who are interested in information related to a particular domain have limited means to filter out irrelevant tweets and tend to get lost in the huge amount of data they encounter. A recent study by Singer et al. found that software developers use Twitter to stay aware of industry trends, to learn from others, and to network with other developers. However, Singer et al. also reported that developers often find Twitter streams to contain too much noise, which is a barrier to the adoption of Twitter. In this paper, to help developers cope with noise, we propose a novel approach named NIRMAL, which automatically identifies software-relevant tweets from a collection or stream of tweets. Our approach is based on language modeling, which learns a statistical model from a training corpus (i.e., a set of documents). We use a subset of posts from StackOverflow, a programming question-and-answer site, as the training corpus to learn a language model. A corpus of tweets was then used to test the effectiveness of the trained language model. The tweets were sorted by the rank the model assigned to each tweet. The top 200 tweets were then manually analyzed to verify whether they were software related, and an accuracy score was calculated. The results show that decent accuracy scores can be achieved by several variants of NIRMAL, which indicates that NIRMAL can effectively identify software-related tweets from a huge corpus of tweets.
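The pipeline described in the abstract (train a language model on programming posts, rank tweets by the model's score, then measure accuracy over the top of the ranking) can be illustrated with a minimal sketch. The abstract does not specify the n-gram order, the smoothing method, or the exact scoring function, so everything below is an assumption made for illustration: a bigram model with add-one smoothing, ranking by ascending perplexity, and the hypothetical names `BigramModel`, `rank_tweets`, and `accuracy_at_k`. It is not the authors' implementation.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer (an assumption; the abstract does not
    # describe any preprocessing).
    return text.lower().split()


class BigramModel:
    """Bigram language model with add-one (Laplace) smoothing."""

    def __init__(self, corpus):
        self.context_counts = Counter()  # how often each token precedes another
        self.bigram_counts = Counter()   # how often each (prev, word) pair occurs
        vocab = set()
        for doc in corpus:
            tokens = ["<s>"] + tokenize(doc) + ["</s>"]
            vocab.update(tokens)
            self.context_counts.update(tokens[:-1])
            self.bigram_counts.update(zip(tokens, tokens[1:]))
        # Reserve one extra slot for words never seen in training.
        self.vocab_size = len(vocab) + 1

    def perplexity(self, text):
        # Lower perplexity means the text looks more like the training corpus.
        tokens = ["<s>"] + tokenize(text) + ["</s>"]
        log_prob = 0.0
        for prev, word in zip(tokens, tokens[1:]):
            numerator = self.bigram_counts[(prev, word)] + 1
            denominator = self.context_counts[prev] + self.vocab_size
            log_prob += math.log(numerator / denominator)
        return math.exp(-log_prob / (len(tokens) - 1))


def rank_tweets(model, tweets):
    # Most software-like (lowest perplexity) first.
    return sorted(tweets, key=model.perplexity)


def accuracy_at_k(ranked, labels, k):
    # Fraction of the top-k tweets that a human labeled as software related.
    top = ranked[:k]
    return sum(labels[t] for t in top) / len(top)


# Toy stand-ins for the StackOverflow training posts and the tweet corpus.
training_posts = [
    "how to fix null pointer exception in java",
    "python list comprehension syntax error",
    "how to parse json in python",
]
tweets = [
    "lovely sunny day at the beach",
    "how to fix syntax error in python",
]
manual_labels = {tweets[0]: False, tweets[1]: True}

model = BigramModel(training_posts)
ranked = rank_tweets(model, tweets)
print(ranked[0])
print(accuracy_at_k(ranked, manual_labels, k=1))
```

In the study itself the manual check covered the top 200 tweets; here `k` is a parameter so the same function works on the toy data.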
