Text classification for automatic detection of alcohol use-related tweets: A feasibility study

Y. Aphinyanaphongs, Bisakha Ray, A. Statnikov, P. Krebs
Published in: Proceedings of the 2014 IEEE 15th International Conference on Information Reuse and Integration (IEEE IRI 2014)
Publication date: 2014-08-01
DOI: 10.1109/IRI.2014.7051877 (https://doi.org/10.1109/IRI.2014.7051877)
Citations: 25

Abstract

We present a feasibility study that uses text classification to identify tweets about alcohol use. Alcohol is the most widely used substance in the US and is the leading risk factor for premature morbidity and mortality globally. Understanding use patterns and locations is an important step toward prevention, moderation, and control of alcohol outlets. Social media may provide an alternative way to measure alcohol use in real time. This study explores text classification methods for identifying alcohol use tweets. We labeled 34,563 geo-located New York City tweets collected in a 24-hour period over New Year's Day 2012. We preprocessed the tweets into stemmed/unstemmed and unigram/bigram representations. We then applied multinomial naïve Bayes, a linear SVM, Bayesian logistic regression, and random forests to the classification task. Under 10-fold cross-validation, the algorithms achieved areas under the receiver operating characteristic curve (AUC) of 0.66, 0.91, 0.93, and 0.94, respectively. We also compared the classifiers with a human-constructed Boolean search over the same tweets; the text classification approach is competitive with this hand-crafted search. In conclusion, we show that automatically identifying alcohol-related tweets is highly feasible, which paves the way for future research to improve these classifiers.
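
The sketch below illustrates three ideas named in the abstract: a unigram/bigram representation, a Boolean keyword search used as a baseline, and the area under the ROC curve as the evaluation measure. It is not the authors' code. The whitespace tokenizer, the keyword list, the example tweets, and the function names `ngram_counts`, `boolean_search`, and `roc_auc` are all assumptions made for illustration. Stemming, classifier training, and cross-validation are omitted.

```python
from collections import Counter


def ngram_counts(text, n_max=2):
    """Count unigrams up to n_max-grams after lowercasing and whitespace splitting.

    Assumption: a plain whitespace tokenizer; the abstract does not say
    how tweets were tokenized.
    """
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def boolean_search(text, terms=("beer", "wine", "drunk")):
    """Score 1.0 if any keyword occurs as a token, else 0.0.

    The keyword list is hypothetical, not the search used in the study.
    """
    tokens = set(text.lower().split())
    return 1.0 if any(term in tokens for term in terms) else 0.0


def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random
    negative, with ties counted as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up example tweets with made-up labels (1 = alcohol-related).
tweets = [
    "Happy new year drinking beer with friends",
    "Stuck on the subway again",
    "So drunk right now",
    "New year new gym membership",
    "Hungover after last night",
]
labels = [1, 0, 1, 0, 1]

features = [ngram_counts(t) for t in tweets]
scores = [boolean_search(t) for t in tweets]
auc = roc_auc(labels, scores)
print(f"bigram 'new year' in tweet 0: {features[0]['new year']}")
print(f"Boolean search AUC on toy data: {auc:.3f}")
```

In this toy data the keyword search misses the "Hungover" tweet because none of its tokens match a keyword. That is the kind of gap a classifier trained on n-gram features might close, though the example is too small to show it.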