A Dataset and Classifier for Recognizing Social Media English

NUT@EMNLP Pub Date : 2017-09-01 DOI:10.18653/v1/W17-4408
Su Lin Blodgett, Johnny Wei, Brendan T. O'Connor
Citations: 23

Abstract

While language identification works well on standard texts, it performs much worse on social media language, in particular dialectal language—even for English. First, to support work on English language identification, we contribute a new dataset of tweets annotated for English versus non-English, with attention to ambiguity, code-switching, and automatic generation issues. It is randomly sampled from all public messages, avoiding biases towards pre-existing language classifiers. Second, we find that a demographic language model—which identifies messages with language similar to that used by several U.S. ethnic populations on Twitter—can be used to improve English language identification performance when combined with a traditional supervised language identifier. It increases recall with almost no loss of precision, including, surprisingly, for English messages written by non-U.S. authors. Our dataset and identifier ensemble are available online.
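
The abstract does not say how the demographic language model and the supervised identifier are combined. As a minimal sketch, assuming a simple "either component is confident" rule, the following shows why such an ensemble can raise recall: a message the supervised identifier misses can still be accepted when the demographic model scores it highly. The names `ensemble_is_english`, `langid_score`, `demographic_score`, and both threshold values are hypothetical and are not taken from the paper or its released code.

```python
from typing import Callable

# Hypothetical scorer type: maps a message to a probability-like score in [0, 1].
Scorer = Callable[[str], float]


def ensemble_is_english(
    message: str,
    langid_score: Scorer,
    demographic_score: Scorer,
    langid_threshold: float = 0.5,        # assumed value, not from the paper
    demographic_threshold: float = 0.9,   # assumed value, not from the paper
) -> bool:
    """Label a message English if either component is confident enough."""
    if langid_score(message) >= langid_threshold:
        return True
    # Fall back to the demographic model only when the identifier says no.
    return demographic_score(message) >= demographic_threshold


def precision_recall(predicted: list[bool], gold: list[bool]) -> tuple[float, float]:
    """Compute precision and recall for the positive (English) class."""
    tp = sum(p and g for p, g in zip(predicted, gold))
    fp = sum(p and not g for p, g in zip(predicted, gold))
    fn = sum(g and not p for p, g in zip(predicted, gold))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Because the fallback can only add positive predictions, recall cannot decrease under this rule. Precision is protected only to the extent that the demographic threshold is set high, which is the trade-off the abstract describes as "almost no loss of precision".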