Discriminating between Brazilian and European Portuguese National Varieties on Twitter Texts

D. Castro, E. Souza, Adriano Oliveira
Published in: 2016 5th Brazilian Conference on Intelligent Systems (BRACIS), 2016-10-01
DOI: 10.1109/BRACIS.2016.056 (https://doi.org/10.1109/BRACIS.2016.056)
Citations: 11

Abstract

Twitter is one of the most widely used social media platforms, with users generating about 1 million messages per day. As the microblog has expanded, the languages its users write in have diversified, and many studies have aimed at identifying the language of tweets. The third most used language on Twitter is Portuguese, a pluricentric language with two national standard varieties: Brazilian Portuguese and European Portuguese. Identifying a language variety may benefit various Natural Language Processing tasks, but it is still regarded as one of the bottlenecks in this area, especially when combined with another bottleneck: language identification on short texts. Given these challenges, this paper provides a current view of the automatic discrimination of the two main Portuguese varieties in Twitter texts, applying an established approach with different techniques and features to find the best configuration for the problem. The best result was an accuracy of 0.9271, obtained with an ensemble method that combines character 6-grams with word unigrams and bigrams.
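The abstract names the features (character 6-grams, word unigrams and bigrams) and says they are combined in an ensemble, but it does not state which classifiers or which combination rule were used. The following is a minimal sketch under stated assumptions: one multinomial Naive Bayes model per feature view, combined by summing log scores. The function and class names (`char_ngrams`, `word_ngrams`, `NaiveBayes`, `ensemble_predict`) and the four toy sentences are illustrative only and are not taken from the paper.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=6):
    """Return overlapping character n-grams of the lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, sizes=(1, 2)):
    """Return word n-grams (unigrams and bigrams by default)."""
    tokens = text.lower().split()
    grams = []
    for n in sizes:
        grams += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return grams


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (assumed classifier)."""

    def __init__(self, extract):
        self.extract = extract               # feature function for this view
        self.counts = defaultdict(Counter)   # label -> feature counts
        self.docs = Counter()                # label -> number of documents
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            feats = self.extract(text)
            self.counts[label].update(feats)
            self.docs[label] += 1
            self.vocab.update(feats)
        return self

    def log_scores(self, text):
        """Return an unnormalized log score for each label."""
        total_docs = sum(self.docs.values())
        scores = {}
        for label in self.docs:
            denom = sum(self.counts[label].values()) + len(self.vocab)
            score = math.log(self.docs[label] / total_docs)
            for feat in self.extract(text):
                score += math.log((self.counts[label][feat] + 1) / denom)
            scores[label] = score
        return scores


def ensemble_predict(models, text):
    """Sum log scores across feature views (assumed combination rule)."""
    combined = defaultdict(float)
    for model in models:
        for label, score in model.log_scores(text).items():
            combined[label] += score
    return max(combined, key=combined.get)


# Toy training data, invented for illustration; not from the paper's corpus.
texts = [
    "você está fazendo o quê agora",
    "a gente pegou o ônibus hoje",
    "tu estás a fazer o quê agora",
    "apanhámos o autocarro hoje",
]
labels = ["pt-BR", "pt-BR", "pt-PT", "pt-PT"]

models = [
    NaiveBayes(char_ngrams).fit(texts, labels),   # character 6-gram view
    NaiveBayes(word_ngrams).fit(texts, labels),   # word unigram + bigram view
]

print(ensemble_predict(models, "o ônibus"))
print(ensemble_predict(models, "o autocarro"))
```

On this toy data the two views agree, so the combined prediction follows the lexical cue in each input ("ônibus" versus "autocarro"). On real tweets the character view would likely matter most for spelling and inflection differences that the word view misses, which is one plausible reason to combine them.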