A Code-Diverse Tulu-English Dataset For NLP Based Sentiment Analysis Applications

Prashanth Kannadaguli
Published in: 2021 Advanced Communication Technologies and Signal Processing (ACTS)
Publication date: 2021-12-15
DOI: 10.1109/ACTS53447.2021.9708241 (https://doi.org/10.1109/ACTS53447.2021.9708241)
Citations: 3

Abstract

With the growing use of social media, interest in Natural Language Processing (NLP) of textual content has risen. Code-switching is a ubiquitous phenomenon in multilingual nations: social communication often mixes a low-resource language with a high-resource language in the same text, mostly written in a non-native script. Processing such code-switched text is essential to support NLP tasks such as Machine Translation, Automated Conversational Systems and Sentiment Analysis (SA). The primary objective of SA is to identify and analyze the attitude, opinion, emotion or sentiment expressed in a dataset. Although many systems have been trained on monolingual datasets, they break down on code-diverse data because of the added complexity of mixing at various levels of the text. Few resources exist for modelling such code-mixed data, and Machine Learning or Deep Learning algorithms that use supervised learning yield better results than unsupervised learning. Such datasets are available for Hindi-English, Tamil-English, Malayalam-English, Bengali-English, German-English, Spanish-English, Japanese-English and Arabic-English, among others. Our research focuses on NLP for emotion and sentiment detection in Tulu, a vibrant South Indian language. As a starting point, we build the first platinum standard corpus for NLP applications of code-diverse Tulu-English text, as no such resource exists for our native language. A Krippendorff's Alpha of 0.9, a measure of agreement among annotators, indicates that the dataset can serve as a benchmark for developing an Automatic Sentiment Analysis system for Tulu.
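For readers unfamiliar with the agreement statistic cited in the abstract, the following is a minimal sketch of Krippendorff's Alpha for nominal labels. It is not the author's code: the abstract does not state the label set, the number of annotators, or the tooling used, so the function name `krippendorff_alpha_nominal`, the labels, and the sample ratings below are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's Alpha for nominal data.

    units: one list per annotated item, holding the label each annotator
    assigned to that item (None marks a missing rating).
    """
    # Coincidence matrix: each ordered pair of labels given to the same
    # item by two different annotators contributes 1 / (m - 1), where m
    # is the number of ratings that item received.
    coincidences = Counter()
    for labels in units:
        present = [label for label in labels if label is not None]
        m = len(present)
        if m < 2:
            continue  # items with fewer than two ratings carry no pairs
        for a, b in permutations(present, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per label and the overall number of pairable ratings.
    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one item rated by two annotators")

    # Observed and expected disagreement (both scaled by n, which cancels).
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one label occurs")
    return 1.0 - observed / expected


if __name__ == "__main__":
    # Hypothetical ratings: three annotators, five code-mixed comments.
    sample = [
        ["positive", "positive", "positive"],
        ["negative", "negative", "mixed"],
        ["negative", "negative", "negative"],
        ["mixed", "mixed", None],
        ["positive", "positive", "positive"],
    ]
    print(f"alpha = {krippendorff_alpha_nominal(sample):.3f}")
```

A value of 1 means perfect agreement and a value near 0 means agreement no better than chance, so a reported value of 0.9 corresponds to annotators disagreeing on only a small share of items.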