A Sentiment Analysis Dataset for Code-Mixed Malayalam-English

Workshop on Spoken Language Technologies for Under-resourced Languages. Pub Date: 2020-05-11. DOI: 10.5281/ZENODO.4015234

Bharathi Raja Chakravarthi, Navya Jose, Shardul Suryawanshi, E. Sherly, John P. McCrae


Citations: 191

Abstract

There is an increasing demand for sentiment analysis of social media text, which is mostly code-mixed. Systems trained on monolingual data fail on code-mixed data because of the complexity of mixing at different levels of the text. However, very few resources are available for building models specific to code-mixed data. Although much research in multilingual and cross-lingual sentiment analysis has used semi-supervised or unsupervised methods, supervised methods still perform better. Only a few datasets are available, for popular language pairs such as English-Spanish, English-Hindi, and English-Chinese. There are no resources available for Malayalam-English code-mixed data. This paper presents a new gold standard corpus for sentiment analysis of code-mixed Malayalam-English text, annotated by voluntary annotators. The corpus obtained a Krippendorff’s alpha above 0.8. We use this new corpus to provide a benchmark for sentiment analysis of Malayalam-English code-mixed text.
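The abstract reports inter-annotator agreement as Krippendorff’s alpha. As a rough illustration of how that statistic is computed, here is a minimal sketch for nominal labels. It assumes sentiment is treated as a nominal variable; the function name `krippendorff_alpha_nominal` and the labels in the usage example are hypothetical and are not taken from the paper or its data.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    `units` is a list of items; each item is the list of labels given by
    the annotators, with None marking a missing annotation.
    """
    # Build the coincidence matrix: every ordered pair of labels from
    # different annotators on the same item contributes 1 / (m - 1),
    # where m is the number of labels that item received.
    coincidences = Counter()
    for labels in units:
        values = [v for v in labels if v is not None]
        m = len(values)
        if m < 2:
            continue  # items with fewer than two labels carry no pairs
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per label and the overall number of pairable values.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one item with two or more labels")

    # Observed and expected disagreement (both scaled by n, which cancels).
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one label is used")
    return 1.0 - observed / expected


if __name__ == "__main__":
    # Hypothetical annotations: four comments, two annotators each.
    example = [
        ["positive", "positive"],
        ["positive", "negative"],
        ["negative", "negative"],
        ["negative", "negative"],
    ]
    print(round(krippendorff_alpha_nominal(example), 4))
```

An alpha of 1 means perfect agreement and 0 means agreement no better than chance, so a value above 0.8 indicates that the annotators labelled the comments consistently.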




Source journal

Workshop on Spoken Language Technologies for Under-resourced Languages

Self-citation rate

0.00%

Number of publications