Analysing The Sentiments Of Marathi-English Code-Mixed Social Media Data Using Machine Learning Techniques

2023 International Conference on Emerging Smart Computing and Informatics (ESCI); Pub Date: 2023-03-01; DOI: 10.1109/ESCI56872.2023.10100304

Varad Patwardhan, Gauri Takawane, Nirmayi Kelkar, Omkar Gaikwad, Rutwik Saraf, S. Sonawane


Citations: 0

Abstract

A vast amount of data is generated every day through social media platforms, and various techniques and methodologies are used to put its different forms to use. One such form is textual data produced on social media as chats, comments, and tweets. The term “code-mixed data” describes data that combines components of different languages or linguistic subgroups, such as text written in several languages or speech that shifts between languages. With the growth of social media use and worldwide communication, many individuals use multiple languages in their daily communication, which makes this type of data increasingly important. Machine translation, speech recognition, and text categorization are a few examples of natural language processing tasks that can be performed on code-mixed data. Research on code-mixed data can also aid the understanding of multilingual communication. In this paper, we present an empirical study on the problem of word-level language identification and text normalisation for Marathi-English code-mixed text. We have created a new dataset of 1009 sentences that exhibit code-mixing of Marathi (Romanised) and English textual data. This data was collected from WhatsApp chats and YouTube comments.
