Urdu Sentiment Analysis

Impact factor: 0.5, JCR Q4, category: COMPUTER SCIENCE, THEORY & METHODS
Iffraah Rehman, Tariq Rahim Soomro
DOI: 10.2478/acss-2022-0004 (https://doi.org/10.2478/acss-2022-0004)
Journal: Applied Computer Systems, pp. 30-42
Publication date: 2022-06-01
Publication type: Journal Article
Citations: 0

Abstract

The world is moving towards increasingly digitalized data, and the number of active social media users grows with each passing day. Each post and comment can offer valuable insight into a topic or issue, a product, or a brand. The process of uncovering the underlying information in the opinion that a person holds about an entity is called sentiment analysis. The analysis can be carried out through two main approaches: lexicon-based methods or machine learning algorithms. A significant amount of sentiment analysis work has been done across domains in numerous languages, but minimal research has been conducted on Urdu, the national language of Pakistan. Twitter users who are familiar with Urdu post tweets in two textual formats: Urdu script (Nastaleeq) or Roman Urdu. This paper therefore attempts to perform sentiment analysis on the Urdu language by extracting tweets (both Nastaleeq and Roman Urdu) from Twitter using the Tweepy API. A machine learning-based approach was adopted for this study, and the tool chosen for the purpose is WEKA. The best algorithm was identified based on evaluation metrics comprising the number of correctly and incorrectly classified instances, accuracy, precision, and recall. SMO was found to be the most suitable machine learning algorithm for sentiment analysis of Urdu (Nastaleeq) tweets, while for Roman Urdu the Random Forest algorithm was identified as the best.
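The evaluation metrics named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code; the study used WEKA, which reports these figures itself. The function name `evaluate`, the label strings, and the toy data are assumptions made up for illustration. The sketch counts correctly and incorrectly classified instances and computes accuracy, plus precision and recall for one class treated as the positive class.

```python
def evaluate(gold, predicted, positive_label):
    """Compute basic classification metrics.

    gold, predicted: equal-length sequences of class labels.
    positive_label: the class for which precision and recall are reported.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("at least one instance is required")

    # Correctly classified instances: prediction matches the gold label.
    correct = sum(g == p for g, p in zip(gold, predicted))
    incorrect = len(gold) - correct

    # True positives, false positives, false negatives for the chosen class.
    tp = sum(g == positive_label and p == positive_label
             for g, p in zip(gold, predicted))
    fp = sum(g != positive_label and p == positive_label
             for g, p in zip(gold, predicted))
    fn = sum(g == positive_label and p != positive_label
             for g, p in zip(gold, predicted))

    return {
        "correct": correct,
        "incorrect": incorrect,
        "accuracy": correct / len(gold),
        # Guard against division by zero when the class is never predicted
        # or never present in the gold labels.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical labels for six tweets (invented, not data from the paper).
gold = ["pos", "neg", "pos", "neu", "neg", "pos"]
predicted = ["pos", "neg", "neg", "neu", "pos", "pos"]
print(evaluate(gold, predicted, positive_label="pos"))
```

Comparing classifiers such as SMO and Random Forest on the same labelled tweets then amounts to calling the function once per classifier's predictions and comparing the resulting figures.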
Source journal: Applied Computer Systems (COMPUTER SCIENCE, THEORY & METHODS)
Self-citation rate: 10.00%
Articles published: 9
Review time: 30 weeks