Mudasar Ahmed Soomro, Rafia Naz Memon, Asghar Ali Chandio, Mehwish Leghari, Muhammad Hanif Soomro
{"title":"A dataset of Roman Urdu text with spelling variations for sentence level sentiment analysis.","authors":"Mudasar Ahmed Soomro, Rafia Naz Memon, Asghar Ali Chandio, Mehwish Leghari, Muhammad Hanif Soomro","doi":"10.1016/j.dib.2024.111170","DOIUrl":null,"url":null,"abstract":"<p><p>Roman Urdu text is very widespread on many websites. People mostly prefer to give their social comments or product reviews in Roman Urdu, and Roman Urdu is counted as non-standard language. The main reason for this is that there is no rule for word spellings within Roman Urdu words, so people create and post their own word spellings, like \"2mro\" is a nonstandard spelling for tomorrow. This paper aims to collect two Roman Urdu datasets: one is roman Urdu words with various spelling variations. This dataset contains 5244 Roman Urdu words, within which we have included variations in word spellings ranging from (one) to (five) different spellings for each word. The second dataset consists of Roman Urdu reviews, which were collected from (seven) different internet-based sources. This dataset contains multiclass reviews, namely \"very positive,\" \"positive,\" \"very negative,\" \"negative,\" and \"neutral\", respectively. We gathered a total of 28,090 reviews. The sentiments of the reviews were made by the domain experts who were familiar with the Urdu language.</p>","PeriodicalId":10973,"journal":{"name":"Data in Brief","volume":"57 ","pages":"111170"},"PeriodicalIF":1.0000,"publicationDate":"2024-11-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11683287/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Data in Brief","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1016/j.dib.2024.111170","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/12/1 0:00:00","PubModel":"eCollection","JCR":"Q3","JCRName":"MULTIDISCIPLINARY SCIENCES","Score":null,"Total":0}
A dataset of Roman Urdu text with spelling variations for sentence level sentiment analysis.
Roman Urdu text is very widespread on many websites. People mostly prefer to give their social comments or product reviews in Roman Urdu, and Roman Urdu is counted as non-standard language. The main reason for this is that there is no rule for word spellings within Roman Urdu words, so people create and post their own word spellings, like "2mro" is a nonstandard spelling for tomorrow. This paper aims to collect two Roman Urdu datasets: one is roman Urdu words with various spelling variations. This dataset contains 5244 Roman Urdu words, within which we have included variations in word spellings ranging from (one) to (five) different spellings for each word. The second dataset consists of Roman Urdu reviews, which were collected from (seven) different internet-based sources. This dataset contains multiclass reviews, namely "very positive," "positive," "very negative," "negative," and "neutral", respectively. We gathered a total of 28,090 reviews. The sentiments of the reviews were made by the domain experts who were familiar with the Urdu language.
期刊介绍:
Data in Brief provides a way for researchers to easily share and reuse each other''s datasets by publishing data articles that: -Thoroughly describe your data, facilitating reproducibility. -Make your data, which is often buried in supplementary material, easier to find. -Increase traffic towards associated research articles and data, leading to more citations. -Open up doors for new collaborations. Because you never know what data will be useful to someone else, Data in Brief welcomes submissions that describe data from all research areas.