M. Al-Kabi, Areej A. Al-Qwaqenah, Amal H. Gigieh, Kholoud Alsmearat, M. Al-Ayyoub, I. Alsmadi
2016 IEEE/ACS 13th International Conference of Computer Systems and Applications (AICCSA), published 2016-11-01. DOI: 10.1109/AICCSA.2016.7945822 (https://doi.org/10.1109/AICCSA.2016.7945822)
Building a standard dataset for Arabic sentiment analysis: Identifying potential annotation pitfalls
Sentiment Analysis (SA) is one of the most active research fields today. It is concerned with identifying the sentiment conveyed in a piece of text. Current efforts in SA require standard datasets for training and testing purposes. Such datasets already exist for some languages, such as English. Unfortunately, the same cannot be said of other languages, such as Arabic. Existing Arabic SA datasets are restricted (in their domain, size, dialects covered, etc.) and/or have limited availability. Moreover, the annotation process has not received the attention it deserves. Some of the existing datasets relied on the author's point of view for annotation, while others employed annotators but did not take into account the personal variations between the annotators and how these would affect their agreement. This study presents our efforts to build a standard Arabic dataset with the above concerns in mind. The constructed dataset is intended for generic use, as it contains reviews from different domains written in Modern Standard Arabic (MSA) as well as several dialects. The annotation process is given close attention: we study the inter-annotator agreement and investigate the potential factors affecting it.
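The abstract does not say which agreement measure the study uses. As an illustration only, the sketch below computes Cohen's kappa, one common chance-corrected measure of agreement between two annotators. The function name `cohen_kappa`, the label names, and the sample annotations are hypothetical and are not taken from the paper.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items.

    Illustrative sketch only: the paper's actual agreement measure
    is not stated in the abstract.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty item list.")

    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement by chance, from each annotator's label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)

    # Both annotators used one and the same label for every item.
    if expected == 1.0:
        return 1.0

    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Hypothetical sentiment labels for six reviews.
    annotator_1 = ["pos", "pos", "neg", "neg", "neu", "pos"]
    annotator_2 = ["pos", "neg", "neg", "neg", "neu", "pos"]
    print(round(cohen_kappa(annotator_1, annotator_2), 3))
```

A value of 1 indicates perfect agreement, and a value near 0 indicates agreement no better than chance. With more than two annotators, a measure such as Fleiss' kappa would be the usual alternative.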