Stylometric and Semantic Analysis of Demographically Diverse Non-native English Review Data
Salim Sazzed
2022 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM)
Published: 2022-11-10
DOI: https://doi.org/10.1109/ASONAM55673.2022.10068612
Citations: 2
Abstract
Demographic knowledge enables a fine-grained interpretation of user-generated review text and supports better decision-making. In this study, we aim to understand how various attributes of non-native English text vary across demographically distinct groups. We introduce a non-native English corpus of around 1150 reviews representing four demographically diverse country-specific groups: Finland, Kenya, Bangladesh, and China. The reviewers differ in several respects, including geography, native language family, race and culture, and English proficiency. We then perform stylometric and semantic analyses on these sets of reviews to reveal how linguistic characteristics differ across the demographic groups. The investigation shows that stylometric features are mostly similar across groups; nevertheless, differences are observed in attributes such as review length and the presence of articles or prepositions. We employ classical machine learning (ML) algorithms and fine-tuned transformer-based language models to classify the reviews into their demographic groups. We observe that semantic features are slightly more effective than syntactic features for distinguishing the demography-specific reviews.
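The stylometric attributes named in the abstract (review length, presence of articles, presence of prepositions) can be illustrated with a minimal sketch. This is not the author's feature extractor: the function name `stylometric_features`, the tokenization rule, and both word lists are assumptions chosen for illustration, and the preposition list is deliberately short and incomplete.

```python
import re

# Assumed word lists; the abstract does not specify which articles or
# prepositions were counted, so these are illustrative only.
ARTICLES = {"a", "an", "the"}
PREPOSITIONS = {"in", "on", "at", "of", "to", "for", "with", "by", "from", "about"}


def stylometric_features(review: str) -> dict:
    """Return review length (in word tokens) and article/preposition ratios."""
    # Lowercase the text and keep runs of letters and apostrophes as tokens.
    tokens = re.findall(r"[a-z']+", review.lower())
    n = len(tokens)
    if n == 0:
        # Avoid division by zero for empty or non-alphabetic reviews.
        return {"length": 0, "article_ratio": 0.0, "preposition_ratio": 0.0}
    articles = sum(t in ARTICLES for t in tokens)
    prepositions = sum(t in PREPOSITIONS for t in tokens)
    return {
        "length": n,
        "article_ratio": articles / n,
        "preposition_ratio": prepositions / n,
    }


if __name__ == "__main__":
    # Hypothetical example review, not taken from the corpus.
    print(stylometric_features("The food at the hotel was good"))
```

Averaging such per-review values within each country-specific group would give one simple way to compare groups on these attributes; the study's actual feature set and comparison procedure may differ.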