{"title":"基于预训练模型的数据集构建和意见持有者检测","authors":"Al- Mahmud, Kazutaka Shimada","doi":"10.52731/ijskm.v7.i2.779","DOIUrl":null,"url":null,"abstract":"With the growing prevalence of the Internet, increasingly more people and entities express opinions on online platforms, such as Facebook, Twitter, and Amazon. As it is becoming impossible to detect online opinion trends manually, an automatic approach to detect opinion holders is essential as a means to identify specific concerns regarding a particular topic, product, or problem. Opinion holder detection comprises two steps: the presence of opinion holders in text and identification of opinion holders. The present study examines both steps. Initially, we approach this task as a binary classification problem: INSIDE or OUTSIDE. Then, we consider the identification of opinion holders as a sequence labeling task and prepare an appropriate English-language dataset. Subsequently, we employ three pre-trained models for the opinion holder detection task: BERT, DistilBERT, and contextual string embedding (CSE). For the binary classification task, we employ a logistic regression model on the top layers of the BERT and DistilBERT models. We compare the models’ performance in terms of the F1 score and accuracy. Experimental results show that DistilBERT obtained superior performance, with an F1 score of 0.901 and an accuracy of 0.924. For the opinion holder identification task, we utilize both feature- and fine-tuning-based architectures. Furthermore, we combined CSE and the conditional random field (CRF) with BERT and DistilBERT. For the feature-based architecture, we utilize five models: CSE+CRF, BERT+CRF, (BERT&CSE)+CRF, DistilBERT+CRF, and (DistilBERT&CSE)+CRF. For the fine-tuning-based architecture, we utilize six models: BERT, BERT+CRF, (BERT&CSE)+CRF, DistilBERT, DistilBERT+CRF, and (DistilBERT&CSE)+CRF. All language models are evaluated in terms of F1 score and processing time. The experimental results indicate that both the feature- and fine-tuning-based (DistilBERT&CSE)+CRF models jointly yielded the optimal performance, with an F1 score of 0.9453. However, feature-based CSE+CRF incurred the lowest processing time of 49 s while yielding a comparable F1 score to that obtained by the optimal-performing models.","PeriodicalId":487422,"journal":{"name":"International journal of service and knowledge management","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Dataset Construction and Opinion Holder Detection Using Pre-trained Models\",\"authors\":\"Al- Mahmud, Kazutaka Shimada\",\"doi\":\"10.52731/ijskm.v7.i2.779\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"With the growing prevalence of the Internet, increasingly more people and entities express opinions on online platforms, such as Facebook, Twitter, and Amazon. As it is becoming impossible to detect online opinion trends manually, an automatic approach to detect opinion holders is essential as a means to identify specific concerns regarding a particular topic, product, or problem. Opinion holder detection comprises two steps: the presence of opinion holders in text and identification of opinion holders. The present study examines both steps. Initially, we approach this task as a binary classification problem: INSIDE or OUTSIDE. 
Then, we consider the identification of opinion holders as a sequence labeling task and prepare an appropriate English-language dataset. Subsequently, we employ three pre-trained models for the opinion holder detection task: BERT, DistilBERT, and contextual string embedding (CSE). For the binary classification task, we employ a logistic regression model on the top layers of the BERT and DistilBERT models. We compare the models’ performance in terms of the F1 score and accuracy. Experimental results show that DistilBERT obtained superior performance, with an F1 score of 0.901 and an accuracy of 0.924. For the opinion holder identification task, we utilize both feature- and fine-tuning-based architectures. Furthermore, we combined CSE and the conditional random field (CRF) with BERT and DistilBERT. For the feature-based architecture, we utilize five models: CSE+CRF, BERT+CRF, (BERT&CSE)+CRF, DistilBERT+CRF, and (DistilBERT&CSE)+CRF. For the fine-tuning-based architecture, we utilize six models: BERT, BERT+CRF, (BERT&CSE)+CRF, DistilBERT, DistilBERT+CRF, and (DistilBERT&CSE)+CRF. All language models are evaluated in terms of F1 score and processing time. The experimental results indicate that both the feature- and fine-tuning-based (DistilBERT&CSE)+CRF models jointly yielded the optimal performance, with an F1 score of 0.9453. However, feature-based CSE+CRF incurred the lowest processing time of 49 s while yielding a comparable F1 score to that obtained by the optimal-performing models.\",\"PeriodicalId\":487422,\"journal\":{\"name\":\"International journal of service and knowledge management\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-01-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"International journal of service and knowledge management\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.52731/ijskm.v7.i2.779\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"International journal of service and knowledge management","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.52731/ijskm.v7.i2.779","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Dataset Construction and Opinion Holder Detection Using Pre-trained Models
With the growing prevalence of the Internet, ever more people and entities express opinions on online platforms such as Facebook, Twitter, and Amazon. Because online opinion trends can no longer be tracked manually, an automatic approach to detecting opinion holders is essential for identifying specific concerns about a particular topic, product, or problem. Opinion holder detection comprises two steps: determining whether a text contains an opinion holder, and identifying that holder. The present study addresses both steps. First, we treat holder presence as a binary classification problem with two labels, INSIDE and OUTSIDE. Second, we cast opinion holder identification as a sequence labeling task and construct an appropriate English-language dataset. We employ three pre-trained models for opinion holder detection: BERT, DistilBERT, and contextual string embeddings (CSE). For the binary classification task, we place a logistic regression classifier on top of the BERT and DistilBERT encoders and compare the models in terms of F1 score and accuracy. Experimental results show that DistilBERT performed best, with an F1 score of 0.901 and an accuracy of 0.924. For the opinion holder identification task, we use both feature-based and fine-tuning-based architectures, combining CSE and a conditional random field (CRF) layer with BERT and DistilBERT. The feature-based architecture comprises five models: CSE+CRF, BERT+CRF, (BERT&CSE)+CRF, DistilBERT+CRF, and (DistilBERT&CSE)+CRF. The fine-tuning-based architecture comprises six models: BERT, BERT+CRF, (BERT&CSE)+CRF, DistilBERT, DistilBERT+CRF, and (DistilBERT&CSE)+CRF. All models are evaluated in terms of F1 score and processing time. The feature-based and fine-tuning-based (DistilBERT&CSE)+CRF models jointly achieved the best performance, with an F1 score of 0.9453, whereas the feature-based CSE+CRF model required the shortest processing time (49 s) while yielding an F1 score comparable to that of the best-performing models.
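For readers who want to experiment with the pipeline described above, the following is a minimal sketch, not the authors' released code, of the first step: feature-based binary classification of opinion-holder presence, with a frozen DistilBERT encoder supplying sentence features and a scikit-learn logistic regression classifier on top. The checkpoint name, the choice of the [CLS] vector as the sentence feature, and the toy data are all illustrative assumptions.

    import torch
    from sklearn.linear_model import LogisticRegression
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    encoder = AutoModel.from_pretrained("distilbert-base-uncased")
    encoder.eval()  # frozen encoder: feature extraction only, no fine-tuning

    def embed(sentences):
        """Return the final-layer [CLS] vector of each sentence as its feature."""
        with torch.no_grad():
            batch = tokenizer(sentences, padding=True, truncation=True,
                              return_tensors="pt")
            hidden = encoder(**batch).last_hidden_state  # (batch, seq_len, 768)
            return hidden[:, 0, :].numpy()               # [CLS] position

    # Toy data (assumed): 1 = sentence names an opinion holder, 0 = it does not.
    train_texts = ["John said the camera is great.",
                   "The package arrived on Monday."]
    train_labels = [1, 0]

    clf = LogisticRegression(max_iter=1000)
    clf.fit(embed(train_texts), train_labels)
    print(clf.predict(embed(["Mary thinks the service is slow."])))

The identification step can likewise be sketched as a token-level tagger that feeds DistilBERT's contextual embeddings through a linear layer into a CRF, in the spirit of the DistilBERT+CRF models named in the abstract. This sketch uses the third-party pytorch-crf package; the class name, tag inventory, and hyperparameters are assumptions, and the authors' actual models (including the CSE stacking) may be wired differently.

    import torch
    import torch.nn as nn
    from torchcrf import CRF
    from transformers import AutoModel

    class HolderTagger(nn.Module):
        def __init__(self, num_tags, model_name="distilbert-base-uncased"):
            super().__init__()
            self.encoder = AutoModel.from_pretrained(model_name)
            self.emit = nn.Linear(self.encoder.config.hidden_size, num_tags)
            self.crf = CRF(num_tags, batch_first=True)

        def forward(self, input_ids, attention_mask, tags=None):
            # Per-token emission scores from the contextual embeddings.
            states = self.encoder(input_ids=input_ids,
                                  attention_mask=attention_mask).last_hidden_state
            emissions = self.emit(states)
            mask = attention_mask.bool()
            if tags is not None:  # training: negative log-likelihood of gold tags
                return -self.crf(emissions, tags, mask=mask)
            return self.crf.decode(emissions, mask=mask)  # best tag path per sentence

    # e.g., tagger = HolderTagger(num_tags=3)  # assumed BIO scheme: B-HOLDER, I-HOLDER, O

The CRF layer is what distinguishes these models from plain token classification: it scores entire tag sequences, so transitions such as I-HOLDER directly after O can be penalized globally rather than per token.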