Dataset Construction and Opinion Holder Detection Using Pre-trained Models

Al-Mahmud, Kazutaka Shimada
{"title":"Dataset Construction and Opinion Holder Detection Using Pre-trained Models","authors":"Al- Mahmud, Kazutaka Shimada","doi":"10.52731/ijskm.v7.i2.779","DOIUrl":null,"url":null,"abstract":"With the growing prevalence of the Internet, increasingly more people and entities express opinions on online platforms, such as Facebook, Twitter, and Amazon. As it is becoming impossible to detect online opinion trends manually, an automatic approach to detect opinion holders is essential as a means to identify specific concerns regarding a particular topic, product, or problem. Opinion holder detection comprises two steps: the presence of opinion holders in text and identification of opinion holders. The present study examines both steps. Initially, we approach this task as a binary classification problem: INSIDE or OUTSIDE. Then, we consider the identification of opinion holders as a sequence labeling task and prepare an appropriate English-language dataset. Subsequently, we employ three pre-trained models for the opinion holder detection task: BERT, DistilBERT, and contextual string embedding (CSE). For the binary classification task, we employ a logistic regression model on the top layers of the BERT and DistilBERT models. We compare the models’ performance in terms of the F1 score and accuracy. Experimental results show that DistilBERT obtained superior performance, with an F1 score of 0.901 and an accuracy of 0.924. For the opinion holder identification task, we utilize both feature- and fine-tuning-based architectures. Furthermore, we combined CSE and the conditional random field (CRF) with BERT and DistilBERT. For the feature-based architecture, we utilize five models: CSE+CRF, BERT+CRF, (BERT&CSE)+CRF, DistilBERT+CRF, and (DistilBERT&CSE)+CRF. For the fine-tuning-based architecture, we utilize six models: BERT, BERT+CRF, (BERT&CSE)+CRF, DistilBERT, DistilBERT+CRF, and (DistilBERT&CSE)+CRF. All language models are evaluated in terms of F1 score and processing time. The experimental results indicate that both the feature- and fine-tuning-based (DistilBERT&CSE)+CRF models jointly yielded the optimal performance, with an F1 score of 0.9453. However, feature-based CSE+CRF incurred the lowest processing time of 49 s while yielding a comparable F1 score to that obtained by the optimal-performing models.","PeriodicalId":487422,"journal":{"name":"International journal of service and knowledge management","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"International journal of service and knowledge management","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.52731/ijskm.v7.i2.779","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

With the growing prevalence of the Internet, more and more people and entities express opinions on online platforms such as Facebook, Twitter, and Amazon. As it is becoming impossible to track online opinion trends manually, an automatic approach to detecting opinion holders is essential as a means to identify specific concerns regarding a particular topic, product, or problem. Opinion holder detection comprises two steps: detecting the presence of opinion holders in a text and identifying those opinion holders. The present study examines both steps. Initially, we approach the presence-detection task as a binary classification problem: INSIDE or OUTSIDE. Then, we treat the identification of opinion holders as a sequence labeling task and prepare an appropriate English-language dataset. Subsequently, we employ three pre-trained models for the opinion holder detection task: BERT, DistilBERT, and contextual string embeddings (CSE). For the binary classification task, we place a logistic regression model on top of the BERT and DistilBERT encoders and compare the models' performance in terms of F1 score and accuracy. Experimental results show that DistilBERT performs best, with an F1 score of 0.901 and an accuracy of 0.924. For the opinion holder identification task, we use both feature-based and fine-tuning-based architectures, combining CSE and a conditional random field (CRF) layer with BERT and DistilBERT. The feature-based architecture covers five models: CSE+CRF, BERT+CRF, (BERT&CSE)+CRF, DistilBERT+CRF, and (DistilBERT&CSE)+CRF. The fine-tuning-based architecture covers six models: BERT, BERT+CRF, (BERT&CSE)+CRF, DistilBERT, DistilBERT+CRF, and (DistilBERT&CSE)+CRF. All models are evaluated in terms of F1 score and processing time. The experimental results indicate that the (DistilBERT&CSE)+CRF model yields the best performance under both architectures, with an F1 score of 0.9453. However, the feature-based CSE+CRF model requires the lowest processing time (49 s) while achieving an F1 score comparable to that of the best-performing models.
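To make the two pipelines concrete, the sketches below show how they might be assembled with publicly available libraries. Both are illustrative reconstructions from the abstract, not the authors' released code; the model checkpoints, hyperparameters, and the example texts and labels are assumptions. The first sketch covers the binary presence-detection step: a frozen DistilBERT encoder supplies sentence representations (here, the first-token hidden state), and a logistic regression classifier is trained on top.

```python
# Sketch 1 (assumed setup): INSIDE/OUTSIDE presence classification with a
# frozen DistilBERT feature extractor and logistic regression on top.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
encoder = AutoModel.from_pretrained("distilbert-base-uncased")
encoder.eval()

def embed(sentences):
    """Return the first-token ([CLS]) hidden state for each sentence."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state  # (batch, seq_len, 768)
    return hidden[:, 0].numpy()

# Hypothetical placeholder data: 1 = INSIDE (an opinion holder is present),
# 0 = OUTSIDE (no opinion holder in the sentence).
train_texts = ["The CEO said the product will improve.", "It rained all day."]
train_labels = [1, 0]

clf = LogisticRegression(max_iter=1000)
clf.fit(embed(train_texts), train_labels)
print(clf.predict(embed(["Critics argue the policy is flawed."])))
```

For the identification step, contextual string embeddings are the character-level language-model embeddings provided by the Flair library, whose `SequenceTagger` can stack them with transformer embeddings and decode with a CRF layer. The following is a minimal sketch of a feature-based (DistilBERT&CSE)+CRF tagger, assuming a BIO-tagged corpus in CoNLL column format; the data path, tag names (e.g. B-HOLDER/I-HOLDER/O under a "holder" label type), and training parameters are placeholders, not the authors' configuration.

```python
# Sketch 2 (assumed setup): feature-based (DistilBERT & CSE) + CRF tagger in Flair.
from flair.datasets import ColumnCorpus
from flair.embeddings import FlairEmbeddings, StackedEmbeddings, TransformerWordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# Hypothetical corpus layout: one token and one BIO tag per line.
corpus = ColumnCorpus("data/", {0: "text", 1: "holder"})
tag_dictionary = corpus.make_label_dictionary(label_type="holder")

embeddings = StackedEmbeddings([
    TransformerWordEmbeddings("distilbert-base-uncased", fine_tune=False),  # frozen = feature-based
    FlairEmbeddings("news-forward"),   # contextual string embeddings (forward LM)
    FlairEmbeddings("news-backward"),  # contextual string embeddings (backward LM)
])

tagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=tag_dictionary,
    tag_type="holder",
    use_crf=True,  # CRF decoding layer
)

ModelTrainer(tagger, corpus).train("models/holder-tagger", max_epochs=10)
```

Switching `fine_tune=False` to `fine_tune=True` in `TransformerWordEmbeddings` would correspond to the fine-tuning-based variant, in which the transformer weights are updated during training rather than used as fixed features.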