Sentiment Analysis of COVID-19 Vaccines in Indonesia on Twitter Using Pre-Trained and Self-Training Word Embeddings

Jurnal Ilmu Komputer dan Informasi. Pub Date: 2022-02-27. DOI: 10.21609/jiki.v15i1.1044

Kartikasari Kusuma Agustiningsih, Ema Utami, Muhammad Altoumi Alsyaibani


Citations: 6

Abstract



Sentiment regarding the COVID-19 vaccine can be analyzed from social media, because users commonly express their opinions there. Twitter is one of the social media platforms most often used by Indonesians to express their opinions. This study uses a Bidirectional LSTM combined with word embeddings, testing fastText and GloVe. We created eight test scenarios to examine the performance of the word embeddings, using both pre-trained and self-trained embedding vectors. The dataset gathered from Twitter was prepared in two versions, stemmed and unstemmed. In the GloVe scenario group, the highest accuracy, 92.5%, came from the model that used self-trained GloVe and was trained on the unstemmed dataset. In the fastText scenario group, the highest accuracy, 92.3%, came from the model that used self-trained fastText and was trained on the stemmed dataset. Scenarios that used pre-trained embedding vectors achieved lower accuracy than those that used self-trained vectors, because the pre-trained embeddings were trained on the Wikipedia corpus, which contains standard, well-structured language, whereas the dataset in this study came from Twitter, which contains non-standard sentences. Even though the dataset was processed with stemming and a slang-word dictionary, the pre-trained embeddings still could not recognize several words in our dataset.
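The experimental design and the vocabulary-mismatch argument in the abstract can be illustrated with a minimal Python sketch. It rests on assumptions that are not stated in the source: the eight scenarios are taken to be the full crossing of embedding type, embedding source, and dataset variant (2 x 2 x 2), and an embedding is modeled as a plain lookup vocabulary. The function names and the toy tokens are hypothetical and are not taken from the paper's data or code.

```python
from itertools import product

# Assumed factors: the abstract names two embedding types, two embedding
# sources, and two dataset variants. Crossing them fully is an inference
# that happens to yield the 8 scenarios the abstract mentions.
EMBEDDINGS = ("fastText", "GloVe")
SOURCES = ("pre-trained", "self-trained")
DATASETS = ("stemmed", "unstemmed")


def build_scenarios():
    """Return every (embedding, source, dataset) combination."""
    return list(product(EMBEDDINGS, SOURCES, DATASETS))


def oov_rate(tokens, vocabulary):
    """Fraction of distinct tokens that have no entry in the vocabulary."""
    distinct = set(tokens)
    if not distinct:
        return 0.0
    missing = [t for t in distinct if t not in vocabulary]
    return len(missing) / len(distinct)


if __name__ == "__main__":
    print(len(build_scenarios()), "scenarios")

    # Hypothetical toy data: informal tweet tokens versus a formal,
    # Wikipedia-style vocabulary and a Twitter-derived vocabulary.
    tweet_tokens = ["vaksin", "gak", "sakit", "bgt", "vaksin"]
    wiki_vocab = {"vaksin", "sakit", "tidak", "sangat"}
    twitter_vocab = {"vaksin", "sakit", "gak", "bgt"}
    print("pre-trained OOV rate:", oov_rate(tweet_tokens, wiki_vocab))
    print("self-trained OOV rate:", oov_rate(tweet_tokens, twitter_vocab))
```

In this toy setup the informal tokens are missing from the formal vocabulary but present in the one built from tweets, which is the kind of gap the abstract offers as the reason pre-trained vectors performed worse. It says nothing about the size of that gap in the actual dataset.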

