A Review on Sentiment Analysis Model for Chinese Weibo Text
Dawei Wang, Rayner Alfred
2020 3rd International Conference on Advanced Electronic Materials, Computers and Software Engineering (AEMCSE), published 2020-04-01
DOI: https://doi.org/10.1109/AEMCSE50948.2020.00105
Citations: 3
Abstract
Sentiment analysis of Chinese Weibo text is a complex, multi-stage process. It generally has three parts: data cleaning, word segmentation, and feature extraction. Weibo text is unstructured and contains a great deal of non-standard content, so it should be cleaned thoroughly before feature extraction. Because emoticons in Weibo text are very useful for sentiment analysis, data cleaning should remove all non-Chinese content and content marked with "@" or "#", but keep the emoticons. Word segmentation algorithms fall into three categories: string-matching-based, understanding-based, and statistics-based [1]. For feature extraction, lexicon-based models, machine learning models, and deep learning models are the usual choices. The literature search shows that lexicon-based models account thoroughly for the grammatical characteristics of Chinese Weibo text, for example sentiment words, degree adverbs, negation words, and various Chinese sentence patterns, and handle them programmatically. However, because lexicon-based models generalize poorly, their experimental performance is not good and still needs to be improved. For traditional machine learning, innovation has two main aspects: combined classifiers (AdaBoost+SVM) and improvements to classical classification algorithms. It is worth noting that the performance of some improved classifiers (SVM, P Naïve Bayes) has not yet been verified on Chinese Weibo classification. For deep learning, innovation currently focuses mainly on the convolution layer and on introducing attention mechanisms. As next steps, YuanHejin suggests introducing ensemble learning and improving the attention mechanism, LuXin argues that recognition of ironic sentences with context in Weibo needs to improve, and GaoWeiju suggests building an individual sentiment space for each user in the EMCNN model.
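The cleaning rule in the abstract (drop non-Chinese content and "@"/"#" content, keep emoticons) can be sketched in a few lines of Python. This is only an illustration, not the procedure used in the reviewed papers. It assumes that emoticons appear as short bracketed names such as `[哈哈]`, that hashtags are written as paired `#topic#`, and that "Chinese" means the CJK Unified Ideographs range U+4E00 to U+9FFF. The function name `clean_weibo` and all patterns are hypothetical.

```python
import re

# Assumption: emoticons are short bracketed names, e.g. [哈哈].
# The capturing group makes re.split keep them in the result.
EMOTICON = re.compile(r"(\[[^\[\]\s]{1,8}\])")
# Assumption: hashtags are paired, e.g. #周末#.
HASHTAG = re.compile(r"#[^#]+#")
# Mentions run until the first non-word character (usually a space).
MENTION = re.compile(r"@[\w\-]+")
URL = re.compile(r"https?://\S+")
# Anything outside the basic CJK Unified Ideographs block.
NON_CHINESE = re.compile(r"[^\u4e00-\u9fff]")


def clean_weibo(text: str) -> str:
    """Remove URLs, hashtags, mentions and non-Chinese characters,
    but leave bracketed emoticons untouched."""
    text = URL.sub("", text)
    text = HASHTAG.sub("", text)
    text = MENTION.sub("", text)
    # With a capturing group, re.split returns emoticons at odd indices.
    parts = EMOTICON.split(text)
    cleaned = [
        part if i % 2 else NON_CHINESE.sub("", part)
        for i, part in enumerate(parts)
    ]
    return "".join(cleaned)


if __name__ == "__main__":
    print(clean_weibo("@小明 今天天气真好!#周末# [哈哈] http://t.cn/abc good"))
```

One design note: a mention with no trailing space (for example `@小明今天`) would swallow the following Chinese text under this pattern, so a real pipeline would need a stricter rule for where a user name ends.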
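The abstract notes that lexicon-based models handle sentiment words, degree adverbs, and negation words explicitly. A minimal sketch of that idea follows, assuming the text has already been segmented into tokens. The three tiny lexicons, their weights, the look-back window, and the name `lexicon_score` are all hypothetical and chosen for illustration; they are not taken from any of the reviewed systems.

```python
# Hypothetical toy lexicons; real systems rely on much larger resources.
SENTIMENT = {"好": 1.0, "喜欢": 1.0, "差": -1.0, "讨厌": -1.0}
DEGREE = {"非常": 2.0, "很": 1.5, "有点": 0.5}
NEGATION = {"不", "没", "没有"}


def lexicon_score(tokens, window=2):
    """Sum the polarity of each sentiment word, scaled by degree adverbs
    and sign-flipped by negation words found in the preceding `window` tokens."""
    score = 0.0
    for i, token in enumerate(tokens):
        if token not in SENTIMENT:
            continue
        weight = SENTIMENT[token]
        # Look back at up to `window` tokens for modifiers.
        for prev in tokens[max(0, i - window):i]:
            if prev in DEGREE:
                weight *= DEGREE[prev]
            elif prev in NEGATION:
                weight = -weight
        score += weight
    return score


if __name__ == "__main__":
    print(lexicon_score(["我", "非常", "喜欢", "这", "部", "电影"]))
    print(lexicon_score(["服务", "不", "好"]))
```

A fixed look-back window is the simplest possible rule. It misses modifiers that sit further away and ignores sentence patterns, which is consistent with the abstract's point that lexicon-based models need hand-built handling for many grammatical cases and generalize poorly.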