A Text Augmentation Approach using Similarity Measures based on Neural Sentence Embeddings for Emotion Classification on Microblogs

Yong Kuan Shyang, Jasy Liew Suet Yan
2020 IEEE 2nd International Conference on Artificial Intelligence in Engineering and Technology (IICAIET), published 2020-09-26
DOI: 10.1109/IICAIET49801.2020.9257826

Abstract

Machine learning models for fine-grained emotion classification can benefit from a larger pool of training data, but manually expanding the emotion corpus for training is labor-intensive and time-consuming. While distant supervision provides a viable alternative, the self-labeled emotion corpus is susceptible to a high level of noise. This paper introduces a text augmentation method that efficiently expands the set of positive training examples by harnessing tweets collected through distant supervision (DS) that are similar to a small set of gold standard seed tweets. Tweets labeled with happiness in EmoTweet-28 (ET) are used as gold standard seeds, and the training data is augmented with similar DS tweets containing the happiness hashtags. Three pre-trained sentence encoders are used to encode the tweets into multidimensional vectors for similarity scoring between each DS:ET-seed pair. DS tweets with similarity scores exceeding a predefined threshold are added to an augmented set that is subsequently used to train a linear SVM classifier to distinguish between happiness and non-happiness. Our proposed text augmentation method proved to be a more effective approach that leverages larger quantities of quality training data contributed by both carefully curated and distant supervision emotion corpora.
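
The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not name the similarity measure or say how the scores for each DS:ET-seed pair are combined, so the sketch assumes cosine similarity and keeps a DS tweet when its best score against any seed exceeds the threshold. The function names, the toy 3-dimensional vectors (standing in for sentence-encoder output), and the threshold value are all hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_augmented(
    ds_vectors: Sequence[Vector],
    seed_vectors: Sequence[Vector],
    threshold: float,
) -> List[int]:
    """Return indices of DS tweets whose best similarity to any seed exceeds threshold.

    Assumption: taking the maximum over seeds is one plausible way to combine
    the per-pair scores; the abstract does not specify the rule.
    """
    selected = []
    for i, ds_vec in enumerate(ds_vectors):
        best = max((cosine_similarity(ds_vec, s) for s in seed_vectors), default=0.0)
        if best > threshold:
            selected.append(i)
    return selected


# Toy embeddings standing in for encoder output (hypothetical values).
seed_vectors = [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0]]
ds_vectors = [
    [0.9, 0.1, 0.0],  # close to the first seed
    [0.0, 0.0, 1.0],  # orthogonal to both seeds
    [0.6, 0.8, 0.0],  # moderately close to the second seed
]

kept = select_augmented(ds_vectors, seed_vectors, threshold=0.95)
print(kept)
```

In the pipeline the abstract describes, the selected DS tweets would then join the seed tweets as positive examples for training the linear SVM classifier; that training step is omitted here.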