Title: A Text Augmentation Approach using Similarity Measures based on Neural Sentence Embeddings for Emotion Classification on Microblogs
Authors: Yong Kuan Shyang, Jasy Liew Suet Yan
Venue: 2020 IEEE 2nd International Conference on Artificial Intelligence in Engineering and Technology (IICAIET)
Published: 2020-09-26
DOI: https://doi.org/10.1109/IICAIET49801.2020.9257826
Citations: 1
Abstract
Machine learning models for fine-grained emotion classification can benefit from a larger pool of training data, but manually expanding an emotion corpus is labor-intensive and time-consuming. While distant supervision provides a viable alternative, the resulting self-labeled emotion corpus is susceptible to a high level of noise. This paper introduces a text augmentation method that efficiently expands the set of positive training examples by harnessing tweets collected through distant supervision (DS) that are similar to a small set of gold standard seed tweets. Tweets labeled with happiness in EmoTweet-28 (ET) serve as the gold standard seeds, and the training data is augmented with similar DS tweets containing the happiness hashtags. Three pre-trained sentence encoders are used to encode the tweets into multidimensional vectors, and a similarity score is computed for each DS:ET-seed pair. DS tweets with similarity scores exceeding a predefined threshold are added to an augmented set, which is subsequently used to train a linear SVM classifier to distinguish between happiness and non-happiness. The proposed text augmentation method proved to be a more effective approach that can leverage quality training data in larger quantities, drawing on both carefully curated and distant supervision emotion corpora.
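The selection step described in the abstract can be illustrated with a minimal sketch. Several details here are assumptions, since the abstract does not specify them: cosine similarity as the similarity measure, taking the maximum score over all seeds for each DS tweet, the threshold value of 0.8, and the toy three-dimensional vectors standing in for real sentence-encoder output. The function names are hypothetical, and the subsequent linear SVM training step is omitted.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_similar_ds_tweets(
    ds_vectors: Dict[str, Sequence[float]],
    seed_vectors: List[Sequence[float]],
    threshold: float,
) -> List[str]:
    """Return ids of DS tweets whose best similarity to any seed exceeds the threshold.

    Assumption: a DS tweet is scored by its maximum similarity over all seeds.
    seed_vectors must be non-empty.
    """
    selected = []
    for tweet_id, vector in ds_vectors.items():
        best_score = max(cosine_similarity(vector, seed) for seed in seed_vectors)
        if best_score > threshold:
            selected.append(tweet_id)
    return selected


# Toy vectors standing in for sentence embeddings (illustrative values only).
seed_vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
ds_vectors = {
    "ds_1": [0.9, 0.1, 0.0],  # close to the first seed
    "ds_2": [0.0, 0.0, 1.0],  # orthogonal to both seeds
    "ds_3": [0.1, 0.8, 0.1],  # close to the second seed
}

# The threshold of 0.8 is a hypothetical value chosen for this example.
augmented_ids = select_similar_ds_tweets(ds_vectors, seed_vectors, threshold=0.8)
print(augmented_ids)
```

In a full pipeline, the selected DS tweets would be merged with the gold standard positives before training the classifier. Raising the threshold trades augmentation volume for lower label noise.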