情感识别的语言嵌入。

IF 2.5 2区计算机科学 Q1 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS

EPJ Data Science Pub Date : 2023-01-01 Epub Date: 2023-11-16 DOI:10.1140/epjds/s13688-023-00427-0

Segun Taofeek Aroyehun, Lukas Malik, Hannah Metzler, Nikolas Haimerl, Anna Di Natale, David Garcia

{"title":"情感识别的语言嵌入。","authors":"Segun Taofeek Aroyehun, Lukas Malik, Hannah Metzler, Nikolas Haimerl, Anna Di Natale, David Garcia","doi":"10.1140/epjds/s13688-023-00427-0","DOIUrl":null,"url":null,"abstract":"The wealth of text data generated by social media has enabled new kinds of analysis of emotions with language models. These models are often trained on small and costly datasets of text annotations produced by readers who guess the emotions expressed by others in social media posts. This affects the quality of emotion identification methods due to training data size limitations and noise in the production of labels used in model development. We present LEIA, a model for emotion identification in text that has been trained on a dataset of more than 6 million posts with self-annotated emotion labels for happiness, affection, sadness, anger, and fear. LEIA is based on a word masking method that enhances the learning of emotion words during model pre-training. LEIA achieves macro-F1 values of approximately 73 on three in-domain test datasets, outperforming other supervised and unsupervised methods in a strong benchmark that shows that LEIA generalizes across posts, users, and time periods. We further perform an out-of-domain evaluation on five different datasets of social media and other sources, showing LEIA's robust performance across media, data collection methods, and annotation schemes. Our results show that LEIA generalizes its classification of anger, happiness, and sadness beyond the domain it was trained on. LEIA can be applied in future research to provide better identification of emotions in text from the perspective of the writer.","PeriodicalId":11887,"journal":{"name":"EPJ Data Science","volume":"12 1","pages":"52"},"PeriodicalIF":2.5000,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10654159/pdf/","citationCount":"0","resultStr":"{\"title\":\"LEIA: Linguistic Embeddings for the Identification of Affect.\",\"authors\":\"Segun Taofeek Aroyehun, Lukas Malik, Hannah Metzler, Nikolas Haimerl, Anna Di Natale, David Garcia\",\"doi\":\"10.1140/epjds/s13688-023-00427-0\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The wealth of text data generated by social media has enabled new kinds of analysis of emotions with language models. These models are often trained on small and costly datasets of text annotations produced by readers who guess the emotions expressed by others in social media posts. This affects the quality of emotion identification methods due to training data size limitations and noise in the production of labels used in model development. We present LEIA, a model for emotion identification in text that has been trained on a dataset of more than 6 million posts with self-annotated emotion labels for happiness, affection, sadness, anger, and fear. LEIA is based on a word masking method that enhances the learning of emotion words during model pre-training. LEIA achieves macro-F1 values of approximately 73 on three in-domain test datasets, outperforming other supervised and unsupervised methods in a strong benchmark that shows that LEIA generalizes across posts, users, and time periods. We further perform an out-of-domain evaluation on five different datasets of social media and other sources, showing LEIA's robust performance across media, data collection methods, and annotation schemes. Our results show that LEIA generalizes its classification of anger, happiness, and sadness beyond the domain it was trained on. LEIA can be applied in future research to provide better identification of emotions in text from the perspective of the writer.\",\"PeriodicalId\":11887,\"journal\":{\"name\":\"EPJ Data Science\",\"volume\":\"12 1\",\"pages\":\"52\"},\"PeriodicalIF\":2.5000,\"publicationDate\":\"2023-01-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10654159/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"EPJ Data Science\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1140/epjds/s13688-023-00427-0\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2023/11/16 0:00:00\",\"PubModel\":\"Epub\",\"JCR\":\"Q1\",\"JCRName\":\"MATHEMATICS, INTERDISCIPLINARY APPLICATIONS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"EPJ Data Science","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1140/epjds/s13688-023-00427-0","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2023/11/16 0:00:00","PubModel":"Epub","JCR":"Q1","JCRName":"MATHEMATICS, INTERDISCIPLINARY APPLICATIONS","Score":null,"Total":0}

引用次数: 0

摘要

社交媒体产生的大量文本数据使得用语言模型分析情绪成为可能。这些模型通常是在小而昂贵的文本注释数据集上进行训练的，这些数据集是由读者在社交媒体帖子中猜测他人表达的情绪而产生的。由于训练数据大小的限制和模型开发中使用的标签生产中的噪声，这影响了情绪识别方法的质量。我们提出了LEIA，这是一个用于文本情感识别的模型，该模型已经在超过600万篇文章的数据集上进行了训练，这些帖子具有自我注释的情绪标签，包括快乐、情感、悲伤、愤怒和恐惧。LEIA是一种基于词掩蔽的方法，该方法在模型预训练过程中增强了情绪词的学习。LEIA在三个域内测试数据集上实现了大约73的宏f1值，在一个强大的基准测试中优于其他有监督和无监督的方法，这表明LEIA泛化了帖子、用户和时间段。我们进一步对社交媒体和其他来源的五个不同数据集进行了域外评估，显示了LEIA在媒体、数据收集方法和注释方案上的稳健性能。我们的结果表明，LEIA将其对愤怒、快乐和悲伤的分类推广到了它所训练的领域之外。LEIA可以应用于未来的研究，从作者的角度更好地识别文本中的情绪。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

LEIA: Linguistic Embeddings for the Identification of Affect.

查看原文本刊更多论文

LEIA: Linguistic Embeddings for the Identification of Affect.

The wealth of text data generated by social media has enabled new kinds of analysis of emotions with language models. These models are often trained on small and costly datasets of text annotations produced by readers who guess the emotions expressed by others in social media posts. This affects the quality of emotion identification methods due to training data size limitations and noise in the production of labels used in model development. We present LEIA, a model for emotion identification in text that has been trained on a dataset of more than 6 million posts with self-annotated emotion labels for happiness, affection, sadness, anger, and fear. LEIA is based on a word masking method that enhances the learning of emotion words during model pre-training. LEIA achieves macro-F1 values of approximately 73 on three in-domain test datasets, outperforming other supervised and unsupervised methods in a strong benchmark that shows that LEIA generalizes across posts, users, and time periods. We further perform an out-of-domain evaluation on five different datasets of social media and other sources, showing LEIA's robust performance across media, data collection methods, and annotation schemes. Our results show that LEIA generalizes its classification of anger, happiness, and sadness beyond the domain it was trained on. LEIA can be applied in future research to provide better identification of emotions in text from the perspective of the writer.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

EPJ Data Science MATHEMATICS, INTERDISCIPLINARY APPLICATIONS -

CiteScore

6.10

自引率

5.60%

发文量

审稿时长

13 weeks

期刊介绍： EPJ Data Science covers a broad range of research areas and applications and particularly encourages contributions from techno-socio-economic systems, where it comprises those research lines that now regard the digital “tracks” of human beings as first-order objects for scientific investigation. Topics include, but are not limited to, human behavior, social interaction (including animal societies), economic and financial systems, management and business networks, socio-technical infrastructure, health and environmental systems, the science of science, as well as general risk and crisis scenario forecasting up to and including policy advice.