Dictionary-based and machine learning classification approaches: a comparison for tonality and frame detection on Twitter data

Impact factor: 1.8 · JCR quartile: Q2 · Category: Political Science
M. Reveilhac, D. Morselli
Journal: Political Research Exchange · Published: 2022-02-01 · Type: Journal Article · DOI: https://doi.org/10.1080/2474736X.2022.2029217
Citations: 3

ABSTRACT Automated text analysis methods have made it possible to classify large corpora of text by measures such as frames and tonality, with a growing popularity in social, political and psychological science. These methods often demand a training dataset of sufficient size to generate accurate models that can be applied to unseen texts. In practice, however, there are no clear recommendations about how big the training samples should be. This issue becomes especially acute when dealing with texts skewed toward categories and when researchers cannot afford large samples of annotated texts. Leveraging on the case of support for democracy, we provide a guide to help researchers navigate decisions when producing measures of tonality and frames from a small sample of annotated social media posts. We find that supervised machine learning algorithms outperform dictionaries for tonality classification tasks. However, custom dictionaries are useful complements of these algorithms when identifying latent democracy dimensions in social media messages, especially as the method of elaborating these dictionaries is guided by word embedding techniques and human validation. Therefore, we provide easily implementable recommendations to increase estimation accuracy under non-optimal condition.
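To make the contrast in the abstract concrete, the following is a minimal sketch of the two families of approaches on a tonality task: a dictionary lookup and a small supervised classifier (here a multinomial Naive Bayes with add-one smoothing). The word lists, the four training posts, and all function names are hypothetical illustrations, not the authors' lexicon, data, or algorithms, and the toy sample is far too small to reproduce the paper's comparison.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy dictionary (assumption: not the lexicon used in the paper).
POSITIVE = {"support", "trust", "free", "fair"}
NEGATIVE = {"corrupt", "rigged", "broken", "fail"}


def tokenize(text):
    """Lowercase and split on whitespace; real tweets need richer cleaning."""
    return text.lower().split()


def dictionary_tonality(text):
    """Dictionary approach: count positive hits minus negative hits."""
    tokens = tokenize(text)
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def train_naive_bayes(examples):
    """Supervised approach: collect per-class word counts from labeled posts."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        class_counts[label] += 1
        for token in tokenize(text):
            word_counts[label][token] += 1
            vocab.add(token)
    return class_counts, word_counts, vocab


def predict_naive_bayes(model, text):
    """Return the class with the highest smoothed log-probability."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_logprob = None, -math.inf
    for label in sorted(class_counts):
        logprob = math.log(class_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for token in tokenize(text):
            if token in vocab:  # tokens unseen in training are ignored
                logprob += math.log((word_counts[label][token] + 1) / denom)
        if logprob > best_logprob:
            best_label, best_logprob = label, logprob
    return best_label


# Hypothetical annotated sample, standing in for a small hand-coded set of posts.
TRAIN = [
    ("we support free and fair elections", "positive"),
    ("i trust our democracy", "positive"),
    ("the system is rigged and corrupt", "negative"),
    ("democracy is broken here", "negative"),
]
MODEL = train_naive_bayes(TRAIN)

if __name__ == "__main__":
    for post in ["the elections are rigged", "i support democracy"]:
        print(post, dictionary_tonality(post), predict_naive_bayes(MODEL, post))
```

The dictionary returns "neutral" whenever no listed word appears, whereas the classifier always commits to one of the classes it was trained on; which behavior is preferable depends on the coding scheme and on how skewed the annotated sample is.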
Source journal: Political Research Exchange (Political Science)
CiteScore: 3.40
Self-citation rate: 0.00%
Articles published: 25
Review time: 39 weeks