The Insight-Inference Loop: Efficient Text Classification via Natural Language Inference and Threshold-Tuning

IF 6.5 · Zone 2 (Sociology) · JCR Q1 (Social Sciences, Mathematical Methods)
Sandrine Chausson, Marion Fourcade, David J. Harding, Björn Ross, Grégory Renard
Sociological Methods & Research · Journal Article · Published 2025-04-19 · DOI: 10.1177/00491241251326819 · https://doi.org/10.1177/00491241251326819

Abstract

Modern computational text classification methods have brought social scientists tantalizingly close to the goal of unlocking vast insights buried in text data—from centuries of historical documents to streams of social media posts. Yet three barriers still stand in the way: the tedious labor of manual text annotation, the technical complexity that keeps these tools out of reach for many researchers, and, perhaps most critically, the challenge of bridging the gap between sophisticated algorithms and the deep theoretical understanding social scientists have already developed about human interactions, social structures, and institutions. To counter these limitations, we propose an approach to large-scale text analysis that requires substantially less human-labeled data, and no machine learning expertise, and efficiently integrates the social scientist into critical steps in the workflow. This approach, which allows the detection of statements in text, relies on large language models pre-trained for natural language inference, and a “few-shot” threshold-tuning algorithm rooted in active learning principles. We describe and showcase our approach by analyzing tweets collected during the 2020 U.S. presidential election campaign, and benchmark it against various computational approaches across three datasets.
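
The abstract does not spell out the threshold-tuning algorithm, so the following is only a generic sketch of the idea, not the authors' method. It assumes each text already has an entailment score in [0, 1] from an NLI model (with the statement of interest as the hypothesis), and that a researcher can label a handful of texts on request. The function names (`tune_threshold`, `fit_threshold`, `ask_label`), the accuracy objective, and the "query the text closest to the current threshold" rule (a simple form of uncertainty sampling) are all illustrative assumptions.

```python
from typing import Callable, Dict, List, Tuple


def n_correct(threshold: float, scores: List[float], labels: Dict[int, bool]) -> int:
    """Count labeled items classified correctly when 'positive' means score >= threshold."""
    return sum(1 for i, y in labels.items() if (scores[i] >= threshold) == y)


def fit_threshold(scores: List[float], labels: Dict[int, bool]) -> float:
    """Pick the cut point that classifies the most labeled items correctly.

    Candidates are the midpoints between neighbouring labeled scores, plus the
    two extremes 0.0 (everything positive) and 1.0 (assumed to mean 'nothing
    positive', which holds as long as no score is exactly 1.0).
    """
    labeled = sorted({scores[i] for i in labels})
    midpoints = [(a + b) / 2 for a, b in zip(labeled, labeled[1:])]
    candidates = [0.0] + midpoints + [1.0]
    return max(candidates, key=lambda t: n_correct(t, scores, labels))


def tune_threshold(
    scores: List[float],
    ask_label: Callable[[int], bool],
    budget: int,
    start: float = 0.5,
) -> Tuple[float, Dict[int, bool]]:
    """Spend a small labeling budget on the texts nearest the current threshold."""
    labels: Dict[int, bool] = {}
    threshold = start
    for _ in range(min(budget, len(scores))):
        unlabeled = [i for i in range(len(scores)) if i not in labels]
        # Uncertainty sampling: the most informative text is assumed to be the
        # one whose score lies closest to the current decision boundary.
        query = min(unlabeled, key=lambda i: abs(scores[i] - threshold))
        labels[query] = ask_label(query)
        threshold = fit_threshold(scores, labels)
    return threshold, labels


if __name__ == "__main__":
    # Hypothetical entailment scores and a stand-in for the human annotator.
    demo_scores = [0.05, 0.20, 0.30, 0.45, 0.62, 0.80, 0.95]
    demo_truth = [False, False, False, False, True, True, True]
    tuned, asked = tune_threshold(demo_scores, lambda i: demo_truth[i], budget=4)
    print(f"threshold={tuned:.3f}, labeled {len(asked)} of {len(demo_scores)} texts")
```

In this toy setup the annotator is consulted only `budget` times, and every remaining text is classified by comparing its score to the tuned threshold. A real workflow would likely need a noise-tolerant objective (for example F1 on the labeled sample) and a held-out check of the final threshold.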
Source journal metrics: CiteScore 16.30; self-citation rate 3.20%; articles published: 40
About the journal: Sociological Methods & Research is a quarterly journal devoted to sociology as a cumulative empirical science. The objectives of SMR are multiple, but emphasis is placed on articles that advance the understanding of the field through systematic presentations that clarify methodological problems and assist in ordering the known facts in an area. Review articles will be published, particularly those that emphasize a critical analysis of the status of the arts, but original presentations that are broadly based and provide new research will also be published. Intrinsically, SMR is viewed as a substantive journal, but one that is highly focused on the assessment of the scientific status of sociology. The scope is broad and flexible, and authors are invited to correspond with the editors about the appropriateness of their articles.