The Insight-Inference Loop: Efficient Text Classification via Natural Language Inference and Threshold-Tuning

IF 6.5, Zone 2 (Sociology), Q1 SOCIAL SCIENCES, MATHEMATICAL METHODS

Sociological Methods & Research. Pub Date: 2025-04-19. DOI: 10.1177/00491241251326819

Sandrine Chausson, Marion Fourcade, David J. Harding, Björn Ross, Grégory Renard



Abstract

Modern computational text classification methods have brought social scientists tantalizingly close to the goal of unlocking vast insights buried in text data, from centuries of historical documents to streams of social media posts. Yet three barriers still stand in the way: the tedious labor of manual text annotation, the technical complexity that keeps these tools out of reach for many researchers, and, perhaps most critically, the challenge of bridging the gap between sophisticated algorithms and the deep theoretical understanding social scientists have already developed about human interactions, social structures, and institutions. To counter these limitations, we propose an approach to large-scale text analysis that requires substantially less human-labeled data and no machine learning expertise, and that efficiently integrates the social scientist into critical steps in the workflow. This approach, which allows the detection of statements in text, relies on large language models pre-trained for natural language inference and on a “few-shot” threshold-tuning algorithm rooted in active learning principles. We describe and showcase our approach by analyzing tweets collected during the 2020 U.S. presidential election campaign, and we benchmark it against various computational approaches across three datasets.
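The abstract does not spell out the threshold-tuning algorithm, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that a natural language inference model has already assigned each text an entailment score in [0, 1] for a target statement; the scores below are made up for illustration, and the function names `f1_at_threshold`, `tune_threshold`, and `next_to_annotate` are hypothetical. The sketch picks the decision threshold that maximizes F1 on a handful of human-labeled texts, then suggests which unlabeled texts to annotate next by choosing those closest to the threshold, a simple uncertainty-sampling heuristic from active learning.

```python
from typing import List, Sequence


def f1_at_threshold(scores: Sequence[float], labels: Sequence[int], threshold: float) -> float:
    """F1 of the rule 'predict positive when score >= threshold'."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def tune_threshold(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Return the candidate threshold with the best F1 on the labeled texts.

    Candidates are the observed scores; ties go to the lowest threshold.
    """
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: f1_at_threshold(scores, labels, t))


def next_to_annotate(unlabeled_scores: Sequence[float], threshold: float, k: int = 1) -> List[int]:
    """Indices of the k unlabeled texts whose scores lie closest to the threshold."""
    order = sorted(
        range(len(unlabeled_scores)),
        key=lambda i: abs(unlabeled_scores[i] - threshold),
    )
    return order[:k]


# Hypothetical entailment scores for five hand-labeled texts (1 = statement present).
labeled_scores = [0.95, 0.80, 0.62, 0.40, 0.15]
labeled_truth = [1, 1, 0, 0, 0]
threshold = tune_threshold(labeled_scores, labeled_truth)

# Hypothetical scores for texts that have not been labeled yet.
unlabeled_scores = [0.99, 0.78, 0.30, 0.85]
to_label = next_to_annotate(unlabeled_scores, threshold, k=2)
print(threshold, to_label)
```

In a loop of this kind, the researcher would label the suggested texts, add them to the labeled set, and re-tune the threshold, so that annotation effort concentrates on the borderline cases that matter most for the decision boundary.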


Source journal

Sociological Methods & Research Multiple-

CiteScore

16.30

Self-citation rate

3.20%

About the journal: Sociological Methods & Research is a quarterly journal devoted to sociology as a cumulative empirical science. The objectives of SMR are multiple, but emphasis is placed on articles that advance the understanding of the field through systematic presentations that clarify methodological problems and assist in ordering the known facts in an area. Review articles will be published, particularly those that emphasize a critical analysis of the status of the arts, but original presentations that are broadly based and provide new research will also be published. Intrinsically, SMR is viewed as a substantive journal, but one that is highly focused on the assessment of the scientific status of sociology. The scope is broad and flexible, and authors are invited to correspond with the editors about the appropriateness of their articles.