Do AIs know what the most important issue is? Using language models to code open-text social survey responses at scale

Jonathan Mellon, J. Bailey, Ralph Scott, James Breckwoldt, Marta Miori, Phillip Schmedeman
Journal: Research & Politics
DOI: 10.1177/20531680241231468
Published: 2024-01-01
Citations: 1

Abstract

Can artificial intelligence accurately label open-text survey responses? We compare the accuracy of six large language models (LLMs) using a few-shot approach, three supervised learning algorithms (SVM, DistilRoBERTa, and a neural network trained on BERT embeddings), and a second human coder on the task of categorizing “most important issue” responses from the British Election Study Internet Panel into 50 categories. For the scenario where a researcher lacks existing training data, the accuracy of the highest-performing LLM (Claude-1.3: 93.9%) neared human performance (94.7%) and exceeded the highest-performing supervised approach trained on 1000 randomly sampled cases (neural network: 93.5%). In a scenario where previous data has been labeled but a researcher wants to label novel text, the best LLM’s (Claude-1.3: 80.9%) few-shot performance is only slightly behind the human (88.6%) and exceeds the best supervised model trained on 576,000 cases (DistilRoBERTa: 77.8%). PaLM-2, Llama-2, and the SVM all performed substantially worse than the best LLMs and supervised models across all metrics and scenarios. Our results suggest that LLMs may allow for greater use of open-ended survey questions in the future.
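The few-shot approach described above amounts to showing the LLM a handful of labeled responses before asking it to categorize a new one. A minimal sketch of how such a prompt might be assembled is below; the category names and example responses are illustrative assumptions, not the paper's actual 50-category British Election Study scheme or its exact prompt wording.

```python
# Illustrative few-shot prompt construction for coding "most important
# issue" (MII) survey responses. Categories and examples are hypothetical
# stand-ins for the paper's 50-category scheme.

FEW_SHOT_EXAMPLES = [
    ("the cost of living keeps going up", "Inflation/Cost of living"),
    ("getting a GP appointment is impossible", "NHS/Health"),
    ("small boats crossing the channel", "Immigration"),
]

CATEGORIES = ["Inflation/Cost of living", "NHS/Health", "Immigration", "Other"]

def build_prompt(response: str) -> str:
    """Assemble a few-shot classification prompt for one open-text response."""
    lines = [
        "Classify each survey response into exactly one category.",
        "Categories: " + "; ".join(CATEGORIES),
        "",
    ]
    # Each labeled example demonstrates the response -> category mapping.
    for text, label in FEW_SHOT_EXAMPLES:
        lines.append(f"Response: {text}\nCategory: {label}\n")
    # The unlabeled response is left open for the model to complete.
    lines.append(f"Response: {response}\nCategory:")
    return "\n".join(lines)

prompt = build_prompt("rents are far too high")
```

The resulting string would then be sent to an LLM API, whose single-label completion is compared against the human coder's label to compute the accuracy figures reported above.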