From Codebooks to Promptbooks: Extracting Information from Text with Generative Large Language Models

Impact factor: 6.5 | Zone 2 (Sociology) | JCR Q1: Social Sciences, Mathematical Methods

Sociological Methods & Research | Pub Date: 2025-06-25 | DOI: 10.1177/00491241251336794

Oscar Stuhler, Cat Dang Ton, Etienne Ollion



Abstract

Generative AI (GenAI) is quickly becoming a valuable tool for sociological research. Already, sociologists employ GenAI for tasks like classifying text and simulating human agents. We point to another major use case: the extraction of structured information from unstructured text. Information Extraction (IE) is an established branch of Natural Language Processing, but leveraging the affordances of this paradigm has thus far required familiarity with specialized models. GenAI changes this by allowing researchers to define their own IE tasks and execute them via targeted prompts. This article explores the potential of open-source large language models for IE by extracting and encoding biographical information (e.g., age, occupation, origin) from a corpus of newspaper obituaries. As we proceed, we discuss how sociologists can develop and evaluate prompt architectures for such tasks, turning codebooks into “promptbooks.” We also evaluate models of different sizes and prompting techniques. Our analysis showcases the potential of GenAI as a flexible and accessible tool for IE while also underscoring risks like non-random error patterns that can bias downstream analyses.
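As a rough illustration of what turning a codebook into a "promptbook" could look like in practice, the sketch below builds an extraction prompt from a list of variable definitions and parses a JSON reply. It is not the authors' implementation: the `Variable` class, the `PROMPTBOOK` entries, the prompt wording, and the `fake_model` stand-in are all assumptions made for this example. Only the variable names (age, occupation, origin) come from the abstract.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Sequence


@dataclass
class Variable:
    """One codebook entry: a variable name plus its coding instruction."""
    name: str
    instruction: str


# Hypothetical promptbook. The variable names follow the abstract's examples;
# the instruction wording is invented for illustration.
PROMPTBOOK = [
    Variable("age", "Age at death in years, as an integer."),
    Variable("occupation", "Main occupation of the deceased, as a short string."),
    Variable("origin", "Place of birth or origin, as a short string."),
]


def build_prompt(text: str, variables: Sequence[Variable]) -> str:
    """Render the codebook entries and the source text as one prompt."""
    lines = [
        "Extract the following fields from the obituary below.",
        "Answer with a single JSON object; use null when a field is not stated.",
        "",
    ]
    lines += [f"- {v.name}: {v.instruction}" for v in variables]
    lines += ["", "Obituary:", text]
    return "\n".join(lines)


def parse_response(raw: str, variables: Sequence[Variable]) -> Dict[str, Optional[object]]:
    """Keep only the expected keys; treat unparsable output as all-missing."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}
    return {v.name: data.get(v.name) for v in variables}


def extract(
    text: str,
    generate: Callable[[str], str],
    variables: Sequence[Variable] = PROMPTBOOK,
) -> Dict[str, Optional[object]]:
    """Run one extraction: build the prompt, call the model, parse the reply."""
    return parse_response(generate(build_prompt(text, variables)), variables)


def fake_model(prompt: str) -> str:
    # Stand-in for a real LLM call; it ignores the prompt and returns a canned reply.
    return '{"age": 84, "occupation": "teacher", "origin": null, "extra": "ignored"}'


record = extract("Jane Doe, a retired teacher, died on Monday at 84.", fake_model)
print(record)  # {'age': 84, 'occupation': 'teacher', 'origin': None}
```

Passing the model as a plain callable keeps the sketch independent of any particular model or serving library. Comparing records like `record` against hand-coded labels, field by field, is one way to look for the non-random error patterns the abstract warns about.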


Source journal

Sociological Methods & Research

CiteScore: 16.30

Self-citation rate: 3.20%

About the journal: Sociological Methods & Research is a quarterly journal devoted to sociology as a cumulative empirical science. The objectives of SMR are multiple, but emphasis is placed on articles that advance the understanding of the field through systematic presentations that clarify methodological problems and assist in ordering the known facts in an area. Review articles will be published, particularly those that emphasize a critical analysis of the status of the arts, but original presentations that are broadly based and provide new research will also be published. Intrinsically, SMR is viewed as a substantive journal, but one that is highly focused on the assessment of the scientific status of sociology. The scope is broad and flexible, and authors are invited to correspond with the editors about the appropriateness of their articles.