From Codebooks to Promptbooks: Extracting Information from Text with Generative Large Language Models

IF 6.5 | Zone 2 (Sociology) | Q1, SOCIAL SCIENCES, MATHEMATICAL METHODS
Oscar Stuhler, Cat Dang Ton, Etienne Ollion
Journal: Sociological Methods & Research
Publication date: 2025-06-25
Publication type: Journal Article
DOI: 10.1177/00491241251336794 (https://doi.org/10.1177/00491241251336794)
Citation count: 0

Abstract

Generative AI (GenAI) is quickly becoming a valuable tool for sociological research. Already, sociologists employ GenAI for tasks like classifying text and simulating human agents. We point to another major use case: the extraction of structured information from unstructured text. Information Extraction (IE) is an established branch of Natural Language Processing, but leveraging the affordances of this paradigm has thus far required familiarity with specialized models. GenAI changes this by allowing researchers to define their own IE tasks and execute them via targeted prompts. This article explores the potential of open-source large language models for IE by extracting and encoding biographical information (e.g., age, occupation, origin) from a corpus of newspaper obituaries. As we proceed, we discuss how sociologists can develop and evaluate prompt architectures for such tasks, turning codebooks into “promptbooks.” We also evaluate models of different sizes and prompting techniques. Our analysis showcases the potential of GenAI as a flexible and accessible tool for IE while also underscoring risks like non-random error patterns that can bias downstream analyses.
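
As a rough illustration of what turning a codebook into a "promptbook" could look like, the sketch below renders a small set of coding instructions into one extraction prompt and parses a JSON answer into a structured record. It is not the authors' implementation: the `Variable` class, the `PROMPTBOOK` entries, the prompt wording, the sample obituary, and the `fake_model` stand-in (used here instead of a real LLM call so the code runs offline) are all assumptions made for this example. Only the field names (age, occupation, origin) come from the abstract.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Variable:
    """One codebook entry: a field name plus the coding instruction for it."""
    name: str
    instruction: str


# Hypothetical promptbook; the field names follow the abstract's examples.
PROMPTBOOK = [
    Variable("age", "Age at death as an integer, or null if not stated."),
    Variable("occupation", "Main occupation as a short phrase, or null."),
    Variable("origin", "Place of birth or origin, or null."),
]


def build_prompt(text: str, variables: list[Variable]) -> str:
    """Render the promptbook and one obituary into a single extraction prompt."""
    rules = "\n".join(f'- "{v.name}": {v.instruction}' for v in variables)
    return (
        "Extract the following fields from the obituary below.\n"
        f"{rules}\n"
        "Answer with one JSON object containing exactly these keys.\n\n"
        f"Obituary:\n{text}"
    )


def parse_response(raw: str, variables: list[Variable]) -> dict:
    """Keep only the expected keys; invalid JSON or missing keys yield None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}
    return {v.name: data.get(v.name) for v in variables}


def fake_model(prompt: str) -> str:
    """Stand-in for a real LLM call (an assumption, so the sketch runs offline)."""
    return '{"age": 84, "occupation": "schoolteacher", "origin": "Lyon", "extra": 1}'


# Invented sample text, not taken from the paper's corpus.
obituary = "Marie Durand, a schoolteacher born in Lyon, died on Tuesday at 84."
prompt = build_prompt(obituary, PROMPTBOOK)
record = parse_response(fake_model(prompt), PROMPTBOOK)
print(record)
```

Restricting the parsed record to the keys named in the promptbook means unexpected fields are dropped and malformed answers become explicit missing values, which makes it easier to audit the kind of non-random error patterns the abstract warns about.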
Source journal metrics: CiteScore 16.30; self-citation rate 3.20%; articles published: 40
Journal description: Sociological Methods & Research is a quarterly journal devoted to sociology as a cumulative empirical science. The objectives of SMR are multiple, but emphasis is placed on articles that advance the understanding of the field through systematic presentations that clarify methodological problems and assist in ordering the known facts in an area. Review articles will be published, particularly those that emphasize a critical analysis of the status of the arts, but original presentations that are broadly based and provide new research will also be published. Intrinsically, SMR is viewed as a substantive journal, but one that is highly focused on the assessment of the scientific status of sociology. The scope is broad and flexible, and authors are invited to correspond with the editors about the appropriateness of their articles.