The use of large language models for qualitative research: The Deep Computational Text Analyser (DECOTA).

IF 7.8, CAS Region 1 (Psychology), JCR Q1, PSYCHOLOGY, MULTIDISCIPLINARY

Psychological Methods, Pub Date: 2025-04-07, DOI: 10.1037/met0000753

Lois Player, Ryan Hughes, Kaloyan Mitev, Lorraine Whitmarsh, Christina Demski, Nicholas Nash, Trisevgeni Papakonstantinou, Mark Wilson


Citations: 0

Abstract

Machine-assisted approaches for free-text analysis are rising in popularity, owing to a growing need to rapidly analyze large volumes of qualitative data. In both research and policy settings, these approaches have promise in providing timely insights into public perceptions and enabling policymakers to understand their community's needs. However, current approaches still require expert human interpretation, posing a financial and practical barrier for those outside of academia. For the first time, we propose and validate the Deep Computational Text Analyser (DECOTA), a novel machine learning methodology that automatically analyzes large free-text data sets and outputs concise themes. Building on structural topic modeling approaches, we used two fine-tuned large language models and sentence transformers to automatically derive "codes" and their corresponding "themes", as in inductive thematic analysis. To fully automate the process, we designed and validated a novel algorithm to choose the optimal number of "topics" for the structural topic modeling. DECOTA outputs key codes and themes, their prevalence, and how prevalence varies across covariates such as age and gender. Each code is accompanied by three representative quotes. Four data sets previously analyzed using thematic analysis were triangulated with DECOTA's codes and themes. We found that DECOTA is approximately 378 times faster and 1,920 times cheaper than human coding and consistently yields codes in agreement with or complementary to human coding (averaging 91.6% for codes and 90% for themes). The implications for evidence-based policy development, public engagement with policymaking, and psychometric measure development are discussed. (PsycInfo Database Record (c) 2025 APA, all rights reserved).
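To make the reported outputs concrete, the following is a minimal sketch of what "prevalence" and "prevalence across covariates" could look like once each response has already been assigned a code. It is not DECOTA's implementation: the abstract states that DECOTA builds on structural topic modeling, whereas this sketch uses plain counting. The data, the code labels, the age bands, and the function names `code_prevalence` and `prevalence_by_covariate` are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: (assigned code, age group). Not taken from the paper.
responses = [
    ("cost of living", "18-34"),
    ("cost of living", "35-54"),
    ("public transport", "18-34"),
    ("cost of living", "55+"),
    ("public transport", "55+"),
    ("green spaces", "35-54"),
]


def code_prevalence(rows):
    """Return each code's share of all responses."""
    counts = Counter(code for code, _ in rows)
    total = len(rows)
    return {code: n / total for code, n in counts.items()}


def prevalence_by_covariate(rows):
    """Return, per covariate level, each code's share of that level's responses."""
    by_level = defaultdict(Counter)
    for code, level in rows:
        by_level[level][code] += 1
    return {
        level: {code: n / sum(counts.values()) for code, n in counts.items()}
        for level, counts in by_level.items()
    }


overall = code_prevalence(responses)
by_age = prevalence_by_covariate(responses)
```

Here `overall` maps each code to its share of all responses, and `by_age` gives the same shares within each age band. A model-based approach such as structural topic modeling would instead estimate these relationships statistically rather than by raw counts.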




Source journal

Psychological Methods (PSYCHOLOGY, MULTIDISCIPLINARY)

CiteScore: 13.10

Self-citation rate: 7.10%

Articles published: 159

About the journal: Psychological Methods is devoted to the development and dissemination of methods for collecting, analyzing, understanding, and interpreting psychological data. Its purpose is the dissemination of innovations in research design, measurement, methodology, and quantitative and qualitative analysis to the psychological community; its further purpose is to promote effective communication about related substantive and methodological issues. The audience is expected to be diverse and to include those who develop new procedures, those who are responsible for undergraduate and graduate training in design, measurement, and statistics, as well as those who employ those procedures in research.