Latent Code Identification (LACOID): A Machine Learning-Based Integrative Framework [and Open-Source Software] to Classify Big Textual Data, Rebuild Contextualized/Unaltered Meanings, and Avoid Aggregation Bias

IF 3.9 | Tier 2, Sociology | Q1, Social Sciences, Interdisciplinary
Manuel S. González Canché
{"title":"潜在代码识别(LACOID):一个基于机器学习的集成框架[和开源软件],用于对大文本数据进行分类,重建上下文化/未改变的含义,并避免聚合偏差","authors":"Manuel S. González Canché","doi":"10.1177/16094069221144940","DOIUrl":null,"url":null,"abstract":"Labeling or classifying textual data and qualitative evidence is an expensive and consequential challenge. The rigor and consistency behind the construction of these labels ultimately shape research findings and conclusions. A multifaceted methodological conundrum to address this challenge is the need for human reasoning for classification that leads to deeper and more nuanced understandings; however, this same manual human classification comes with the well-documented increase in classification inconsistencies and errors, particularly when dealing with vast amounts of documents and teams of coders. An alternative to human coding consists of machine learning-assisted techniques. These data science and visualization techniques offer tools for data classification that are cost-effective and consistent but are prone to losing participants’ meanings or voices for two main reasons: (a) these classifications typically aggregate all texts configuring each input file (i.e., each interview transcript) into a single topic or code and (b) these words configuring texts are analyzed outside of their original contexts. To address this challenge and analytic conundrum, we present an analytic framework and software tool, that addresses the following question: How to classify vast amounts of qualitative evidence effectively and efficiently without losing context or the original voices of our research participants and while leveraging the nuances that human reasoning bring to the qualitative and mixed methods analytic tables? This framework mirrors the line-by-line coding employed in human/manual code identification but relying on machine learning to classify texts in minutes rather than months. The resulting outputs provide complete transparency of the classification process and aid to recreate the contextualized, original, and unaltered meanings embedded in the input documents, as provided by our participants. We offer access to the database ( González Canché, 2022e ) and software required ( González Canché, 2022a , Mac https://cutt.ly/jc7n3OT , and Windows https://cutt.ly/wc7nNKF ) to replicate the analyses. We hope this opportunity to become familiar with the analytic framework and software, may result in expanded access of data science tools to analyze qualitative evidence (see also González Canché 2022b , 2022c , 2022d , for related no-code data science applications to classify and analyze qualitative and textual data dynamically).","PeriodicalId":48220,"journal":{"name":"International Journal of Qualitative Methods","volume":null,"pages":null},"PeriodicalIF":3.9000,"publicationDate":"2023-01-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":"{\"title\":\"Latent Code Identification (LACOID): A Machine Learning-Based Integrative Framework [and Open-Source Software] to Classify Big Textual Data, Rebuild Contextualized/Unaltered Meanings, and Avoid Aggregation Bias\",\"authors\":\"Manuel S. González Canché\",\"doi\":\"10.1177/16094069221144940\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Labeling or classifying textual data and qualitative evidence is an expensive and consequential challenge. The rigor and consistency behind the construction of these labels ultimately shape research findings and conclusions. 
A multifaceted methodological conundrum to address this challenge is the need for human reasoning for classification that leads to deeper and more nuanced understandings; however, this same manual human classification comes with the well-documented increase in classification inconsistencies and errors, particularly when dealing with vast amounts of documents and teams of coders. An alternative to human coding consists of machine learning-assisted techniques. These data science and visualization techniques offer tools for data classification that are cost-effective and consistent but are prone to losing participants’ meanings or voices for two main reasons: (a) these classifications typically aggregate all texts configuring each input file (i.e., each interview transcript) into a single topic or code and (b) these words configuring texts are analyzed outside of their original contexts. To address this challenge and analytic conundrum, we present an analytic framework and software tool, that addresses the following question: How to classify vast amounts of qualitative evidence effectively and efficiently without losing context or the original voices of our research participants and while leveraging the nuances that human reasoning bring to the qualitative and mixed methods analytic tables? This framework mirrors the line-by-line coding employed in human/manual code identification but relying on machine learning to classify texts in minutes rather than months. The resulting outputs provide complete transparency of the classification process and aid to recreate the contextualized, original, and unaltered meanings embedded in the input documents, as provided by our participants. We offer access to the database ( González Canché, 2022e ) and software required ( González Canché, 2022a , Mac https://cutt.ly/jc7n3OT , and Windows https://cutt.ly/wc7nNKF ) to replicate the analyses. We hope this opportunity to become familiar with the analytic framework and software, may result in expanded access of data science tools to analyze qualitative evidence (see also González Canché 2022b , 2022c , 2022d , for related no-code data science applications to classify and analyze qualitative and textual data dynamically).\",\"PeriodicalId\":48220,\"journal\":{\"name\":\"International Journal of Qualitative Methods\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":3.9000,\"publicationDate\":\"2023-01-05\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"5\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"International Journal of Qualitative Methods\",\"FirstCategoryId\":\"90\",\"ListUrlMain\":\"https://doi.org/10.1177/16094069221144940\",\"RegionNum\":2,\"RegionCategory\":\"社会学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"SOCIAL SCIENCES, INTERDISCIPLINARY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Journal of Qualitative Methods","FirstCategoryId":"90","ListUrlMain":"https://doi.org/10.1177/16094069221144940","RegionNum":2,"RegionCategory":"社会学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"SOCIAL SCIENCES, INTERDISCIPLINARY","Score":null,"Total":0}
Citations: 5

Abstract

Labeling or classifying textual data and qualitative evidence is an expensive and consequential challenge. The rigor and consistency behind the construction of these labels ultimately shape research findings and conclusions. A multifaceted methodological conundrum in addressing this challenge is the need for human reasoning in classification, which leads to deeper and more nuanced understandings; however, this same manual human classification comes with a well-documented increase in classification inconsistencies and errors, particularly when dealing with vast amounts of documents and teams of coders. An alternative to human coding consists of machine learning-assisted techniques. These data science and visualization techniques offer tools for data classification that are cost-effective and consistent but are prone to losing participants' meanings or voices for two main reasons: (a) these classifications typically aggregate all texts configuring each input file (i.e., each interview transcript) into a single topic or code, and (b) the words configuring these texts are analyzed outside of their original contexts. To address this challenge and analytic conundrum, we present an analytic framework and software tool that addresses the following question: How can vast amounts of qualitative evidence be classified effectively and efficiently without losing context or the original voices of our research participants, while leveraging the nuances that human reasoning brings to the qualitative and mixed methods analytic tables? This framework mirrors the line-by-line coding employed in human/manual code identification but relies on machine learning to classify texts in minutes rather than months. The resulting outputs provide complete transparency of the classification process and help recreate the contextualized, original, and unaltered meanings embedded in the input documents, as provided by our participants. We offer access to the database (González Canché, 2022e) and the software required (González Canché, 2022a; Mac https://cutt.ly/jc7n3OT and Windows https://cutt.ly/wc7nNKF) to replicate the analyses. We hope that this opportunity to become familiar with the analytic framework and software may expand access to data science tools for analyzing qualitative evidence (see also González Canché 2022b, 2022c, 2022d, for related no-code data science applications to classify and analyze qualitative and textual data dynamically).
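
To make the segment-level idea the abstract describes concrete (classifying each line of a transcript rather than collapsing the whole file into a single topic, then mapping the codes back onto the document in its original order), the following is a minimal sketch. It is not the LACOID software: the transcripts/ folder, the number of latent codes, and the TF-IDF + NMF topic model are illustrative assumptions standing in for whatever classifier the framework actually uses.

```python
# Minimal sketch (not the LACOID implementation) of line-by-line classification:
# each transcript line receives its own code, and codes are written back in the
# original line order so the surrounding context is preserved.
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

N_CODES = 5  # assumed number of latent codes to recover

# 1. Read every transcript line by line, remembering where each line came from.
segments, origins = [], []
for path in sorted(Path("transcripts").glob("*.txt")):  # hypothetical folder
    for line_no, line in enumerate(path.read_text(encoding="utf-8").splitlines(), 1):
        if line.strip():
            segments.append(line.strip())
            origins.append((path.name, line_no))

# 2. Fit a simple topic model over individual segments (not whole files),
#    so each line, rather than each transcript, is assigned a code.
tfidf = TfidfVectorizer(stop_words="english", min_df=2)
X = tfidf.fit_transform(segments)
nmf = NMF(n_components=N_CODES, random_state=0)
codes = nmf.fit_transform(X).argmax(axis=1)

# 3. Rebuild the contextualized view: codes attached to lines in original order.
for (doc, line_no), code, text in zip(origins, codes, segments):
    print(f"{doc}:{line_no}\tcode_{code}\t{text}")
```

Contrasting this with a model fit on whole documents (one topic per transcript) makes the aggregation bias named in the title concrete: the document-level view erases the within-transcript variation that the line-level view preserves.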
Source journal: International Journal of Qualitative Methods (Social Sciences, Interdisciplinary)
CiteScore: 6.90 | Self-citation rate: 11.10% | Articles published: 139 | Review time: 12 weeks
About the journal: Impact Factor 5.4; ranked 5/110 in Social Sciences, Interdisciplinary (SSCI). Indexed in Clarivate Analytics' Social Science Citation Index, the Directory of Open Access Journals (DOAJ), and Scopus. Launched in 2002. Publication is subject to payment of an article processing charge (APC). International Journal of Qualitative Methods (IJQM) is a peer-reviewed open access journal that focuses on methodological advances, innovations, and insights in qualitative or mixed methods studies.