{"title":"潜在代码识别(LACOID):一个基于机器学习的集成框架[和开源软件],用于对大文本数据进行分类,重建上下文化/未改变的含义,并避免聚合偏差","authors":"Manuel S. González Canché","doi":"10.1177/16094069221144940","DOIUrl":null,"url":null,"abstract":"Labeling or classifying textual data and qualitative evidence is an expensive and consequential challenge. The rigor and consistency behind the construction of these labels ultimately shape research findings and conclusions. A multifaceted methodological conundrum to address this challenge is the need for human reasoning for classification that leads to deeper and more nuanced understandings; however, this same manual human classification comes with the well-documented increase in classification inconsistencies and errors, particularly when dealing with vast amounts of documents and teams of coders. An alternative to human coding consists of machine learning-assisted techniques. These data science and visualization techniques offer tools for data classification that are cost-effective and consistent but are prone to losing participants’ meanings or voices for two main reasons: (a) these classifications typically aggregate all texts configuring each input file (i.e., each interview transcript) into a single topic or code and (b) these words configuring texts are analyzed outside of their original contexts. To address this challenge and analytic conundrum, we present an analytic framework and software tool, that addresses the following question: How to classify vast amounts of qualitative evidence effectively and efficiently without losing context or the original voices of our research participants and while leveraging the nuances that human reasoning bring to the qualitative and mixed methods analytic tables? This framework mirrors the line-by-line coding employed in human/manual code identification but relying on machine learning to classify texts in minutes rather than months. The resulting outputs provide complete transparency of the classification process and aid to recreate the contextualized, original, and unaltered meanings embedded in the input documents, as provided by our participants. We offer access to the database ( González Canché, 2022e ) and software required ( González Canché, 2022a , Mac https://cutt.ly/jc7n3OT , and Windows https://cutt.ly/wc7nNKF ) to replicate the analyses. We hope this opportunity to become familiar with the analytic framework and software, may result in expanded access of data science tools to analyze qualitative evidence (see also González Canché 2022b , 2022c , 2022d , for related no-code data science applications to classify and analyze qualitative and textual data dynamically).","PeriodicalId":48220,"journal":{"name":"International Journal of Qualitative Methods","volume":null,"pages":null},"PeriodicalIF":3.9000,"publicationDate":"2023-01-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":"{\"title\":\"Latent Code Identification (LACOID): A Machine Learning-Based Integrative Framework [and Open-Source Software] to Classify Big Textual Data, Rebuild Contextualized/Unaltered Meanings, and Avoid Aggregation Bias\",\"authors\":\"Manuel S. González Canché\",\"doi\":\"10.1177/16094069221144940\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Labeling or classifying textual data and qualitative evidence is an expensive and consequential challenge. The rigor and consistency behind the construction of these labels ultimately shape research findings and conclusions. 
A multifaceted methodological conundrum to address this challenge is the need for human reasoning for classification that leads to deeper and more nuanced understandings; however, this same manual human classification comes with the well-documented increase in classification inconsistencies and errors, particularly when dealing with vast amounts of documents and teams of coders. An alternative to human coding consists of machine learning-assisted techniques. These data science and visualization techniques offer tools for data classification that are cost-effective and consistent but are prone to losing participants’ meanings or voices for two main reasons: (a) these classifications typically aggregate all texts configuring each input file (i.e., each interview transcript) into a single topic or code and (b) these words configuring texts are analyzed outside of their original contexts. To address this challenge and analytic conundrum, we present an analytic framework and software tool, that addresses the following question: How to classify vast amounts of qualitative evidence effectively and efficiently without losing context or the original voices of our research participants and while leveraging the nuances that human reasoning bring to the qualitative and mixed methods analytic tables? This framework mirrors the line-by-line coding employed in human/manual code identification but relying on machine learning to classify texts in minutes rather than months. The resulting outputs provide complete transparency of the classification process and aid to recreate the contextualized, original, and unaltered meanings embedded in the input documents, as provided by our participants. We offer access to the database ( González Canché, 2022e ) and software required ( González Canché, 2022a , Mac https://cutt.ly/jc7n3OT , and Windows https://cutt.ly/wc7nNKF ) to replicate the analyses. We hope this opportunity to become familiar with the analytic framework and software, may result in expanded access of data science tools to analyze qualitative evidence (see also González Canché 2022b , 2022c , 2022d , for related no-code data science applications to classify and analyze qualitative and textual data dynamically).\",\"PeriodicalId\":48220,\"journal\":{\"name\":\"International Journal of Qualitative Methods\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":3.9000,\"publicationDate\":\"2023-01-05\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"5\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"International Journal of Qualitative Methods\",\"FirstCategoryId\":\"90\",\"ListUrlMain\":\"https://doi.org/10.1177/16094069221144940\",\"RegionNum\":2,\"RegionCategory\":\"社会学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"SOCIAL SCIENCES, INTERDISCIPLINARY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Journal of Qualitative Methods","FirstCategoryId":"90","ListUrlMain":"https://doi.org/10.1177/16094069221144940","RegionNum":2,"RegionCategory":"社会学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"SOCIAL SCIENCES, INTERDISCIPLINARY","Score":null,"Total":0}
Latent Code Identification (LACOID): A Machine Learning-Based Integrative Framework [and Open-Source Software] to Classify Big Textual Data, Rebuild Contextualized/Unaltered Meanings, and Avoid Aggregation Bias
Labeling or classifying textual data and qualitative evidence is an expensive and consequential challenge. The rigor and consistency behind the construction of these labels ultimately shape research findings and conclusions. A multifaceted methodological conundrum in addressing this challenge is that classification benefits from human reasoning, which leads to deeper and more nuanced understandings; however, manual human classification comes with well-documented increases in classification inconsistencies and errors, particularly when dealing with vast numbers of documents and teams of coders. An alternative to human coding consists of machine learning-assisted techniques. These data science and visualization techniques offer tools for data classification that are cost-effective and consistent but are prone to losing participants’ meanings or voices for two main reasons: (a) these classifications typically aggregate all the texts that make up each input file (i.e., each interview transcript) into a single topic or code, and (b) the words that make up these texts are analyzed outside of their original contexts. To address this challenge and analytic conundrum, we present an analytic framework and software tool that address the following question: How can vast amounts of qualitative evidence be classified effectively and efficiently without losing context or the original voices of our research participants, while leveraging the nuances that human reasoning brings to the qualitative and mixed methods analytic tables? This framework mirrors the line-by-line coding employed in human/manual code identification but relies on machine learning to classify texts in minutes rather than months. The resulting outputs provide complete transparency into the classification process and help recreate the contextualized, original, and unaltered meanings embedded in the input documents, as provided by our participants. We offer access to the database (González Canché, 2022e) and the software required (González Canché, 2022a; Mac https://cutt.ly/jc7n3OT and Windows https://cutt.ly/wc7nNKF) to replicate the analyses. We hope that this opportunity to become familiar with the analytic framework and software may expand access to data science tools for analyzing qualitative evidence (see also González Canché, 2022b, 2022c, 2022d, for related no-code data science applications that classify and analyze qualitative and textual data dynamically).
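The abstract hinges on line-by-line classification that preserves where each coded passage sits in its source transcript. As a minimal, hypothetical sketch of that general idea (not LACOID’s actual implementation, which the abstract does not detail), the Python snippet below assigns a latent code to every transcript line using a TF-IDF plus k-means pipeline and keeps line-level provenance so coded lines can be reread in their original order and context; the sample transcripts, the number of codes, and the choice of models are all assumptions for illustration.

```python
# Minimal, hypothetical sketch of line-level (rather than document-level)
# classification in the spirit of the framework described above. The sample
# transcripts, the number of codes (N_CODES), and the TF-IDF + k-means pipeline
# are illustrative assumptions, not LACOID's actual implementation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Hypothetical input: each transcript is a list of lines (participant utterances).
transcripts = {
    "participant_01": [
        "I felt supported by my advisor during the first year.",
        "Tuition costs forced me to take a second job.",
        "Working nights made it hard to attend study groups.",
    ],
    "participant_02": [
        "My advisor rarely answered emails.",
        "Financial aid covered most of my tuition.",
        "Peer study groups kept me motivated.",
    ],
}

# Flatten to the line level while keeping provenance (which transcript, which
# line number), so the original context can be rebuilt after classification.
records = [
    {"doc": doc_id, "line_no": i, "text": line}
    for doc_id, lines in transcripts.items()
    for i, line in enumerate(lines)
]

# Classify each line, not each whole transcript, to avoid aggregating an
# entire interview into a single topic or code.
N_CODES = 3  # assumed number of latent codes for this toy example
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform([r["text"] for r in records])
labels = KMeans(n_clusters=N_CODES, random_state=0, n_init=10).fit_predict(X)
for record, label in zip(records, labels):
    record["code"] = int(label)

# Reassemble lines in their original order, now tagged with a latent code,
# so every coded passage can still be read in its surrounding context.
for doc_id in transcripts:
    print(f"--- {doc_id} ---")
    for r in (r for r in records if r["doc"] == doc_id):
        print(f"[code {r['code']}] (line {r['line_no']}) {r['text']}")
```

The point of the sketch is structural rather than algorithmic: because classification happens at the line level and each record retains its document and line indices, nothing forces an entire interview into a single topic, and the tagged lines can be reassembled into the unaltered transcript.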
Journal description:
Journal Highlights
Impact Factor: 5.4; ranked 5/110 in Social Sciences, Interdisciplinary (SSCI)
Indexed In: Clarivate Analytics: Social Science Citation Index, the Directory of Open Access Journals (DOAJ), and Scopus
Launched In: 2002
Publication is subject to payment of an article processing charge (APC)
International Journal of Qualitative Methods (IJQM) is a peer-reviewed open access journal which focuses on methodological advances, innovations, and insights in qualitative or mixed methods studies. Please see the Aims and Scope tab for further information.