Bridging Qualitative Data Silos: The Potential of Reusing Codings Through Machine Learning Based Cross-Study Code Linking

IF 2.7 2区社会学 Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS

Social Science Computer Review Pub Date : 2023-11-13 DOI:10.1177/08944393231215459

Sergej Wildemann, Claudia Niederée, Erick Elejalde

{"title":"Bridging Qualitative Data Silos: The Potential of Reusing Codings Through Machine Learning Based Cross-Study Code Linking","authors":"Sergej Wildemann, Claudia Niederée, Erick Elejalde","doi":"10.1177/08944393231215459","DOIUrl":null,"url":null,"abstract":"For qualitative data analysis (QDA), researchers assign codes to text segments to arrange the information into topics or concepts. These annotations facilitate information retrieval and the identification of emerging patterns in unstructured data. However, this metadata is typically not published or reused after the research. Subsequent studies with similar research questions require a new definition of codes and do not benefit from other analysts’ experience. Machine learning (ML) based classification seeded with such data remains a challenging task due to the ambiguity of code definitions and the inherent subjectivity of the exercise. Previous attempts to support QDA using ML rely on linear models and only examined individual datasets that were either smaller or coded specifically for this purpose. However, we show that modern approaches effectively capture at least part of the codes’ semantics and may generalize to multiple studies. We analyze the performance of multiple classifiers across three large real-world datasets. Furthermore, we propose an ML-based approach to identify semantic relations of codes in different studies to show thematic faceting, enhance retrieval of related content, or bootstrap the coding process. These are encouraging results that suggest how analysts might benefit from prior interpretation efforts, potentially yielding new insights into qualitative data.","PeriodicalId":49509,"journal":{"name":"Social Science Computer Review","volume":"11 2","pages":"0"},"PeriodicalIF":2.7000,"publicationDate":"2023-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Social Science Computer Review","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1177/08944393231215459","RegionNum":2,"RegionCategory":"社会学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS","Score":null,"Total":0}

引用次数: 0

Abstract

For qualitative data analysis (QDA), researchers assign codes to text segments to arrange the information into topics or concepts. These annotations facilitate information retrieval and the identification of emerging patterns in unstructured data. However, this metadata is typically not published or reused after the research. Subsequent studies with similar research questions require a new definition of codes and do not benefit from other analysts’ experience. Machine learning (ML) based classification seeded with such data remains a challenging task due to the ambiguity of code definitions and the inherent subjectivity of the exercise. Previous attempts to support QDA using ML rely on linear models and only examined individual datasets that were either smaller or coded specifically for this purpose. However, we show that modern approaches effectively capture at least part of the codes’ semantics and may generalize to multiple studies. We analyze the performance of multiple classifiers across three large real-world datasets. Furthermore, we propose an ML-based approach to identify semantic relations of codes in different studies to show thematic faceting, enhance retrieval of related content, or bootstrap the coding process. These are encouraging results that suggest how analysts might benefit from prior interpretation efforts, potentially yielding new insights into qualitative data.

查看原文本刊更多论文

桥接定性数据孤岛:通过基于机器学习的交叉研究代码链接重用编码的潜力

在定性数据分析(QDA)中，研究人员为文本片段分配代码，以将信息排列成主题或概念。这些注释有助于信息检索和识别非结构化数据中出现的模式。然而，这些元数据通常不会在研究结束后发布或重用。后续的研究与类似的研究问题需要一个新的代码定义，并没有受益于其他分析师的经验。基于机器学习(ML)的分类基于这样的数据种子仍然是一个具有挑战性的任务，由于代码定义的模糊性和固有的主观性练习。以前使用ML支持QDA的尝试依赖于线性模型，并且只检查了较小或专门为此目的编码的单个数据集。然而，我们表明，现代方法有效地捕获至少部分代码的语义，并可以推广到多个研究。我们在三个大型真实数据集上分析了多个分类器的性能。此外，我们提出了一种基于机器学习的方法来识别不同研究中代码的语义关系，以显示主题面形，增强相关内容的检索，或引导编码过程。这些令人鼓舞的结果表明，分析师如何从先前的解释工作中受益，可能会对定性数据产生新的见解。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Social Science Computer Review 社会科学-计算机：跨学科应用

CiteScore

9.00

自引率

4.90%

发文量

审稿时长

>12 weeks

期刊介绍： Unique Scope Social Science Computer Review is an interdisciplinary journal covering social science instructional and research applications of computing, as well as societal impacts of informational technology. Topics included: artificial intelligence, business, computational social science theory, computer-assisted survey research, computer-based qualitative analysis, computer simulation, economic modeling, electronic modeling, electronic publishing, geographic information systems, instrumentation and research tools, public administration, social impacts of computing and telecommunications, software evaluation, world-wide web resources for social scientists. Interdisciplinary Nature Because the Uses and impacts of computing are interdisciplinary, so is Social Science Computer Review. The journal is of direct relevance to scholars and scientists in a wide variety of disciplines. In its pages you''ll find work in the following areas: sociology, anthropology, political science, economics, psychology, computer literacy, computer applications, and methodology.