Maksim Koptelov, Jan Linck, Pierre Bisquert, Patrice Buche, Mathieu Roche
Title: AgriCode: Automated coding for qualitative research and its application to the valorization of agricultural residues
DOI: 10.1016/j.softx.2025.102258
Journal: SoftwareX, volume 31, Article 102258
Publication date: 2025-07-31
Publication type: Journal Article
Impact factor: 2.4
JCR: Q2 (Computer Science, Software Engineering)
Region: 4 (Computer Science)
Open access: No
Platform: Semantic Scholar
URL: https://www.sciencedirect.com/science/article/pii/S2352711025002250
Citations: 0
Abstract
AgriCode: Automated coding for qualitative research and its application to the valorization of agricultural residues
Qualitative research, widely employed across various academic fields, explores phenomena using non-numerical data, with a particular focus on understanding the meanings, experiences, and perspectives of participants. In contrast to other types of research, it seeks to answer how, where, what, when and why individuals behave or respond in certain ways toward specific issues or topics. Qualitative research involves collecting and analyzing textual data, with interviews playing a central role in gathering expert knowledge. An essential part of data analysis is coding, which uses a specially developed code system hierarchy to categorize and organize responses and to facilitate the retrieval of insights. Manual data coding is labor-intensive, and to automate this process we developed the AgriCode tool based on machine learning and manually annotated data. To address data scarcity and improve the prediction quality of our offline classifiers, we perform data augmentation using Retrieval-Augmented Generation (RAG), a state-of-the-art method originally designed for online Q&A systems. Our tool automates the coding of interview responses within the Horizon Europe Agriloop project, which focuses on agricultural waste in the food industry. AgriCode predicts a subset of a predefined code system hierarchy, assisting a human coder by accelerating the process and identifying errors in manual coding. Although initially designed for the valorization of agricultural residues, AgriCode’s methodology can be adapted for any qualitative research domain characterized by data scarcity and the need for automated textual analysis. To achieve this, responses from the first round of interviews must be manually annotated using a dedicated code system hierarchy. These annotated responses can then be used for fine-tuning the model, while the RAG method can be employed to address the lack of data for certain classes.
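The workflow described in the abstract can be illustrated with a minimal Python sketch. It does not reproduce AgriCode's classifiers or its RAG pipeline, which the abstract does not detail. It only shows, under stated assumptions, two bookkeeping steps around them: finding codes with too few annotated responses (candidates for augmentation) and listing the codes on which a manual and a predicted annotation disagree (candidates for human review). The hierarchy, the code names, and the functions `with_ancestors`, `scarce_codes`, and `disagreements` are all hypothetical.

```python
from collections import Counter

# Hypothetical code system hierarchy: child code -> parent code (None for roots).
PARENT = {
    "Barriers": None,
    "Barriers/Cost": "Barriers",
    "Barriers/Regulation": "Barriers",
    "Drivers": None,
    "Drivers/Market": "Drivers",
}


def with_ancestors(codes):
    """Expand a set of codes so that every ancestor in the hierarchy is included."""
    expanded = set()
    for code in codes:
        while code is not None:
            expanded.add(code)
            code = PARENT[code]
    return expanded


def scarce_codes(annotations, min_examples):
    """Return codes with fewer than min_examples annotated responses."""
    counts = Counter(
        code for codes in annotations for code in with_ancestors(codes)
    )
    return sorted(code for code in PARENT if counts[code] < min_examples)


def disagreements(manual, predicted):
    """Return codes on which manual and predicted annotations of one response differ."""
    return sorted(with_ancestors(manual) ^ with_ancestors(predicted))


# Illustrative data only: each set holds the codes assigned to one response.
annotations = [{"Barriers/Cost"}, {"Barriers/Cost"}, {"Drivers/Market"}]
print(scarce_codes(annotations, min_examples=2))
print(disagreements({"Barriers/Cost"}, {"Barriers/Regulation"}))
```

Expanding each annotation to include its ancestors is a design assumption made here so that counts and comparisons respect the hierarchy; a real code system may follow a different convention.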
About the journal:
SoftwareX aims to acknowledge the impact of software on today's research practice, and on new scientific discoveries in almost all research domains. SoftwareX also aims to stress the importance of the software developers who are, in part, responsible for this impact. To this end, SoftwareX aims to support publication of research software in such a way that: the software is given a stamp of scientific relevance, and provided with a peer-reviewed recognition of scientific impact; the software developers are given the credit they deserve; the software is citable, allowing traditional metrics of scientific excellence to apply; the academic career paths of software developers are supported rather than hindered; the software is publicly available for inspection, validation, and re-use. Above all, SoftwareX aims to inform researchers about software applications, tools and libraries with a (proven) potential to impact the process of scientific discovery in various domains. The journal is multidisciplinary and accepts submissions from within and across subject domains such as those represented within the broad thematic areas below: Mathematical and Physical Sciences; Environmental Sciences; Medical and Biological Sciences; Humanities, Arts and Social Sciences. Originating from these broad thematic areas, the journal also welcomes submissions of software that works in cross-cutting thematic areas, such as citizen science, cybersecurity, digital economy, energy, global resource stewardship, and health and wellbeing. SoftwareX specifically aims to accept submissions representing domain-independent software that may impact more than one research domain.