Automated generation of research workflows from academic papers: a full-text mining framework

IF 3.5 | CAS Zone 2 (Management) | JCR Q2: Computer Science, Interdisciplinary Applications

Journal of Informetrics | Pub Date: 2025-09-22 | DOI: 10.1016/j.joi.2025.101732

Heng Zhang, Chengzhi Zhang

Volume 19, Issue 4, Article 101732. Not open access. Full text: https://www.sciencedirect.com/science/article/pii/S175115772500094X

Citations: 0

Abstract

The automated generation of research workflows is essential for improving the reproducibility of research and accelerating the paradigm of “AI for Science”. However, existing methods typically extract merely fragmented procedural components and thus fail to capture complete research workflows. To address this gap, we propose an end-to-end framework that generates comprehensive, structured research workflows by mining full-text academic papers. As a case study in the Natural Language Processing (NLP) domain, our paragraph-centric approach first employs Positive-Unlabeled (PU) Learning with SciBERT to identify workflow-descriptive paragraphs, achieving an F₁-score of 0.9772. Subsequently, we utilize Flan-T5 with prompt learning to generate workflow phrases from these paragraphs, yielding ROUGE-1, ROUGE-2, and ROUGE-L scores of 0.4543, 0.2877, and 0.4427, respectively. These phrases are then systematically categorized into data preparation, data processing, and data analysis stages using ChatGPT with few-shot learning, achieving a classification precision of 0.958. By mapping categorized phrases to their locations in the documents, we finally generate readable visual flowcharts of the entire research workflows. This approach facilitates the analysis of workflows derived from an NLP corpus and reveals key methodological shifts over the past two decades, including the increasing emphasis on data analysis and the transition from feature engineering to ablation studies. Our work offers a validated technical framework for automated workflow generation, along with a novel, process-oriented perspective for the empirical investigation of evolving scientific paradigms. Source code and data are available at: https://github.com/ZH-heng/research_workflow.
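For readers unfamiliar with the ROUGE scores quoted above, the following is a minimal sketch of how ROUGE-1, ROUGE-2, and ROUGE-L can be computed for one generated workflow phrase against one reference phrase. It is an illustration only, not the authors' evaluation code: it assumes lowercased whitespace tokenization, no stemming, and the F-measure variant of each score, none of which the abstract specifies. The function names and the example phrases are hypothetical.

```python
from collections import Counter


def _ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def _f1(overlap, cand_total, ref_total):
    """F-measure from an overlap count and the two totals (0.0 if undefined)."""
    if overlap == 0 or cand_total == 0 or ref_total == 0:
        return 0.0
    precision = overlap / cand_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(candidate, reference, n):
    """ROUGE-N F-measure based on clipped n-gram overlap (assumed variant)."""
    cand = _ngrams(candidate.lower().split(), n)
    ref = _ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # Counter & keeps the minimum counts
    return _f1(overlap, sum(cand.values()), sum(ref.values()))


def _lcs_len(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l(candidate, reference):
    """ROUGE-L F-measure based on the longest common subsequence."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    return _f1(_lcs_len(cand, ref), len(cand), len(ref))


if __name__ == "__main__":
    # Hypothetical generated and reference workflow phrases.
    generated = "fine-tune SciBERT on labeled paragraphs"
    gold = "fine-tune SciBERT classifier on paragraphs"
    print(rouge_n(generated, gold, 1))
    print(rouge_n(generated, gold, 2))
    print(rouge_l(generated, gold))
```

In this toy pair, four of five unigrams match and the longest common subsequence also has length four, so ROUGE-1 and ROUGE-L are both about 0.8, while only one of four bigrams matches, giving a ROUGE-2 of 0.25. The same pattern, with ROUGE-2 well below ROUGE-1 and ROUGE-L, appears in the scores reported in the abstract.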


Source journal

Journal of Informetrics (Social Sciences: Library and Information Sciences)

CiteScore: 6.40

Self-citation rate: 16.20%


About the journal: Journal of Informetrics (JOI) publishes rigorous high-quality research on quantitative aspects of information science. The main focus of the journal is on topics in bibliometrics, scientometrics, webometrics, patentometrics, altmetrics and research evaluation. Contributions studying informetric problems using methods from other quantitative fields, such as mathematics, statistics, computer science, economics and econometrics, and network science, are especially encouraged. JOI publishes both theoretical and empirical work. In general, case studies, for instance a bibliometric analysis focusing on a specific research field or a specific country, are not considered suitable for publication in JOI, unless they contain innovative methodological elements.