OSATG-GPT: Instruction-tuning large language models with open-source atomic tasks in Github

Fanyu Han, Li Ma, Fenglin Bi, Yantong Wang, Mingdong You, Wei Wang, Jiaheng Peng, Xiaoya Xia

Expert Systems with Applications, Volume 298, Article 129819. DOI: 10.1016/j.eswa.2025.129819. Published September 23, 2025.
Across numerous application scenarios in Natural Language Processing (NLP), Large Language Models (LLMs) have demonstrated exceptional capabilities in text comprehension and generation, and they show significant potential across interdisciplinary fields. However, their effectiveness is constrained by the distinctive characteristics of the open-source ecosystem. Developing an LLM tailored to the open-source ecosystem that generalizes across datasets and tasks is therefore an urgent research need. To address this challenge, this paper introduces open-source atomic tasks, defined as the intermediate tasks essential for solving complex objectives. These tasks are designed through strategies such as simplification, reversal, decomposition, and composition, enabling models to gradually acquire domain knowledge and understand task interdependencies. By integrating public resources with open-source atomic tasks, we construct OSE-Instruct, an instruction dataset for the open-source ecosystem. We first unify open-source atomic tasks within an instruction-tuning paradigm that reflects real-world developer behavior, and then develop OSATG-GPT at various parameter scales by fine-tuning the BLOOMZ backbone model on OSE-Instruct. This enables the model to learn fine-grained developer actions and the underlying task dependencies. Extensive experiments validate the effectiveness of OSATG-GPT against advanced LLMs with larger parameter scales, and highlight its advantages over GPT-4 on specific, complex open-source collaboration tasks.
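To make the instruction-tuning setup concrete, below is a minimal, hypothetical Python sketch of fine-tuning a small BLOOMZ checkpoint on a single OSE-Instruct-style record with Hugging Face Transformers. The record schema (instruction/input/output), the example atomic task, the checkpoint choice, and all hyperparameters are illustrative assumptions; the paper's actual dataset format and training recipe are not reproduced here.

from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments)

# Load a small BLOOMZ checkpoint; the paper fine-tunes BLOOMZ at various
# parameter scales, so the 560M variant merely stands in for illustration.
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloomz-560m")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloomz-560m")

# One hypothetical atomic-task record: issue triage (a complex objective)
# decomposed into an intermediate step, classifying the issue's intent.
records = [{
    "instruction": "Classify the intent of the following GitHub issue.",
    "input": "The build fails on Windows when the path contains spaces.",
    "output": "bug report",
}]

def to_features(rec):
    # Concatenate prompt and response into one causal-LM training string;
    # labels mirror input_ids so the loss covers the whole sequence.
    text = (f"{rec['instruction']}\n{rec['input']}\n"
            f"{rec['output']}{tokenizer.eos_token}")
    enc = tokenizer(text, truncation=True, max_length=512)
    enc["labels"] = enc["input_ids"].copy()
    return enc

train_ds = Dataset.from_list(records).map(
    to_features, remove_columns=["instruction", "input", "output"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="osatg-sketch",
                           per_device_train_batch_size=1,
                           num_train_epochs=1),
    train_dataset=train_ds,
)
trainer.train()

In the paper, OSE-Instruct aggregates many such records produced by the simplification, reversal, decomposition, and composition strategies; this sketch trains on one toy record only to show the data flow from instruction record to causal-LM fine-tuning.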
Journal overview:
Expert Systems With Applications is an international journal dedicated to the exchange of information on expert and intelligent systems used globally in industry, government, and universities. The journal emphasizes original papers covering the design, development, testing, implementation, and management of these systems, offering practical guidelines. It spans various sectors such as finance, engineering, marketing, law, project management, information management, medicine, and more. The journal also welcomes papers on multi-agent systems, knowledge management, neural networks, knowledge discovery, data mining, and other related areas, excluding applications to military/defense systems.