Leveraging Unstructured Text Data for Federated Instruction Tuning of Large Language Models

arXiv - CS - Multiagent Systems Pub Date : 2024-09-11 DOI:arxiv-2409.07136

Rui Ye, Rui Ge, Yuchi Fengting, Jingyi Chai, Yanfeng Wang, Siheng Chen

{"title":"Leveraging Unstructured Text Data for Federated Instruction Tuning of Large Language Models","authors":"Rui Ye, Rui Ge, Yuchi Fengting, Jingyi Chai, Yanfeng Wang, Siheng Chen","doi":"arxiv-2409.07136","DOIUrl":null,"url":null,"abstract":"Federated instruction tuning enables multiple clients to collaboratively\nfine-tune a shared large language model (LLM) that can follow humans'\ninstructions without directly sharing raw data. However, existing literature\nimpractically requires that all the clients readily hold instruction-tuning\ndata (i.e., structured instruction-response pairs), which necessitates massive\nhuman annotations since clients' data is usually unstructured text instead.\nAddressing this, we propose a novel and flexible framework FedIT-U2S, which can\nautomatically transform unstructured corpus into structured data for federated\ninstruction tuning. FedIT-U2S consists two key steps: (1) few-shot\ninstruction-tuning data generation, where each unstructured data piece together\nwith several examples is combined to prompt an LLM in generating an\ninstruction-response pair. To further enhance the flexibility, a\nretrieval-based example selection technique is proposed, where the examples are\nautomatically selected based on the relatedness between the client's data piece\nand example pool, bypassing the need of determining examples in advance. (2) A\ntypical federated instruction tuning process based on the generated data.\nOverall, FedIT-U2S can be applied to diverse scenarios as long as the client\nholds valuable text corpus, broadening the application scope of federated\ninstruction tuning. We conduct a series of experiments on three domains\n(medicine, knowledge, and math), showing that our proposed FedIT-U2S can\nconsistently and significantly brings improvement over the base LLM.","PeriodicalId":501315,"journal":{"name":"arXiv - CS - Multiagent Systems","volume":"34 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Multiagent Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.07136","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

Abstract

Federated instruction tuning enables multiple clients to collaboratively fine-tune a shared large language model (LLM) that can follow humans' instructions without directly sharing raw data. However, existing literature impractically requires that all the clients readily hold instruction-tuning data (i.e., structured instruction-response pairs), which necessitates massive human annotations since clients' data is usually unstructured text instead. Addressing this, we propose a novel and flexible framework FedIT-U2S, which can automatically transform unstructured corpus into structured data for federated instruction tuning. FedIT-U2S consists two key steps: (1) few-shot instruction-tuning data generation, where each unstructured data piece together with several examples is combined to prompt an LLM in generating an instruction-response pair. To further enhance the flexibility, a retrieval-based example selection technique is proposed, where the examples are automatically selected based on the relatedness between the client's data piece and example pool, bypassing the need of determining examples in advance. (2) A typical federated instruction tuning process based on the generated data. Overall, FedIT-U2S can be applied to diverse scenarios as long as the client holds valuable text corpus, broadening the application scope of federated instruction tuning. We conduct a series of experiments on three domains (medicine, knowledge, and math), showing that our proposed FedIT-U2S can consistently and significantly brings improvement over the base LLM.

查看原文本刊更多论文

利用非结构化文本数据对大型语言模型进行联合教学调整

联合指令调谐使多个客户端能够协作精细调谐一个共享的大型语言模型（LLM），该模型能够遵循人类的指令，而无需直接共享原始数据。然而，现有文献实际上要求所有客户端都能随时掌握指令调谐数据（即结构化指令-响应对），这就需要大量的人工注释，因为客户端的数据通常是非结构化文本。FedIT-U2S 包括两个关键步骤：(1) 少量指令调整数据生成，将每个非结构化数据片段与多个示例结合起来，促使 LLM 生成指令-响应对。为了进一步提高灵活性，还提出了基于检索的示例选择技术，即根据客户数据片段与示例池之间的相关性自动选择示例，而无需事先确定示例。(总体而言，只要客户拥有有价值的文本语料库，FedIT-U2S就能应用于多种场景，拓宽了联合指令调优的应用范围。我们在三个领域（医学、知识和数学）进行了一系列实验，结果表明我们提出的 FedIT-U2S 取消了基础 LLM，并带来了显著的改进。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

arXiv - CS - Multiagent Systems

自引率

0.00%

发文量