Leveraging Unstructured Text Data for Federated Instruction Tuning of Large Language Models
Rui Ye, Rui Ge, Yuchi Fengting, Jingyi Chai, Yanfeng Wang, Siheng Chen
arXiv - CS - Multiagent Systems · Published 2024-09-11 · DOI: arxiv-2409.07136
Abstract
Federated instruction tuning enables multiple clients to collaboratively
fine-tune a shared large language model (LLM) to follow human instructions
without directly sharing raw data. However, existing work impractically assumes
that every client already holds instruction-tuning data (i.e., structured
instruction-response pairs), which would require massive human annotation,
since clients' data is usually unstructured text.
To address this, we propose FedIT-U2S, a novel and flexible framework that
automatically transforms an unstructured corpus into structured data for
federated instruction tuning. FedIT-U2S consists of two key steps: (1) few-shot
instruction-tuning data generation, where each piece of unstructured data is
combined with several examples to prompt an LLM to generate an
instruction-response pair; and (2) a typical federated instruction tuning
process based on the generated data. To make step (1) more flexible, we propose
a retrieval-based example selection technique, in which the examples are
automatically selected based on the relatedness between the client's data piece
and the example pool, removing the need to determine the examples in advance.
Overall, FedIT-U2S can be applied to diverse scenarios as long as the client
holds a valuable text corpus, broadening the application scope of federated
instruction tuning. We conduct a series of experiments on three domains
(medicine, knowledge, and math), showing that FedIT-U2S consistently and
significantly improves over the base LLM.
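
The two steps described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the retriever, the prompt template, or the aggregation rule, so everything below is an assumption made for illustration: bag-of-words cosine similarity stands in for the relatedness measure, the prompt wording is invented, and a FedAvg-style weighted average stands in for the "typical" federated process. The names `select_examples`, `build_prompt`, `federated_average`, and `EXAMPLE_POOL` are hypothetical and do not come from the paper.

```python
import math
import re
from collections import Counter

# Hypothetical example pool; a real pool would hold curated instruction data.
EXAMPLE_POOL = [
    {"context": "Aspirin reduces fever and relieves mild pain.",
     "instruction": "What is aspirin used for?",
     "response": "It reduces fever and relieves mild pain."},
    {"context": "The sum of the angles in a triangle is 180 degrees.",
     "instruction": "What do the angles of a triangle add up to?",
     "response": "180 degrees."},
    {"context": "Paris is the capital of France.",
     "instruction": "What is the capital of France?",
     "response": "Paris."},
]


def bag_of_words(text):
    """Lowercased word counts; a stand-in for a real text encoder."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_examples(data_piece, example_pool, k=2):
    """Step 1a: pick the k pool examples most related to the client's text."""
    query = bag_of_words(data_piece)
    ranked = sorted(
        example_pool,
        key=lambda ex: cosine(query, bag_of_words(ex["context"])),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(data_piece, examples):
    """Step 1b: few-shot prompt asking an LLM for an instruction-response pair."""
    parts = ["Write an instruction-response pair grounded in the given text.\n"]
    for ex in examples:
        parts.append(
            f"Text: {ex['context']}\n"
            f"Instruction: {ex['instruction']}\n"
            f"Response: {ex['response']}\n"
        )
    parts.append(f"Text: {data_piece}\nInstruction:")
    return "\n".join(parts)


def federated_average(client_weights, client_sizes):
    """Step 2 (assumed FedAvg-style): average parameters weighted by data size."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]


if __name__ == "__main__":
    piece = "Ibuprofen relieves pain and lowers fever."
    shots = select_examples(piece, EXAMPLE_POOL, k=1)
    print(build_prompt(piece, shots))
    print(federated_average([[1.0, 2.0], [3.0, 4.0]], [1, 3]))
```

In this sketch each client would run `select_examples` and `build_prompt` locally on its own text, send the prompt to a generation LLM, fine-tune on the resulting pairs, and share only model parameters for aggregation. Real parameters would be tensors (for example adapter weights) rather than flat Python lists.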