Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources
Alisia Lupidi, Carlos Gemmell, Nicola Cancedda, Jane Dwivedi-Yu, Jason Weston, Jakob Foerster, Roberta Raileanu, Maria Lomeli
{"title":"Source2Synth:基于真实数据源的合成数据生成和整理","authors":"Alisia Lupidi, Carlos Gemmell, Nicola Cancedda, Jane Dwivedi-Yu, Jason Weston, Jakob Foerster, Roberta Raileanu, Maria Lomeli","doi":"arxiv-2409.08239","DOIUrl":null,"url":null,"abstract":"Large Language Models still struggle in challenging scenarios that leverage\nstructured data, complex reasoning, or tool usage. In this paper, we propose\nSource2Synth: a new method that can be used for teaching LLMs new skills\nwithout relying on costly human annotations. Source2Synth takes as input a\ncustom data source and produces synthetic data points with intermediate\nreasoning steps grounded in real-world sources. Source2Synth improves the\ndataset quality by discarding low-quality generations based on their\nanswerability. We demonstrate the generality of this approach by applying it to\ntwo challenging domains: we test reasoning abilities in multi-hop question\nanswering (MHQA), and tool usage in tabular question answering (TQA). Our\nmethod improves performance by 25.51% for TQA on WikiSQL and 22.57% for MHQA on\nHotPotQA compared to the fine-tuned baselines.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"57 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources\",\"authors\":\"Alisia Lupidi, Carlos Gemmell, Nicola Cancedda, Jane Dwivedi-Yu, Jason Weston, Jakob Foerster, Roberta Raileanu, Maria Lomeli\",\"doi\":\"arxiv-2409.08239\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Large Language Models still struggle in challenging scenarios that leverage\\nstructured data, complex reasoning, or tool usage. In this paper, we propose\\nSource2Synth: a new method that can be used for teaching LLMs new skills\\nwithout relying on costly human annotations. Source2Synth takes as input a\\ncustom data source and produces synthetic data points with intermediate\\nreasoning steps grounded in real-world sources. Source2Synth improves the\\ndataset quality by discarding low-quality generations based on their\\nanswerability. We demonstrate the generality of this approach by applying it to\\ntwo challenging domains: we test reasoning abilities in multi-hop question\\nanswering (MHQA), and tool usage in tabular question answering (TQA). 
Our\\nmethod improves performance by 25.51% for TQA on WikiSQL and 22.57% for MHQA on\\nHotPotQA compared to the fine-tuned baselines.\",\"PeriodicalId\":501030,\"journal\":{\"name\":\"arXiv - CS - Computation and Language\",\"volume\":\"57 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Computation and Language\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.08239\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Computation and Language","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.08239","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources
Large Language Models still struggle in challenging scenarios that leverage
structured data, complex reasoning, or tool usage. In this paper, we propose
Source2Synth: a new method that can be used for teaching LLMs new skills
without relying on costly human annotations. Source2Synth takes as input a
custom data source and produces synthetic data points with intermediate
reasoning steps grounded in real-world sources. Source2Synth improves the
dataset quality by discarding low-quality generations based on their
answerability. We demonstrate the generality of this approach by applying it to
two challenging domains: we test reasoning abilities in multi-hop question
answering (MHQA), and tool usage in tabular question answering (TQA). Our
method improves performance by 25.51% for TQA on WikiSQL and 22.57% for MHQA on
HotPotQA compared to the fine-tuned baselines.
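
The abstract describes a two-stage recipe: generate synthetic examples with intermediate reasoning grounded in a real data source, then curate the dataset by discarding generations that are not answerable. The sketch below is a minimal, hypothetical illustration of that loop, not the paper's implementation; the `SyntheticExample` fields, the `generate` and `answer_from_question` callables, and the agreement threshold are all assumptions introduced for clarity.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


# Hypothetical record: one synthetic data point grounded in a real source
# (e.g. a table row or a Wikipedia passage), with intermediate reasoning.
@dataclass
class SyntheticExample:
    source: str      # the real-world grounding text
    question: str    # generated question
    reasoning: str   # generated intermediate reasoning steps
    answer: str      # generated answer


def generate_examples(
    sources: Iterable[str],
    generate: Callable[[str], SyntheticExample],
) -> List[SyntheticExample]:
    """Generation stage: seed a generator model (assumed, passed in as
    `generate`) with each real data source to produce grounded examples."""
    return [generate(src) for src in sources]


def curate_by_answerability(
    examples: Iterable[SyntheticExample],
    answer_from_question: Callable[[str, str], str],
    n_attempts: int = 3,
    min_agreement: float = 1.0,
) -> List[SyntheticExample]:
    """Curation stage: keep an example only if a model given just
    (source, question) recovers the stored answer often enough;
    low-quality generations are discarded. Thresholds are assumptions."""
    kept = []
    for ex in examples:
        hits = sum(
            answer_from_question(ex.source, ex.question).strip().lower()
            == ex.answer.strip().lower()
            for _ in range(n_attempts)
        )
        if hits / n_attempts >= min_agreement:
            kept.append(ex)
    return kept
```

In this sketch the same filtering interface would apply to both settings named in the abstract: for MHQA the source would be a pair of linked passages, and for TQA a table plus an executable query, with answerability checked against the stored answer in either case.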