SQL-GEN: Bridging the Dialect Gap for Text-to-SQL Via Synthetic Data And Model Merging

arXiv - CS - Databases, Pub Date: 2024-08-22, DOI: arxiv-2408.12733

Mohammadreza Pourreza, Ruoxi Sun, Hailong Li, Lesly Miculicich, Tomas Pfister, Sercan O. Arik



Abstract



Text-to-SQL systems, which convert natural language queries into SQL commands, have seen significant progress primarily for the SQLite dialect. However, adapting these systems to other SQL dialects like BigQuery and PostgreSQL remains a challenge due to the diversity in SQL syntax and functions. We introduce SQL-GEN, a framework for generating high-quality dialect-specific synthetic data guided by dialect-specific tutorials, and demonstrate its effectiveness in creating training datasets for multiple dialects. Our approach significantly improves performance, by up to 20%, over previous methods and reduces the gap with large-scale human-annotated datasets. Moreover, combining our synthetic data with human-annotated data provides additional performance boosts of 3.3% to 5.6%. We also introduce a novel Mixture of Experts (MoE) initialization method that integrates dialect-specific models into a unified system by merging self-attention layers and initializing the gates with dialect-specific keywords, further enhancing performance across different SQL dialects.
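
The MoE initialization mentioned in the abstract can be illustrated with a toy sketch. The abstract does not specify how the self-attention layers are merged or how keywords are turned into gate weights, so everything below is an assumption: plain element-wise averaging stands in for the merge, each gate row is the mean of made-up keyword embeddings, and all names (`init_moe`, `route`, `experts`, `keyword_embeddings`) and numbers are hypothetical.

```python
import math

def average_matrices(mats):
    """Element-wise mean of same-shaped matrices given as lists of lists."""
    n = len(mats)
    rows, cols = len(mats[0]), len(mats[0][0])
    return [[sum(m[r][c] for m in mats) / n for c in range(cols)]
            for r in range(rows)]

def mean_vector(vectors):
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical dialect-specific models, each reduced to one attention
# matrix and one feed-forward (FFN) matrix. Values are made up.
experts = {
    "sqlite": {"attn": [[1.0, 0.0], [0.0, 1.0]],
               "ffn": [[2.0, 0.0], [0.0, 2.0]]},
    "postgresql": {"attn": [[3.0, 0.0], [0.0, 3.0]],
                   "ffn": [[0.0, 1.0], [1.0, 0.0]]},
}

# Made-up 2-d embeddings for dialect keywords (for example strftime and
# julianday for SQLite, ILIKE and EXTRACT for PostgreSQL).
keyword_embeddings = {
    "sqlite": [[1.0, 0.0], [0.8, 0.2]],
    "postgresql": [[0.0, 1.0], [0.2, 0.8]],
}

def init_moe(experts, keyword_embeddings):
    """Build one MoE layer: shared merged attention, per-dialect FFN experts,
    and a gate whose rows come from dialect keyword embeddings."""
    names = sorted(experts)
    return {
        "expert_names": names,
        # Assumption: merge attention by simple averaging.
        "attn": average_matrices([experts[n]["attn"] for n in names]),
        # FFN blocks are kept separate, one expert per dialect.
        "ffn_experts": [experts[n]["ffn"] for n in names],
        # Assumption: gate row = mean keyword embedding of that dialect.
        "gate": [mean_vector(keyword_embeddings[n]) for n in names],
    }

def route(moe, hidden):
    """Return routing weights per expert for one token's hidden state."""
    logits = [sum(g * h for g, h in zip(row, hidden)) for row in moe["gate"]]
    return dict(zip(moe["expert_names"], softmax(logits)))

moe = init_moe(experts, keyword_embeddings)
weights = route(moe, [0.9, 0.1])  # hidden state close to the SQLite keywords
print(weights)
```

In this sketch a hidden state that resembles the SQLite keyword embeddings receives a larger routing weight for the SQLite expert, which is the intuition behind seeding the gates with dialect keywords rather than random values.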
