Mohammadreza Pourreza, Ruoxi Sun, Hailong Li, Lesly Miculicich, Tomas Pfister, Sercan O. Arik
SQL-GEN: Bridging the Dialect Gap for Text-to-SQL Via Synthetic Data And Model Merging
arXiv - CS - Databases, published 2024-08-22, https://doi.org/arxiv-2408.12733
Abstract
Text-to-SQL systems, which convert natural language queries into SQL
commands, have seen significant progress primarily for the SQLite dialect.
However, adapting these systems to other SQL dialects like BigQuery and
PostgreSQL remains a challenge due to the diversity in SQL syntax and
functions. We introduce SQL-GEN, a framework for generating high-quality
dialect-specific synthetic data guided by dialect-specific tutorials, and
demonstrate its effectiveness in creating training datasets for multiple
dialects. Our approach significantly improves performance, by up to 20%, over
previous methods and reduces the gap with large-scale human-annotated datasets.
Moreover, combining our synthetic data with human-annotated data provides
additional performance boosts of 3.3% to 5.6%. We also introduce a novel
Mixture of Experts (MoE) initialization method that integrates dialect-specific
models into a unified system by merging self-attention layers and initializing
the gates with dialect-specific keywords, further enhancing performance across
different SQL dialects.
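
The MoE initialization described above can be pictured with a small toy sketch. The abstract does not state how the self-attention layers are merged or how keywords are turned into gate weights, so everything here is an assumption for illustration: merging is done by plain element-wise averaging, each gate row is the mean of toy keyword embeddings, and all function names, matrix shapes, and embedding values are hypothetical rather than taken from SQL-GEN.

```python
# Illustrative sketch only. Averaging as the merge rule and mean keyword
# embeddings as gate rows are assumptions; all names here are hypothetical.

def average_matrices(matrices):
    """Element-wise mean of equally shaped matrices (lists of lists)."""
    n = len(matrices)
    rows, cols = len(matrices[0]), len(matrices[0][0])
    return [[sum(m[r][c] for m in matrices) / n for c in range(cols)]
            for r in range(rows)]

def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def init_moe_layer(experts, keyword_embeddings):
    """Build one MoE layer from per-dialect models.

    experts: dialect -> {"attn": matrix, "ffn": matrix}
    keyword_embeddings: dialect -> list of embedding vectors
    """
    dialects = sorted(experts)
    return {
        "dialects": dialects,
        # One shared self-attention matrix, merged across dialect models.
        "shared_attn": average_matrices([experts[d]["attn"] for d in dialects]),
        # Feed-forward weights stay separate: one expert per dialect.
        "expert_ffn": {d: experts[d]["ffn"] for d in dialects},
        # One gate row per expert, initialized from that dialect's keywords.
        "gate": [mean_vector(keyword_embeddings[d]) for d in dialects],
    }

def route(layer, hidden):
    """Return the dialect whose gate row has the largest dot product with hidden."""
    scores = [sum(g * h for g, h in zip(row, hidden)) for row in layer["gate"]]
    return layer["dialects"][scores.index(max(scores))]

# Toy inputs: 2x2 attention matrices and 2-dimensional keyword embeddings.
experts = {
    "bigquery": {"attn": [[1.0, 0.0], [0.0, 1.0]], "ffn": [[2.0]]},
    "postgresql": {"attn": [[3.0, 0.0], [0.0, 3.0]], "ffn": [[4.0]]},
}
keyword_embeddings = {
    "bigquery": [[1.0, 0.0], [0.8, 0.2]],    # stand-ins for e.g. UNNEST, SAFE_CAST
    "postgresql": [[0.0, 1.0], [0.2, 0.8]],  # stand-ins for e.g. ILIKE, DISTINCT ON
}
layer = init_moe_layer(experts, keyword_embeddings)
```

With these toy values, a hidden state that points along the BigQuery keyword direction is routed to the BigQuery expert before any further training, which is the intuition behind seeding the gates with dialect-specific keywords instead of random weights.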