CatSQL: Towards Real World Natural Language to SQL Applications

Proc. VLDB Endow., pp. 1534-1547. Pub Date: 2023-02-01. DOI: 10.14778/3583140.3583165

Han Fu, Chang Liu, Bin Wu, Feifei Li, Jian Tan, Jianling Sun


Citation count: 4

Abstract

Natural language to SQL (NL2SQL) techniques provide a convenient interface for accessing databases, especially for non-expert users who want to conduct data analytics. Existing methods typically employ either a rule-based approach or a deep-learning-based solution. The former is hard to generalize across domains. The latter generalizes well, but it often produces queries with syntactic or semantic errors, which may not even be executable. In this work, we bridge the gap between the two and design a new framework that significantly improves both accuracy and runtime. In particular, we develop a novel CatSQL sketch, which constructs a template with slots that initially serve as placeholders, and tightly integrates with a deep learning model that fills these slots with meaningful contents based on the database schema. Compared with the widely used sequence-to-sequence approaches, our sketch-based method does not need to generate the keywords that are boilerplate in the template, and it achieves better accuracy and runs much faster. Compared with existing sketch-based approaches, our CatSQL sketch is more general and versatile, and it can leverage the values already filled into certain slots to derive the remaining ones for improved performance. In addition, we propose the Semantics Correction technique, which is the first to leverage database domain knowledge in a deep-learning-based NL2SQL solution. Semantics Correction is a post-processing routine that checks the initially generated SQL queries by applying rules to identify and correct semantic errors. This technique significantly improves NL2SQL accuracy. We conduct extensive evaluations on both single-domain and cross-domain benchmarks and demonstrate that our approach significantly outperforms previous ones in terms of both accuracy and throughput. In particular, on the state-of-the-art NL2SQL benchmark Spider, our CatSQL prototype outperforms the best of the previous solutions by 4 points on accuracy, while still achieving a throughput up to 63 times higher.
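
To make the two ideas in the abstract concrete, the following is a minimal toy sketch, not the CatSQL implementation. It assumes a single hand-written template whose SQL keywords are fixed and whose slots would, in the real system, be filled by a deep learning model; here the slot values are supplied by hand. The schema, the `Slots` class, and the single correction rule (every referenced column must belong to the table in the FROM clause) are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Set

# Hypothetical toy schema: table name -> column names.
SCHEMA: Dict[str, Set[str]] = {
    "singer": {"name", "age", "country"},
    "concert": {"concert_id", "year", "venue"},
}

# Toy sketch: the SQL keywords are fixed boilerplate; only the slots vary.
TEMPLATE = "SELECT {column} FROM {table} WHERE {cond_column} {op} {value}"


@dataclass
class Slots:
    """Slot values that a model would predict (assumed, not from the paper)."""
    column: str
    table: str
    cond_column: str
    op: str
    value: str


def correct_semantics(slots: Slots, schema: Dict[str, Set[str]]) -> Slots:
    """Toy post-processing rule: referenced columns must exist in the FROM table.

    If they do not, switch to the unique table containing all of them.
    """
    needed = {slots.column, slots.cond_column}
    if needed <= schema.get(slots.table, set()):
        return slots  # already consistent with the schema
    candidates = [t for t, cols in schema.items() if needed <= cols]
    if len(candidates) != 1:
        raise ValueError(f"cannot repair slots: {slots}")
    return Slots(slots.column, candidates[0], slots.cond_column, slots.op, slots.value)


def render(slots: Slots) -> str:
    """Fill the template; no SQL keyword is ever generated token by token."""
    return TEMPLATE.format(**vars(slots))


# A prediction with a semantic error: "name" and "age" are not in "concert".
predicted = Slots("name", "concert", "age", ">", "30")
print(render(correct_semantics(predicted, SCHEMA)))
# prints: SELECT name FROM singer WHERE age > 30
```

Because the keywords live in the template, only five slot values have to be predicted here, which hints at why a sketch-based method can avoid syntactic errors and run faster than generating the full query token by token. The correction step uses schema knowledge alone, independent of the model.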
