Staged Multi-Strategy Framework With Open-Source Large Language Models for Natural Language to SQL Generation

IF 1.0 | Zone 4, Engineering & Technology | Q4 ENGINEERING, ELECTRICAL & ELECTRONIC

IEEJ Transactions on Electrical and Electronic Engineering | Pub Date: 2025-01-10 | DOI: 10.1002/tee.24268

Chuanlong Liu, Wei Liao, Zhen Xu

Volume 20, Issue 7, pages 1056-1065

Citations: 0

Abstract

In natural language to SQL (NL2SQL), large pre-trained language models have driven significant progress. However, these models still generalize poorly, particularly open-source large language models (LLMs). In addition, most research overlooks how key column information and data table content affect query accuracy during SQL statement generation. In this paper, we propose a staged, multi-strategy framework called Key Columns and Table Contents (KCTC). The framework has two stages. First, it uses fixed prompt content to extract SQL key column information from the natural language question, including selected columns and conditioned columns, and formats the extracted column information. Second, it adds variable prompt content to guide the model in generating the SQL statement, and uses the content of the data table as a constraint to reduce the impact of erroneous condition values on the SQL statement. We conducted experiments on the Chinese dataset TableQA using several open-source LLMs. The results show that our method significantly improved the execution accuracy of SQL statements, with an average increase of 60.29% and a maximum accuracy of 91.22%, which validates the effectiveness of our approach. © 2025 Institute of Electrical Engineers of Japan and Wiley Periodicals LLC.
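
The two stages described in the abstract can be sketched roughly as follows. This is an illustrative Python sketch, not the authors' implementation: the prompt wording, the function names (`extract_key_columns`, `build_stage2_prompt`, `snap_to_cell`), the JSON output format, and the use of fuzzy string matching to constrain condition values are all assumptions, since the abstract does not specify them. The `llm` argument is a hypothetical callable that maps a prompt string to a response string.

```python
import difflib
import json
from typing import Callable, Dict, List

# Hypothetical fixed stage-1 prompt; the paper's actual wording is not given.
STAGE1_PROMPT = (
    "Given the table columns {columns} and the question below, return JSON "
    'of the form {{"select": [...], "where": [...]}} naming the key columns.\n'
    "Question: {question}"
)


def extract_key_columns(
    llm: Callable[[str], str], question: str, columns: List[str]
) -> Dict[str, List[str]]:
    """Stage 1 (assumed shape): fixed prompt in, formatted key columns out."""
    raw = llm(STAGE1_PROMPT.format(columns=columns, question=question))
    parsed = json.loads(raw)
    # Drop any column name the model produced that is not in the table.
    return {
        "select": [c for c in parsed.get("select", []) if c in columns],
        "where": [c for c in parsed.get("where", []) if c in columns],
    }


def build_stage2_prompt(
    question: str,
    table: str,
    rows: List[dict],
    key_cols: Dict[str, List[str]],
    max_values: int = 5,
) -> str:
    """Stage 2 (assumed shape): variable prompt built from the key columns
    and a sample of the table's actual cell values."""
    lines = [f"Table: {table}", f"Selected columns: {key_cols['select']}"]
    for col in key_cols["where"]:
        values = sorted({str(r[col]) for r in rows})[:max_values]
        lines.append(f"Condition column {col} has values such as: {values}")
    lines.append(f"Question: {question}")
    lines.append("Write one SQL query.")
    return "\n".join(lines)


def snap_to_cell(value: str, column: str, rows: List[dict]) -> str:
    """Replace a condition value with the closest actual cell value, if one
    is similar enough; otherwise return the value unchanged."""
    candidates = sorted({str(r[column]) for r in rows})
    match = difflib.get_close_matches(value, candidates, n=1, cutoff=0.6)
    return match[0] if match else value
```

In this sketch, a misspelled condition value such as "Bejing" would be replaced by the stored cell value "Beijing" before the query runs, which is one plausible way table contents could limit the effect of condition-value errors on execution accuracy.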

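Source journal

IEEJ Transactions on Electrical and Electronic Engineering (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 2.70

Self-citation rate: 10.00%

Articles published: 199

Review time: 4.3 months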

Journal description: IEEJ Transactions on Electrical and Electronic Engineering (hereinafter "TEEE") is published 6 times per year as an official journal of the Institute of Electrical Engineers of Japan (hereinafter "IEEJ"). This peer-reviewed journal contains original research papers and review articles on the most important and latest technological advances in core areas of electrical and electronic engineering and in related disciplines. The journal also publishes short communications reporting the results of the latest research activities. TEEE aims to provide a new forum for IEEJ members in Japan, as well as fellow researchers in electrical and electronic engineering from around the world, to exchange ideas and research findings.