Next-Generation Database Interfaces: A Survey of LLM-Based Text-to-SQL

IF 10.4 · CAS Zone 2 (Computer Science) · JCR Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

IEEE Transactions on Knowledge and Data Engineering Pub Date : 2025-09-12 DOI:10.1109/TKDE.2025.3609486

Zijin Hong;Zheng Yuan;Qinggang Zhang;Hao Chen;Junnan Dong;Feiran Huang;Xiao Huang

IEEE Transactions on Knowledge and Data Engineering, vol. 37, no. 12, pp. 7328-7345. https://ieeexplore.ieee.org/document/11160657/

Citations: 0

Abstract

Generating accurate SQL from users’ natural language questions (text-to-SQL) remains a long-standing challenge due to the complexities involved in user question understanding, database schema comprehension, and SQL generation. Traditional text-to-SQL systems, which combine human engineering and deep neural networks, have made significant progress. Subsequently, pre-trained language models (PLMs) have been developed for text-to-SQL tasks, achieving promising results. However, as modern databases and user questions grow more complex, PLMs with a limited parameter size often produce incorrect SQL. Addressing this requires more sophisticated and tailored optimization methods, a requirement that restricts the application of PLM-based systems. Recently, large language models (LLMs) have shown significant capabilities in natural language understanding as model scale increases. Thus, integrating LLM-based solutions can bring unique opportunities and improvements to text-to-SQL research. In this survey, we provide a comprehensive review of existing LLM-based text-to-SQL studies. Specifically, we offer a brief overview of the technical challenges and evolutionary process of text-to-SQL. Next, we introduce the datasets and metrics designed to evaluate text-to-SQL systems. Subsequently, we present a systematic analysis of recent advances in LLM-based text-to-SQL. Finally, we summarize the survey, discuss the remaining challenges in this field, and outline expectations for future research directions.


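Source journal

IEEE Transactions on Knowledge and Data Engineering (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 11.70

Self-citation rate: 3.40%

Articles published: 515

Review time: 6 months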

Journal description: The IEEE Transactions on Knowledge and Data Engineering encompasses knowledge and data engineering aspects within computer science, artificial intelligence, electrical engineering, computer engineering, and related fields. It provides an interdisciplinary platform for disseminating new developments in knowledge and data engineering and explores the practicality of these concepts in both hardware and software. Specific areas covered include knowledge-based and expert systems, AI techniques for knowledge and data management, tools and methodologies, distributed processing, real-time systems, architectures, data management practices, database design, query languages, security, fault tolerance, statistical databases, algorithms, performance evaluation, and applications.