{"title":"Evaluating LLMs for Text-to-SQL Generation With Complex SQL Workload","authors":"Limin Ma, Ken Pu, Ying Zhu","doi":"arxiv-2407.19517","DOIUrl":null,"url":null,"abstract":"This study presents a comparative analysis of the a complex SQL benchmark,\nTPC-DS, with two existing text-to-SQL benchmarks, BIRD and Spider. Our findings\nreveal that TPC-DS queries exhibit a significantly higher level of structural\ncomplexity compared to the other two benchmarks. This underscores the need for\nmore intricate benchmarks to simulate realistic scenarios effectively. To\nfacilitate this comparison, we devised several measures of structural\ncomplexity and applied them across all three benchmarks. The results of this\nstudy can guide future research in the development of more sophisticated\ntext-to-SQL benchmarks. We utilized 11 distinct Language Models (LLMs) to generate SQL queries based\non the query descriptions provided by the TPC-DS benchmark. The prompt\nengineering process incorporated both the query description as outlined in the\nTPC-DS specification and the database schema of TPC-DS. Our findings indicate\nthat the current state-of-the-art generative AI models fall short in generating\naccurate decision-making queries. We conducted a comparison of the generated\nqueries with the TPC-DS gold standard queries using a series of fuzzy structure\nmatching techniques based on query features. The results demonstrated that the\naccuracy of the generated queries is insufficient for practical real-world\napplication.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Databases","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2407.19517","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
This study presents a comparative analysis of a complex SQL benchmark, TPC-DS, with two existing text-to-SQL benchmarks, BIRD and Spider. Our findings reveal that TPC-DS queries exhibit a significantly higher level of structural complexity than the other two benchmarks, underscoring the need for more intricate benchmarks that effectively simulate realistic scenarios. To facilitate this comparison, we devised several measures of structural complexity and applied them across all three benchmarks. The results of this study can guide future research on the development of more sophisticated text-to-SQL benchmarks.
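The abstract does not enumerate the structural complexity measures used. The following is a minimal sketch of one plausible approach, assuming AST-based feature counting with the open-source sqlglot parser; the feature set chosen here (joins, subqueries, CTEs, aggregates, set operations, window functions) is illustrative, not the paper's actual metric.

```python
# Sketch: count structural constructs in a SQL query via its parse tree.
# Assumption: sqlglot is used as the parser; the paper does not specify one.
import sqlglot
from sqlglot import exp

def structural_features(sql: str) -> dict:
    """Parse a query and count a few structural constructs."""
    tree = sqlglot.parse_one(sql)
    return {
        "joins": len(list(tree.find_all(exp.Join))),
        "subqueries": len(list(tree.find_all(exp.Subquery))),
        "ctes": len(list(tree.find_all(exp.CTE))),
        "aggregates": len(list(tree.find_all(exp.AggFunc))),
        "set_ops": len(list(tree.find_all(exp.Union, exp.Intersect, exp.Except))),
        "windows": len(list(tree.find_all(exp.Window))),
        "selects": len(list(tree.find_all(exp.Select))),
    }

if __name__ == "__main__":
    q = """
    WITH t AS (SELECT store_id, SUM(sales) AS s FROM store_sales GROUP BY store_id)
    SELECT store_id FROM t WHERE s > (SELECT AVG(s) FROM t)
    """
    print(structural_features(q))
```

Under such a scheme, TPC-DS queries would score higher than typical Spider or BIRD queries simply by containing more CTEs, nested subqueries, and window functions per query.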
We used 11 distinct Large Language Models (LLMs) to generate SQL queries from the query descriptions provided by the TPC-DS benchmark. The prompt engineering process incorporated both the query description, as outlined in the TPC-DS specification, and the TPC-DS database schema. Our findings indicate that current state-of-the-art generative AI models fall short of generating accurate decision-making queries.
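A minimal sketch of the described prompt assembly, assuming a plain-text template: the exact wording, schema serialization, and models used by the authors are not given in the abstract, so every name below is hypothetical.

```python
# Hypothetical prompt construction: combine a TPC-DS business-question
# description with the database schema DDL in a single prompt.
def build_prompt(query_description: str, schema_ddl: str) -> str:
    """Assemble a text-to-SQL prompt from a natural-language description
    and the CREATE TABLE statements of the target schema."""
    return (
        "You are given the schema of a decision-support database.\n\n"
        f"Schema:\n{schema_ddl}\n\n"
        "Task: write a single SQL query that answers the following "
        f"business question.\n\n{query_description}\n\nSQL:"
    )

# Example usage with a fragment of the TPC-DS schema.
schema = """CREATE TABLE store_sales (ss_store_sk INT, ss_net_paid DECIMAL(7,2));
CREATE TABLE store (s_store_sk INT, s_state CHAR(2));"""
description = ("Report total net paid per state for stores, "
               "ordered by total descending.")
print(build_prompt(description, schema))
```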
We compared the generated queries with the TPC-DS gold-standard queries using a series of fuzzy structure-matching techniques based on query features. The results demonstrate that the accuracy of the generated queries is insufficient for practical real-world application.
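The fuzzy structure matching can be approximated by comparing feature-count vectors such as those produced by the complexity sketch above; the generalized Jaccard score below is one plausible instantiation, not necessarily the authors' technique.

```python
# Hypothetical fuzzy structure match: similarity of two feature-count dicts,
# e.g. the output of structural_features() for a generated and a gold query.
def feature_similarity(gen: dict, gold: dict) -> float:
    """Generalized Jaccard similarity over feature counts
    (1.0 = identical structure, 0.0 = no shared constructs)."""
    keys = set(gen) | set(gold)
    inter = sum(min(gen.get(k, 0), gold.get(k, 0)) for k in keys)
    union = sum(max(gen.get(k, 0), gold.get(k, 0)) for k in keys)
    return inter / union if union else 1.0

# Example: a generated query missing one join and the window function.
generated = {"joins": 3, "subqueries": 1, "aggregates": 2, "windows": 0}
gold      = {"joins": 4, "subqueries": 1, "aggregates": 2, "windows": 1}
print(f"structural similarity: {feature_similarity(generated, gold):.2f}")  # 0.75
```

A score threshold on such a similarity measure gives a graded notion of correctness that is more forgiving than exact string or exact-tree matching, which suits the long, multi-block queries found in TPC-DS.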