SPSQL: Step-by-step Parsing Based Framework for Text-to-SQL Generation

2023 7th International Conference on Machine Vision and Information Technology (CMVIT). Pub Date: 2023-03-01. DOI: 10.1109/CMVIT57620.2023.00030

Ran Shen, Gang Sun, Hao Shen, Yiling Li, Liangfeng Jin, Han Jiang


Citations: 0


SPSQL: Step-by-step Parsing Based Framework for Text-to-SQL Generation

Converting natural-language text into Structured Query Language (Text2SQL) is an active research topic in natural language processing (NLP) with broad application prospects. In the era of big data, databases are used in every industry, and the data they hold is large in scale, diverse in type, and wide in scope. This makes querying cumbersome and inefficient and places higher demands on Text2SQL models. In practice, current mainstream end-to-end Text2SQL models are difficult to build because of their complex structure and heavy demand for training data, and difficult to tune because of their large number of parameters; their accuracy also struggles to reach the desired level. This paper therefore proposes a pipelined Text2SQL method, SPSQL. The method decomposes the Text2SQL task into four subtasks: table selection, column selection, SQL generation, and value filling. These are cast as a text classification problem, a sequence labeling problem, and two text generation problems, respectively. We then construct the data format for each subtask from existing data and improve the accuracy of the overall model by improving the accuracy of each submodel. We also use a named entity recognition module and data augmentation to optimize the overall model. The dataset is built from the marketing business data of the State Grid Corporation of China. Experiments show that the proposed method outperforms both the end-to-end method and the other pipeline methods it was compared with.
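The four-stage decomposition described in the abstract can be pictured with a small sketch. This is not the authors' implementation: the abstract does not specify the models, data formats, or placeholder scheme. Everything here is an assumption made for illustration, including the toy `SCHEMA`, the function names, the `[VALUE_i]` placeholder convention, the "first column is selected, the rest are filters" rule, and the keyword and regex heuristics that stand in for trained classification, sequence labeling, and generation models.

```python
import re
from typing import Dict, List

# Hypothetical toy schema, for illustration only (not the paper's dataset).
SCHEMA: Dict[str, List[str]] = {
    "customers": ["customer_id", "name", "region"],
    "bills": ["bill_id", "customer_id", "amount", "month"],
}


def tokenize(question: str) -> List[str]:
    """Lowercase the question and split it into word-like tokens."""
    return re.findall(r"[a-z0-9_-]+", question.lower())


def select_table(question: str) -> str:
    """Stage 1 (classification stand-in): pick the table with the most keyword hits."""
    tokens = set(tokenize(question))

    def score(table: str) -> int:
        return len(tokens & ({table} | set(SCHEMA[table])))

    return max(SCHEMA, key=score)


def select_columns(question: str, table: str) -> List[str]:
    """Stage 2 (sequence-labeling stand-in): tag tokens that name a column."""
    columns = set(SCHEMA[table])
    tagged = [tok for tok in tokenize(question) if tok in columns]
    return list(dict.fromkeys(tagged))  # drop repeats, keep question order


def generate_sql(table: str, columns: List[str]) -> str:
    """Stage 3 (generation stand-in): emit SQL with value placeholders.

    Assumed convention: the first tagged column is selected,
    the remaining tagged columns become equality filters.
    """
    if not columns:
        return f"SELECT * FROM {table}"
    target, filters = columns[0], columns[1:]
    sql = f"SELECT {target} FROM {table}"
    if filters:
        conditions = [f"{col} = [VALUE_{i}]" for i, col in enumerate(filters)]
        sql += " WHERE " + " AND ".join(conditions)
    return sql


def fill_values(question: str, sql: str) -> str:
    """Stage 4 (generation stand-in): replace placeholders with question values."""
    values = re.findall(r"\d{4}-\d{2}", question)  # toy extractor: YYYY-MM only
    for i, value in enumerate(values):
        sql = sql.replace(f"[VALUE_{i}]", f"'{value}'")
    return sql


def text_to_sql(question: str) -> str:
    """Run the four stages in order, each consuming the previous stage's output."""
    table = select_table(question)
    columns = select_columns(question, table)
    skeleton = generate_sql(table, columns)
    return fill_values(question, skeleton)


if __name__ == "__main__":
    print(text_to_sql("What is the amount of bills in month 2023-01?"))
```

Because each stage is a separate function with a narrow input and output, any one of them could be evaluated or replaced on its own, which is the property the abstract relies on when it says overall accuracy is improved by improving each submodel.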
