Technical Perspective: Natural Language to SQL Translation by Iteratively Exploring a Middle Ground

SIGMOD Rec. | Pub Date: 2016-06-02 | DOI: 10.1145/2949741.2949743

J. Naughton


Citations: 1

Abstract



A fundamental question in data management is how relational database management systems (RDBMSs) should be queried. Ideally, the query interface should be powerful enough to express arbitrary queries, yet simple enough to learn that users require virtually no training. Natural language is an obvious and appealing approach – presumably most users already know at least one natural language and use it to “query” other humans constantly. Unfortunately, employing natural language to query RDBMSs is highly nontrivial and, for the most part, is not done in practice. However, with the growing power and ubiquity of Natural Language Processing (NLP) systems, it makes sense to redouble efforts in applying NLP to database querying.

At the most basic level, relational database systems are queried using SQL. (For that matter, most “NoSQL” systems are also queried using SQL.) SQL is very powerful and precise, and, for novices, very hard to write. So SQL cannot be used as a user interface for anyone but power users. Nonetheless, as the most widely used RDBMS query language, SQL is the most natural language into which to translate natural language questions over relational data.

This translation is the focus of the following paper, “Understanding Natural Language Queries over Relational Databases”, by Li and Jagadish. The first important decision made by the authors of this paper is to reject a one-shot, one-way translation process from a natural language query to a corresponding SQL query. Instead, the authors advocate an iterative dialog between the person posing the query and the system building the relational query. This makes perfect sense – even in the much simpler world of keyword search systems, users iteratively refine their queries. Unfortunately, adopting this approach for RDBMS querying does not yield an easy problem – in fact, it uncovers a highly interesting and difficult challenge: how should the user and the system communicate in this iterative process?
Answering this question is difficult. Unlike the case for keyword search systems, the answer to the query may not help the user know if the executed query was what they really wanted. For example, consider the simple query “find the difference between sales this year and last year.” In general the RDBMS will return a number – and it is very hard to tell just from that number if the query was correct or not. It would be far more precise for the system to respond to the user by presenting the generated SQL query itself. But this would require the person posing the natural language query to be able to read and understand SQL, which contradicts a major motivation for the system in the first place.

Now we come to what is perhaps the heart of this paper: the decision to adopt an intermediate language the authors call “Query Tree,” a two-way, domain-independent communication model allowing the user and system to understand one another. A query tree aids in mapping a user query to its corresponding semantically correct SQL and in translating a query plan to its corresponding natural language interpretation. The authors harness schema knowledge, schema-driven similarity metrics, and query tree reformulation and ranking to make the problem tractable for the system and the user.

The authors close with a user study evaluating the approach. The user study itself is interesting, including the aspect of using Chinese to convey the queries to the subjects instead of English to avoid bias through the phrasing in the query description (presumably the subjects already spoke Chinese!). The experiments show that the approach is best for simple to medium complexity queries.

This paper represents a significant improvement in the state of the art, and it is an ideal springboard for future advances. In an area as difficult and important as natural language querying of relational database systems, this is indeed a major contribution.
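The point about the "difference between sales this year and last year" query can be made concrete with a small sketch. The example below is illustrative only and is not taken from the paper: the `sales` table, its columns, the sample rows, and the two candidate SQL readings are all assumptions. It uses Python's standard `sqlite3` module to show that two different interpretations of the same natural language question each return a single number, which by itself gives the user little basis for judging which query was actually run.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical `sales` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales (id INTEGER PRIMARY KEY, year INTEGER, amount REAL)"
    )
    # Made-up sample data: three sales in 2015, two sales in 2016.
    rows = [
        (2015, 100.0), (2015, 250.0), (2015, 50.0),
        (2016, 300.0), (2016, 200.0),
    ]
    conn.executemany("INSERT INTO sales (year, amount) VALUES (?, ?)", rows)
    return conn


# Reading 1 (assumed): "sales" means total revenue.
REVENUE_DIFF = """
SELECT (SELECT SUM(amount) FROM sales WHERE year = :this_year)
     - (SELECT SUM(amount) FROM sales WHERE year = :last_year)
"""

# Reading 2 (assumed): "sales" means the number of transactions.
COUNT_DIFF = """
SELECT (SELECT COUNT(*) FROM sales WHERE year = :this_year)
     - (SELECT COUNT(*) FROM sales WHERE year = :last_year)
"""


def run_scalar(conn: sqlite3.Connection, sql: str, this_year: int, last_year: int):
    """Run a query that yields one row with one column and return that value."""
    params = {"this_year": this_year, "last_year": last_year}
    return conn.execute(sql, params).fetchone()[0]


if __name__ == "__main__":
    conn = build_demo_db()
    # Each reading returns just one number; neither result explains itself.
    print("revenue difference:", run_scalar(conn, REVENUE_DIFF, 2016, 2015))
    print("count difference:  ", run_scalar(conn, COUNT_DIFF, 2016, 2015))
```

On this sample data the two readings even disagree in sign, yet a user who sees only one of the two numbers has no way to know which question the system answered. That is the gap an intermediate representation such as the query tree is meant to close.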


Source journal

SIGMOD Rec.

Self-citation rate

0.00%

Publication count