Achille Fokoue, Srideepika Jayaraman, Elham Khabiri, Jeffrey O. Kephart, Yingjie Li, Dhruv Shah, Youssef Drissi, Fenno F. Heath III, Anu Bhamidipaty, Fateh A. Tipu, Robert J. Baseman
A System and Benchmark for LLM-based Q&A on Heterogeneous Data
arXiv - CS - Databases, 2024-09-09, https://doi.org/arxiv-2409.05735
Abstract
In many industrial settings, users wish to ask questions whose answers may be
found in structured data sources such as spreadsheets, databases, APIs, or
combinations thereof. Often, the user doesn't know how to identify or access
the right data source. This problem is compounded even further if multiple (and
potentially siloed) data sources must be assembled to derive the answer.
Recently, various Text-to-SQL applications that leverage Large Language Models
(LLMs) have addressed some of these problems by enabling users to ask questions
in natural language. However, these applications remain impractical in
realistic industrial settings because they fail to cope with the data source
heterogeneity that typifies such environments. In this paper, we address
heterogeneity by introducing the siwarex platform, which enables seamless
natural language access to both databases and APIs. To demonstrate the
effectiveness of siwarex, we extend the popular Spider dataset and benchmark by
replacing some of its tables with data retrieval APIs. We find that siwarex
copes well with data source heterogeneity. Our modified Spider
benchmark will soon be available to the research community.
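
To make the notion of data source heterogeneity concrete, the following minimal sketch shows the kind of question that arises once a table is replaced by a data retrieval API: part of the answer comes from a function call and part from a SQL query. This is an illustration only, not the siwarex implementation or the actual benchmark construction. The function `get_singers`, the table `singer_in_concert`, and all sample rows are hypothetical names and data chosen for the example.

```python
import sqlite3

# Hypothetical stand-in for a data retrieval API that takes the place of
# a database table. The rows are made-up sample data.
def get_singers(country=None):
    rows = [
        {"singer_id": 1, "name": "Ana", "country": "France"},
        {"singer_id": 2, "name": "Bo", "country": "Japan"},
    ]
    return [r for r in rows if country is None or r["country"] == country]

def concerts_per_singer(conn, country):
    """Answer 'how many concerts did each singer from <country> perform in?'
    by combining API results with a table that remains in the database."""
    counts = {}
    for singer in get_singers(country):
        (n,) = conn.execute(
            "SELECT COUNT(*) FROM singer_in_concert WHERE singer_id = ?",
            (singer["singer_id"],),
        ).fetchone()
        counts[singer["name"]] = n
    return counts

# An in-memory database holding the table that was NOT replaced by an API.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer_in_concert (concert_id INTEGER, singer_id INTEGER)")
conn.executemany(
    "INSERT INTO singer_in_concert VALUES (?, ?)",
    [(10, 1), (11, 1), (12, 2)],
)

print(concerts_per_singer(conn, "France"))  # {'Ana': 2}
```

A pure Text-to-SQL system would have to express this as a single join, which is no longer possible once the singer data sits behind an API; the answer has to be assembled from both kinds of source.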