"Question answering via web extracted tables"
Bhavya Karki, Fan Hu, Nithin Haridas, S. Barot, Zihua Liu, Lucile Callebert, Matthias Grabmair, A. Tomasic
DOI: https://doi.org/10.1145/3329859.3329879 (2019)
Abstract: Question answering (QA) systems can answer a wide range of questions but remain limited in the complexity of their reasoning and the breadth of the data sources they can access. In this paper, we describe a dataset and baseline results for a question answering system that uses tables extracted from the web. The dataset is derived from commonly asked questions on the web and their corresponding answers found in website tables. It is novel in that every question is paired with a table of a different signature, so learning must generalize across domains. Each QA training instance comprises a table, a natural language question, and a corresponding structured SQL query. We build our model by dividing question answering into a sequence of tasks, including table retrieval and question element classification, and conduct experiments to measure the performance of each task. Following a traditional machine learning design, we extract features specific to each task, apply a neural model to each, and compose a full pipeline that constructs the SQL query from its parts. Our work provides quantitative results and error analysis for each task, and identifies in detail the reasoning required to generate SQL expressions from natural language questions. This analysis of the required reasoning informs future models based on neural machine learning.

"Scheduling OLTP transactions via learned abort prediction"
Yangjun Sheng, A. Tomasic, Tieying Zhang, Andrew Pavlo
DOI: https://doi.org/10.1145/3329859.3329871 (2019)
Abstract: Current main memory database system architectures are still challenged by high-contention workloads, and this challenge will continue to grow as the number of cores per processor increases [23]. These systems schedule transactions randomly across cores to maximize concurrency and to produce a uniform load; this scheduling never considers potential conflicts. Performance could be improved if scheduling balanced concurrency (to maximize throughput) against serializing conflicting transactions (to avoid aborts). In this paper, we present the design of several intelligent transaction scheduling algorithms that consider both potential transaction conflicts and concurrency. To reason about transaction conflicts, we develop a supervised machine learning model that estimates the probability of conflict and incorporate this model into several scheduling algorithms. In addition, we integrate an unsupervised machine learning algorithm into an intelligent scheduling algorithm. We then empirically measure the performance impact of the different scheduling algorithms on OLTP and social networking workloads. Our results show that, with appropriate settings, intelligent scheduling can increase throughput by 54% and reduce the abort rate by 80% on a 20-core machine, relative to random scheduling. In summary, the paper provides preliminary evidence that intelligent scheduling significantly improves DBMS performance.
{"title":"Learning to optimize federated queries","authors":"Liqi Xu, R. Cole, Daniel Ting","doi":"10.1145/3329859.3329873","DOIUrl":"https://doi.org/10.1145/3329859.3329873","url":null,"abstract":"Query optimization is challenging for any database system, even with a clear understanding of its inner workings. Consider then, query planning for a federation of third-party data sources where little detail is known. This is exactly the challenge of orchestrating data execution and movement faced by Tableau's cross-database joins feature, where the data of a query originates from two or more data sources. In this paper, we present our work on using machine learning techniques to address one of the most fundamental challenges in federated query optimization: the dynamic designation of a federation engine. Our machine learning model learns the performance and data characteristics of a system by extracting features from query plans. We further extend the ability of our model to manipulate database settings on a per query level. Our experimental results demonstrate that we can achieve a speedup of up to 10.7x compared to an existing federated query optimizer.","PeriodicalId":118194,"journal":{"name":"Proceedings of the Second International Workshop on Exploiting Artificial Intelligence Techniques for Data Management","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-07-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127701776","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

"Interpreting deep learning models for entity resolution: an experience report using LIME"
Vincenzo Di Cicco, D. Firmani, Nick Koudas, P. Merialdo, D. Srivastava
DOI: https://doi.org/10.1145/3329859.3329878 (2019)
Abstract: Entity resolution (ER) seeks to determine which records refer to the same entity (e.g., matching products sold on multiple websites). The sheer number of ways humans represent and misrepresent information about real-world entities makes ER a challenging problem. Deep learning (DL) has produced impressive results in natural language processing, so recent work has started exploring DL approaches to the ER problem, with encouraging results. However, we are still far from understanding why and when these approaches work in the ER setting. We are developing a methodology, Mojito, to produce explainable interpretations of the output of DL models for the ER task. Our methodology is based on LIME, a popular tool for explaining the predictions of generic classifiers. In this paper we report our first experiences interpreting recent DL models for the ER task. Our results demonstrate the importance of explanations in the DL space and suggest that, when assessing the performance of DL algorithms for ER, accuracy alone may not be sufficient to demonstrate generality and reproducibility in a production environment.
{"title":"Considerations for handling updates in learned index structures","authors":"A. Hadian, T. Heinis","doi":"10.1145/3329859.3329874","DOIUrl":"https://doi.org/10.1145/3329859.3329874","url":null,"abstract":"Machine learned models have recently been suggested as a rival for index structures such as B-trees and hash tables. An optimized learned index potentially has a significantly smaller memory footprint compared to its algorithmic counterparts, which alleviates the relatively high computational complexity of ML models. One unexplored aspect of learned index structures, however, is handling updates to the data and hence the model. In this paper we therefore discuss updates to the data and their implications for the model. Moreover, we suggest a method for eliminating the drift - the error of learned index models caused by the updates to the index- so that the learned model can maintain its performance under higher update rates.","PeriodicalId":118194,"journal":{"name":"Proceedings of the Second International Workshop on Exploiting Artificial Intelligence Techniques for Data Management","volume":"120 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-07-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116662927","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

"Cardinality estimation with local deep learning models"
Lucas Woltmann, Claudio Hartmann, Maik Thiele, Dirk Habich, Wolfgang Lehner
DOI: https://doi.org/10.1145/3329859.3329875 (2019)
Abstract: Cardinality estimation is a fundamental task in database query processing and optimization. Unfortunately, the accuracy of traditional estimation techniques is poor, resulting in non-optimal query execution plans. With the recent expansion of machine learning into the field of data management, there is a general expectation that learned models, especially neural networks, can deliver better estimation accuracy. Up to now, all proposed neural network approaches for cardinality estimation have followed a global approach that considers the whole database schema at once. These global models are prone to sparse training data, leading to misestimates for queries that were not represented in the sample space used to generate the training queries. To overcome this issue, we introduce a novel local approach in this paper, in which each local context is a specific sub-part of the schema. As we show, this leads to a better representation of data correlations and thus better estimation accuracy. Compared to global approaches, our approach achieves an improvement of two orders of magnitude in accuracy and a factor of four in training time for local models.
{"title":"Towards learning a partitioning advisor with deep reinforcement learning","authors":"Benjamin Hilprecht, Carsten Binnig, Uwe Röhm","doi":"10.1145/3329859.3329876","DOIUrl":"https://doi.org/10.1145/3329859.3329876","url":null,"abstract":"In this paper we introduce a partitioning advisor for analytical workloads based on Deep Reinforcement Learning. In contrast to existing approaches for automated partitioning design, an RL agent learns its decisions based on experience by trying out different partitionings and monitoring the rewards for different workloads. In our experimental evaluation with a distributed database and various complex schemata, we show that our learned partitioning advisor is thus not only able to find partitionings that outperform existing approaches for automated data partitioning but is also able to find non-obvious partitionings.","PeriodicalId":118194,"journal":{"name":"Proceedings of the Second International Workshop on Exploiting Artificial Intelligence Techniques for Data Management","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-07-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126639914","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Termite: a system for tunneling through heterogeneous data","authors":"R. Fernandez, S. Madden","doi":"10.1145/3329859.3329877","DOIUrl":"https://doi.org/10.1145/3329859.3329877","url":null,"abstract":"Data-driven analysis is important in virtually every modern organization. Yet, most data is underutilized because it remains locked in silos inside of organizations; large organizations have thousands of databases, and billions of files that are not integrated together in a single, queryable repository. Despite 40+ years of continuous effort by the database community, data integration still remains an open challenge. In this paper, we advocate a different approach: rather than trying to infer a common schema, we aim to find another common representation for diverse, heterogeneous data. Specifically, we argue for an embedding (i.e., a vector space) in which all entities, rows, columns, and paragraphs are represented as points. In the embedding, the distance between points indicates their degree of relatedness. We present Termite, a prototype we have built to learn the best embedding from the data. Because the best representation is learned, this allows Termite to avoid much of the human effort associated with traditional data integration tasks. On top of Termite, we have implemented a Termite-Join operator, which allows people to identify related concepts, even when these are stored in databases with different schemas and in unstructured data such as text files, webpages, etc. Finally, we show preliminary evaluation results of our prototype via a user study, and describe a list of future directions we have identified.","PeriodicalId":118194,"journal":{"name":"Proceedings of the Second International Workshop on Exploiting Artificial Intelligence Techniques for Data Management","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114161104","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Proceedings of the Second International Workshop on Exploiting Artificial Intelligence Techniques for Data Management","authors":"","doi":"10.1145/3329859","DOIUrl":"https://doi.org/10.1145/3329859","url":null,"abstract":"","PeriodicalId":118194,"journal":{"name":"Proceedings of the Second International Workshop on Exploiting Artificial Intelligence Techniques for Data Management","volume":"80 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121202920","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}