Title: Pre-Trained Web Table Embeddings for Table Discovery
Authors: Michael Günther, Maik Thiele, Julius Gonsior, Wolfgang Lehner
Venue: Fourth Workshop in Exploiting AI Techniques for Data Management, 2021-06-20
DOI: https://doi.org/10.1145/3464509.3464892
Abstract: Pre-trained word embedding models have become the de-facto standard for modeling text in state-of-the-art analysis tools and frameworks. However, while massive amounts of textual data are stored in tables, word embedding models are usually pre-trained on large document corpora. This mismatch can degrade performance on tasks that analyze text values in tables. To improve analysis and retrieval tasks on tabular data, we propose a novel embedding technique that is pre-trained directly on a large Web table corpus. In an experimental evaluation, we employ our models for various data analysis tasks on different data sources. Our evaluation shows that models using pre-trained Web table embeddings outperform the same models applied to embeddings pre-trained on text. Moreover, we show that by using Web table embeddings, state-of-the-art models for the investigated tasks can be outperformed.
Title: Balancing Familiarity and Curiosity in Data Exploration with Deep Reinforcement Learning
Authors: Aurélien Personnaz, S. Amer-Yahia, Laure Berti-Équille, M. Fabricius, S. Subramanian
Venue: Fourth Workshop in Exploiting AI Techniques for Data Management, 2021-06-20
DOI: https://doi.org/10.1145/3464509.3464884
Abstract: The ability to find a set of records in Exploratory Data Analysis (EDA) hinges on the scattering of objects in the data set, on users' knowledge of the data, and on their ability to express their needs. This yields a wide range of EDA scenarios and solutions that differ in the guidance they provide to users. In this paper, we investigate the interplay between modeling curiosity and familiarity in Deep Reinforcement Learning (DRL) and expressive data exploration operators. We formalize curiosity as an intrinsic reward and familiarity as an extrinsic reward, and examine the behavior of several policies learned under different weights for those rewards. Our experiments on SDSS, a very large sky survey data set, provide several insights and justify the need for a deeper examination of combining DRL and data exploration operators that go beyond drill-downs and roll-ups.
{"title":"Leveraging Approximate Constraints for Localized Data Error Detection","authors":"Mohan Zhang, O. Schulte, Yudong Luo","doi":"10.1145/3464509.3464888","DOIUrl":"https://doi.org/10.1145/3464509.3464888","url":null,"abstract":"Error detection is key for data quality management. AI techniques can leverage user domain knowledge to identifying sets of erroneous records that conflict with domain knowledge. To represent a wide range of user domain knowledge, several recent papers have developed and utilized soft approximate constraints (ACs) that a data relation is expected to satisfy only to a certain degree, rather than completely. We introduce error localization, a new AI-based technique for enhancing error detection with ACs. Our starting observation is that approximate constraints are context-sensitive: the degree to which they are satisfied depends on the sub-population being considered. An error region is a subset of the data that violates an AC to a higher degree than the data as a whole, and is therefore more likely to contain erroneous records. For example, an error region may contain the set of records from before a certain year, or from a certain location. We describe an efficient optimization algorithm for error localization: identifying distinct error regions that violate a given AC the most, based on a recursive tree partitioning scheme. The tree representation describes different error regions in terms of data attributes that are easily interpreted by users (e.g., all records before 2003). This helps to explain to the user why some records were identified as likely errors. After identifying error regions, we apply error detection methods to each error region separately, rather than to the dataset as a whole. Our empirical evaluation, based on four datasets containing both real world and synthetic errors, shows that error localization increases both accuracy and speed of error detection based on ACs.","PeriodicalId":306522,"journal":{"name":"Fourth Workshop in Exploiting AI Techniques for Data Management","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115598972","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Tailored Regression for Learned Indexes: Logarithmic Error Regression","authors":"Martin Eppert, Philipp Fent, Thomas Neumann","doi":"10.1145/3464509.3464891","DOIUrl":"https://doi.org/10.1145/3464509.3464891","url":null,"abstract":"Although linear regressions are essential for learned index structures, most implementations use Simple Linear Regression, which optimizes the squared error. Since learned indexes use exponential search, regressions that optimize the logarithmic error are much better tailored for the use-case. By using this fitting optimization target, we can significantly improve learned index’s lookup performance with no architectural changes. While the log-error is harder to optimize, our novel algorithms and optimization heuristics can bring a practical performance improvement of the lookup latency. Even in cases where fast build times are paramount, log-error regressions still provide a robust fallback for degenerated leaf models. The resulting regressions are much better suited for learned indexes, and speed up lookups on data sets with outliers by over a factor of 2.","PeriodicalId":306522,"journal":{"name":"Fourth Workshop in Exploiting AI Techniques for Data Management","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129730166","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RUSLI: Real-time Updatable Spline Learned Index","authors":"Mayank Mishra, Rekha Singhal","doi":"10.1145/3464509.3464886","DOIUrl":"https://doi.org/10.1145/3464509.3464886","url":null,"abstract":"Machine learning algorithms have accelerated data access through ‘learned index’, where a set of data items is indexed by a model learned on the pairs of data key and the corresponding record’s position in the memory. Most of the learned indexes require retraining of the model for new data insertions in the data set. The retraining is expensive and takes as much time as the model training. So, today, learned indexes are updated by retraining on batch inserts to amortize the cost. However, real-time applications, such as data-driven recommendation applications need to access users’ feature store in real-time both for reading data of existing users and adding new users as well. This motivates us to present a real-time updatable spline learned index, RUSLI, by learning the distribution of data keys with their positions in memory through splines. We have extended RadixSpline [8] to build the updatable learned index while supporting real-time inserts in a data set without affecting the lookup time on the updated data set. We have shown that RUSLI can update the index in constant time with an additional temporary memory of size proportional to the number of splines. We have discussed how to reduce the size of the presented index using the distribution of spline keys while building the radix table. RULSI is shown to incur 270ns for lookup and 50ns for insert operations. Further, we have shown that RUSLI supports concurrent lookup and insert operations with a throughput of 40 million ops/sec. We have presented and discussed performance numbers of RUSLI for single and concurrent inserts, lookup, and range queries on SOSD [9] benchmark.","PeriodicalId":306522,"journal":{"name":"Fourth Workshop in Exploiting AI Techniques for Data Management","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130839177","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"LEA: A Learned Encoding Advisor for Column Stores","authors":"Lujing Cen, Andreas Kipf, Ryan Marcus, Tim Kraska","doi":"10.1145/3464509.3464885","DOIUrl":"https://doi.org/10.1145/3464509.3464885","url":null,"abstract":"Data warehouses organize data in a columnar format to enable faster scans and better compression. Modern systems offer a variety of column encodings that can reduce storage footprint and improve query performance. Selecting a good encoding scheme for a particular column is an optimization problem that depends on the data, the query workload, and the underlying hardware. We introduce Learned Encoding Advisor (LEA), a learned approach to column encoding selection. LEA is trained on synthetic datasets with various distributions on the target system. Once trained, LEA uses sample data and statistics (such as cardinality) from the user’s database to predict the optimal column encodings. LEA can optimize for encoded size, query performance, or a combination of the two. Compared to the heuristic-based encoding advisor of a commercial column store on TPC-H, LEA achieves 19% lower query latency while using 26% less space.","PeriodicalId":306522,"journal":{"name":"Fourth Workshop in Exploiting AI Techniques for Data Management","volume":"77 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-05-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124956874","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}