Title: Pre-Trained Web Table Embeddings for Table Discovery
Authors: Michael Günther, Maik Thiele, Julius Gonsior, Wolfgang Lehner
Venue: Fourth Workshop in Exploiting AI Techniques for Data Management, 2021-06-20
DOI: https://doi.org/10.1145/3464509.3464892
Abstract: Pre-trained word embedding models have become the de-facto standard for modeling text in state-of-the-art analysis tools and frameworks. However, while massive amounts of textual data are stored in tables, word embedding models are usually pre-trained on large document corpora. This mismatch can degrade performance on tasks that analyze text values in tables. To improve analysis and retrieval tasks on tabular data, we propose a novel embedding technique that is pre-trained directly on a large Web table corpus. In an experimental evaluation, we employ our models for various data analysis tasks on different data sources. Our evaluation shows that models using pre-trained Web table embeddings outperform the same models applied to embeddings pre-trained on text. Moreover, we show that by using Web table embeddings, state-of-the-art models for the investigated tasks can be outperformed.
Title: Balancing Familiarity and Curiosity in Data Exploration with Deep Reinforcement Learning
Authors: Aurélien Personnaz, S. Amer-Yahia, Laure Berti-Équille, M. Fabricius, S. Subramanian
Venue: Fourth Workshop in Exploiting AI Techniques for Data Management, 2021-06-20
DOI: https://doi.org/10.1145/3464509.3464884
Abstract: The ability to find a set of records in Exploratory Data Analysis (EDA) hinges on the scattering of objects in the data set, on users' knowledge of the data, and on their ability to express their needs. This yields a wide range of EDA scenarios and solutions that differ in the guidance they provide to users. In this paper, we investigate the interplay between modeling curiosity and familiarity in Deep Reinforcement Learning (DRL) and expressive data exploration operators. We formalize curiosity as an intrinsic reward and familiarity as an extrinsic reward, and examine the behavior of several policies learned under different weights for those rewards. Our experiments on SDSS, a very large sky survey data set, provide several insights and justify the need for a deeper examination of combining DRL and data exploration operators that go beyond drill-downs and roll-ups.
{"title":"Leveraging Approximate Constraints for Localized Data Error Detection","authors":"Mohan Zhang, O. Schulte, Yudong Luo","doi":"10.1145/3464509.3464888","DOIUrl":"https://doi.org/10.1145/3464509.3464888","url":null,"abstract":"Error detection is key for data quality management. AI techniques can leverage user domain knowledge to identifying sets of erroneous records that conflict with domain knowledge. To represent a wide range of user domain knowledge, several recent papers have developed and utilized soft approximate constraints (ACs) that a data relation is expected to satisfy only to a certain degree, rather than completely. We introduce error localization, a new AI-based technique for enhancing error detection with ACs. Our starting observation is that approximate constraints are context-sensitive: the degree to which they are satisfied depends on the sub-population being considered. An error region is a subset of the data that violates an AC to a higher degree than the data as a whole, and is therefore more likely to contain erroneous records. For example, an error region may contain the set of records from before a certain year, or from a certain location. We describe an efficient optimization algorithm for error localization: identifying distinct error regions that violate a given AC the most, based on a recursive tree partitioning scheme. The tree representation describes different error regions in terms of data attributes that are easily interpreted by users (e.g., all records before 2003). This helps to explain to the user why some records were identified as likely errors. After identifying error regions, we apply error detection methods to each error region separately, rather than to the dataset as a whole. Our empirical evaluation, based on four datasets containing both real world and synthetic errors, shows that error localization increases both accuracy and speed of error detection based on ACs.","PeriodicalId":306522,"journal":{"name":"Fourth Workshop in Exploiting AI Techniques for Data Management","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115598972","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Tailored Regression for Learned Indexes: Logarithmic Error Regression","authors":"Martin Eppert, Philipp Fent, Thomas Neumann","doi":"10.1145/3464509.3464891","DOIUrl":"https://doi.org/10.1145/3464509.3464891","url":null,"abstract":"Although linear regressions are essential for learned index structures, most implementations use Simple Linear Regression, which optimizes the squared error. Since learned indexes use exponential search, regressions that optimize the logarithmic error are much better tailored for the use-case. By using this fitting optimization target, we can significantly improve learned index’s lookup performance with no architectural changes. While the log-error is harder to optimize, our novel algorithms and optimization heuristics can bring a practical performance improvement of the lookup latency. Even in cases where fast build times are paramount, log-error regressions still provide a robust fallback for degenerated leaf models. The resulting regressions are much better suited for learned indexes, and speed up lookups on data sets with outliers by over a factor of 2.","PeriodicalId":306522,"journal":{"name":"Fourth Workshop in Exploiting AI Techniques for Data Management","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129730166","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RUSLI: Real-time Updatable Spline Learned Index","authors":"Mayank Mishra, Rekha Singhal","doi":"10.1145/3464509.3464886","DOIUrl":"https://doi.org/10.1145/3464509.3464886","url":null,"abstract":"Machine learning algorithms have accelerated data access through ‘learned index’, where a set of data items is indexed by a model learned on the pairs of data key and the corresponding record’s position in the memory. Most of the learned indexes require retraining of the model for new data insertions in the data set. The retraining is expensive and takes as much time as the model training. So, today, learned indexes are updated by retraining on batch inserts to amortize the cost. However, real-time applications, such as data-driven recommendation applications need to access users’ feature store in real-time both for reading data of existing users and adding new users as well. This motivates us to present a real-time updatable spline learned index, RUSLI, by learning the distribution of data keys with their positions in memory through splines. We have extended RadixSpline [8] to build the updatable learned index while supporting real-time inserts in a data set without affecting the lookup time on the updated data set. We have shown that RUSLI can update the index in constant time with an additional temporary memory of size proportional to the number of splines. We have discussed how to reduce the size of the presented index using the distribution of spline keys while building the radix table. RULSI is shown to incur 270ns for lookup and 50ns for insert operations. Further, we have shown that RUSLI supports concurrent lookup and insert operations with a throughput of 40 million ops/sec. We have presented and discussed performance numbers of RUSLI for single and concurrent inserts, lookup, and range queries on SOSD [9] benchmark.","PeriodicalId":306522,"journal":{"name":"Fourth Workshop in Exploiting AI Techniques for Data Management","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130839177","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"LEA: A Learned Encoding Advisor for Column Stores","authors":"Lujing Cen, Andreas Kipf, Ryan Marcus, Tim Kraska","doi":"10.1145/3464509.3464885","DOIUrl":"https://doi.org/10.1145/3464509.3464885","url":null,"abstract":"Data warehouses organize data in a columnar format to enable faster scans and better compression. Modern systems offer a variety of column encodings that can reduce storage footprint and improve query performance. Selecting a good encoding scheme for a particular column is an optimization problem that depends on the data, the query workload, and the underlying hardware. We introduce Learned Encoding Advisor (LEA), a learned approach to column encoding selection. LEA is trained on synthetic datasets with various distributions on the target system. Once trained, LEA uses sample data and statistics (such as cardinality) from the user’s database to predict the optimal column encodings. LEA can optimize for encoded size, query performance, or a combination of the two. Compared to the heuristic-based encoding advisor of a commercial column store on TPC-H, LEA achieves 19% lower query latency while using 26% less space.","PeriodicalId":306522,"journal":{"name":"Fourth Workshop in Exploiting AI Techniques for Data Management","volume":"77 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-05-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124956874","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}