Sen Wu, Luke Hsiao, Xiao Cheng, Braden Hancock, Theodoros Rekatsinas, Philip Levis, Christopher Ré
{"title":"Fonduer: Knowledge Base Construction from Richly Formatted Data.","authors":"Sen Wu, Luke Hsiao, Xiao Cheng, Braden Hancock, Theodoros Rekatsinas, Philip Levis, Christopher Ré","doi":"10.1145/3183713.3183729","DOIUrl":"https://doi.org/10.1145/3183713.3183729","url":null,"abstract":"<p>We focus on knowledge base construction (KBC) from richly formatted data. In contrast to KBC from text or tabular data, KBC from richly formatted data aims to extract relations conveyed jointly via textual, structural, tabular, and visual expressions. We introduce Fonduer, a machine-learning-based KBC system for richly formatted data. Fonduer presents a new data model that accounts for three challenging characteristics of richly formatted data: (1) prevalent document-level relations, (2) multimodality, and (3) data variety. Fonduer uses a new deep-learning model to automatically capture the representation (i.e., features) needed to learn how to extract relations from richly formatted data. Finally, Fonduer provides a new programming model that enables users to convert domain expertise, based on multiple modalities of information, to meaningful signals of supervision for training a KBC system. Fonduer-based KBC systems are in production for a range of use cases, including at a major online retailer. We compare Fonduer against state-of-the-art KBC approaches in four different domains. We show that Fonduer achieves an average improvement of 41 F1 points on the quality of the output knowledge base - and in some cases produces up to 1.87× the number of correct entries - compared to expert-curated public knowledge bases. We also conduct a user study to assess the usability of Fonduer's new programming model. We show that after using Fonduer for only 30 minutes, non-domain experts are able to design KBC systems that achieve on average 23 F1 points higher quality than traditional machine-learning-based KBC approaches.</p>","PeriodicalId":87344,"journal":{"name":"Proceedings. ACM-SIGMOD International Conference on Management of Data","volume":"2018 ","pages":"1301-1316"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1145/3183713.3183729","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"36253180","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Submodularity of Distributed Join Computation.","authors":"Rundong Li, Mirek Riedewald, Xinyan Deng","doi":"10.1145/3183713.3183728","DOIUrl":"https://doi.org/10.1145/3183713.3183728","url":null,"abstract":"<p>We study distributed equi-join computation in the presence of join-attribute skew, which causes load imbalance. Skew can be addressed by more fine-grained partitioning, at the cost of input duplication. For random load assignment, e.g., using a hash function, fine-grained partitioning creates a tradeoff between load expectation and variance. We show that minimizing load variance subject to a constraint on expectation is a monotone submodular maximization problem with Knapsack constraints, hence admitting provably near-optimal greedy solutions. In contrast to previous work on formal optimality guarantees, we can prove this result also for self-joins and more general load functions defined as weighted sum of input and output. We further demonstrate through experiments that this theoretical result leads to an effective algorithm for the problem of minimizing running time, even when load is assigned deterministically.</p>","PeriodicalId":87344,"journal":{"name":"Proceedings. ACM-SIGMOD International Conference on Management of Data","volume":"2018 ","pages":"1237-1252"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1145/3183713.3183728","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"37216551","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Uniform Tuning Problem on SQL-On-Hadoop Query Processing","authors":"Edson Ramiro Lucas Filho","doi":"10.1145/3055167.3055172","DOIUrl":"https://doi.org/10.1145/3055167.3055172","url":null,"abstract":"SQL-On-Hadoop systems translate a given query into several MapReduce jobs. Each job executes a different set of query operators over different input data sets, which leads to distinct resource consumption patterns. Once each job has a different resource consumption pattern they should receive tailor made tuning setup. However, SQL-On-Hadoop systems propagate the same tuning to every job in the query plan because they are not able to apply a specific tuning setup per job. Propagating the same tuning through the query plan is a problem because it drives the query to sub-optimal performance and drives tuning advisors to re-profile similar jobs several times. In our research we characterize this problem and propose a solution. Preliminary results show that our approach can reduce the number of profiles required by tuning advisors in 67% for TPC-H.","PeriodicalId":87344,"journal":{"name":"Proceedings. ACM-SIGMOD International Conference on Management of Data","volume":"1 1","pages":"22-24"},"PeriodicalIF":0.0,"publicationDate":"2017-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82988952","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ce Zhang, Jaeho Shin, Christopher Ré, Michael Cafarella, Feng Niu
{"title":"Extracting Databases from Dark Data with DeepDive.","authors":"Ce Zhang, Jaeho Shin, Christopher Ré, Michael Cafarella, Feng Niu","doi":"10.1145/2882903.2904442","DOIUrl":"10.1145/2882903.2904442","url":null,"abstract":"<p>DeepDive is a system for extracting relational databases from <i>dark data</i>: the mass of text, tables, and images that are widely collected and stored but which cannot be exploited by standard relational tools. If the information in dark data - scientific papers, Web classified ads, customer service notes, and so on - were instead in a relational database, it would give analysts a massive and valuable new set of \"big data.\" DeepDive is distinctive when compared to previous information extraction systems in its ability to obtain very high precision and recall at reasonable engineering cost; in a number of applications, we have used DeepDive to create databases with accuracy that meets that of human annotators. To date we have successfully deployed DeepDive to create data-centric applications for insurance, materials science, genomics, paleontologists, law enforcement, and others. The data unlocked by DeepDive represents a massive opportunity for industry, government, and scientific researchers. DeepDive is enabled by an unusual design that combines large-scale probabilistic inference with a novel developer interaction cycle. This design is enabled by several core innovations around probabilistic training and inference.</p>","PeriodicalId":87344,"journal":{"name":"Proceedings. ACM-SIGMOD International Conference on Management of Data","volume":"2016 ","pages":"847-859"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5350112/pdf/nihms-826684.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"34832482","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Alexander Shkapsky, Mohan Yang, Matteo Interlandi, Hsuan Chiu, Tyson Condie, Carlo Zaniolo
{"title":"Big Data Analytics with Datalog Queries on Spark.","authors":"Alexander Shkapsky, Mohan Yang, Matteo Interlandi, Hsuan Chiu, Tyson Condie, Carlo Zaniolo","doi":"10.1145/2882903.2915229","DOIUrl":"https://doi.org/10.1145/2882903.2915229","url":null,"abstract":"<p>There is great interest in exploiting the opportunity provided by cloud computing platforms for large-scale analytics. Among these platforms, Apache Spark is growing in popularity for machine learning and graph analytics. Developing efficient complex analytics in Spark requires deep understanding of both the algorithm at hand and the Spark API or subsystem APIs (e.g., Spark SQL, GraphX). Our BigDatalog system addresses the problem by providing concise declarative specification of complex queries amenable to efficient evaluation. Towards this goal, we propose compilation and optimization techniques that tackle the important problem of efficiently supporting recursion in Spark. We perform an experimental comparison with other state-of-the-art large-scale Datalog systems and verify the efficacy of our techniques and effectiveness of Spark in supporting Datalog-based analytics.</p>","PeriodicalId":87344,"journal":{"name":"Proceedings. ACM-SIGMOD International Conference on Management of Data","volume":"2016 ","pages":"1135-1149"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1145/2882903.2915229","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"35099850","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enabling Incremental Query Re-Optimization.","authors":"Mengmeng Liu, Zachary G Ives, Boon Thau Loo","doi":"10.1145/2882903.2915212","DOIUrl":"https://doi.org/10.1145/2882903.2915212","url":null,"abstract":"<p>As declarative query processing techniques expand to the Web, data streams, network routers, and cloud platforms, there is an increasing need to re-plan execution in the presence of unanticipated performance changes. New runtime information may affect which query plan we prefer to run. Adaptive techniques require innovation both in terms of the <i>algorithms used to estimate costs</i>, and in terms of the <i>search algorithm</i> that finds the best plan. We investigate how to build a cost-based optimizer that recomputes the optimal plan <i>incrementally</i> given new cost information, much as a stream engine constantly updates its outputs given new data. Our implementation especially shows benefits for stream processing workloads. It lays the foundations upon which a variety of novel adaptive optimization algorithms can be built. We start by leveraging the recently proposed approach of formulating query plan enumeration as a set of <i>recursive datalog queries</i>; we develop a variety of novel optimization approaches to ensure effective pruning in both static and incremental cases. We further show that the lessons learned in the declarative implementation can be equally applied to more traditional optimizer implementations.</p>","PeriodicalId":87344,"journal":{"name":"Proceedings. ACM-SIGMOD International Conference on Management of Data","volume":"2016 ","pages":"1705-1720"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1145/2882903.2915212","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"35129249","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yonathan Perez, Rok Sosič, Arijit Banerjee, Rohan Puttagunta, Martin Raison, Pararth Shah, Jure Leskovec
{"title":"Ringo: Interactive Graph Analytics on Big-Memory Machines.","authors":"Yonathan Perez, Rok Sosič, Arijit Banerjee, Rohan Puttagunta, Martin Raison, Pararth Shah, Jure Leskovec","doi":"10.1145/2723372.2735369","DOIUrl":"https://doi.org/10.1145/2723372.2735369","url":null,"abstract":"<p>We present Ringo, a system for analysis of large graphs. Graphs provide a way to represent and analyze systems of interacting objects (people, proteins, webpages) with edges between the objects denoting interactions (friendships, physical interactions, links). Mining graphs provides valuable insights about individual objects as well as the relationships among them. In building Ringo, we take advantage of the fact that machines with large memory and many cores are widely available and also relatively affordable. This allows us to build an easy-to-use interactive high-performance graph analytics system. Graphs also need to be built from input data, which often resides in the form of relational tables. Thus, Ringo provides rich functionality for manipulating raw input data tables into various kinds of graphs. Furthermore, Ringo also provides over 200 graph analytics functions that can then be applied to constructed graphs. We show that a single big-memory machine provides a very attractive platform for performing analytics on all but the largest graphs as it offers excellent performance and ease of use as compared to alternative approaches. With Ringo, we also demonstrate how to integrate graph analytics with an iterative process of trial-and-error data exploration and rapid experimentation, common in data mining workloads.</p>","PeriodicalId":87344,"journal":{"name":"Proceedings. ACM-SIGMOD International Conference on Management of Data","volume":"2015 ","pages":"1105-1110"},"PeriodicalIF":0.0,"publicationDate":"2015-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1145/2723372.2735369","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"34404726","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Network Coding Based Framework for Construction of Systematic Minimum Bandwidth Regenerating (MBR) Codes for Distributed Storage","authors":"S. Kadhe, M. Chandra, B. Janakiram","doi":"10.5555/2694476.2694489","DOIUrl":"https://doi.org/10.5555/2694476.2694489","url":null,"abstract":"Regenerating codes are a family of erasure correcting codes that are primarily designed to minimize the amount of data required to be downloaded to repair a failed node in a distributed storage system. In this article, the construction of systematic Minimum Bandwidth Regenerating (MBR) codes based on random network coding, is presented. The repair model considered is the hybrid repair model, wherein, the source (message) symbols are exactly replicated, while the redundant (parity) symbols are replaced by their functionally equivalent symbols. It is showed that the random network coding based constructions can preserve the practically favorable systematic feature and still achieve the optimal trade off between storage and repair bandwidth, if the coding is performed by combining the judiciously selected source symbols. Unlike most of the schemes present in the literature, the proposed constructions do not pose any restriction on the number of nodes participating in repair or on the total number of nodes, and thus add reconfigurability to the system. Moreover, during the repair of systematic nodes, the proposed codes require less number of disk reads compared to most of the codes in the literature. In the second half of the article, it is proven that the proposed constructions satisfy the necessary subspace properties of a linear exact regenerating code that are established in the literature. Further, rigorous analytical study of the effect of Galois field size on the probability of successful regeneration and reconstruction is carried out, and the results are validated using the numerical simulations.","PeriodicalId":87344,"journal":{"name":"Proceedings. ACM-SIGMOD International Conference on Management of Data","volume":"46 1","pages":"45-55"},"PeriodicalIF":0.0,"publicationDate":"2013-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87846542","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Polarity Trend Analysis of Public Sentiment on YouTube","authors":"Amar Krishna, Joseph Zambreno, Sandeep Krishnan","doi":"10.31274/ETD-180810-146","DOIUrl":"https://doi.org/10.31274/ETD-180810-146","url":null,"abstract":"For the past several years YouTube has been by far the largest user-driven online video provider. While many of these videos contain a significant number of user comments, little work has been done to date in extracting trends from these comments because of their low information consistency and quality. In this paper we perform sentiment analysis of the YouTube comments related to popular topics using machine learning techniques. We demonstrate that an analysis of the sentiments to identify their trends, seasonality and forecasts can provide a clear picture of the influence of real-world events on user sentiments.","PeriodicalId":87344,"journal":{"name":"Proceedings. ACM-SIGMOD International Conference on Management of Data","volume":"28 1","pages":"125-128"},"PeriodicalIF":0.0,"publicationDate":"2013-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80194107","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Woon-Hak Kang, Sang-Won Lee, Bongki Moon, Gihwan Oh, Changwoo Min
{"title":"X-FTL: transactional FTL for SQLite databases","authors":"Woon-Hak Kang, Sang-Won Lee, Bongki Moon, Gihwan Oh, Changwoo Min","doi":"10.1145/2463676.2465326","DOIUrl":"https://doi.org/10.1145/2463676.2465326","url":null,"abstract":"In the era of smartphones and mobile computing, many popular applications such as Facebook, twitter, Gmail, and even Angry birds game manage their data using SQLite. This is mainly due to the development productivity and solid transactional support. For transactional atomicity, however, SQLite relies on less sophisticated but costlier page-oriented journaling mechanisms. Hence, this is often cited as the main cause of tardy responses in mobile applications. Flash memory does not allow data to be updated in place, and the copy-on-write strategy is adopted by most flash storage devices. In this paper, we propose X-FTL, a transactional flash translation layer(FTL) for SQLite databases. By offloading the burden of guaranteeing the transactional atomicity from a host system to flash storage and by taking advantage of the copy-on-write strategy used in modern FTLs, X-FTL drastically improves the transactional throughput almost for free without resorting to costly journaling schemes. We have implemented X-FTL on an SSD development board called OpenSSD, and modified SQLite and ext4 file system minimally to make them compatible with the extended abstractions provided by X-FTL. We demonstrate the effectiveness of X-FTL using real and synthetic SQLite workloads for smartphone applications, TPC-C benchmark for OLTP databases, and FIO benchmark for file systems.","PeriodicalId":87344,"journal":{"name":"Proceedings. ACM-SIGMOD International Conference on Management of Data","volume":"8 1","pages":"97-108"},"PeriodicalIF":0.0,"publicationDate":"2013-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81576235","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}