{"title":"Fides: Towards a Platform for Responsible Data Science","authors":"Julia Stoyanovich, Bill Howe, S. Abiteboul, G. Miklau, Arnaud Sahuguet, G. Weikum","doi":"10.1145/3085504.3085530","DOIUrl":"https://doi.org/10.1145/3085504.3085530","url":null,"abstract":"Issues of responsible data analysis and use are coming to the forefront of the discourse in data science research and practice, with most significant efforts to date on the part of the data mining, machine learning, and security and privacy communities. In these fields, the research has been focused on analyzing the fairness, accountability and transparency (FAT) properties of specific algorithms and their outputs. Although these issues are most apparent in the social sciences where fairness is interpreted in terms of the distribution of resources across protected groups, management of bias in source data affects a variety of fields. Consider climate change studies that require representative data from geographically diverse regions, or supply chain analyses that require data that represents the diversity of products and customers. Any domain that involves sparse or sampled data has exposure to potential bias. In this vision paper, we argue that FAT properties must be considered as database system issues, further upstream in the data science lifecycle: bias in source data goes unnoticed, and bias may be introduced during pre-processing (fairness), spurious correlations lead to reproducibility problems (accountability), and assumptions made during pre-processing have invisible but significant effects on decisions (transparency). As machine learning methods continue to be applied broadly by non-experts, the potential for misuse increases. We see a need for a data sharing and collaborative analytics platform with features to encourage (and in some cases, enforce) best practices at all stages of the data science lifecycle. We describe features of such a platform, which we term Fides, in the context of urban analytics, outlining a systems research agenda in responsible data science.","PeriodicalId":431308,"journal":{"name":"Proceedings of the 29th International Conference on Scientific and Statistical Database Management","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126327468","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SGVCut: A Vertex-Cut Partitioning Tool for Random Walks-based Computations over Social Network graphs","authors":"Yifan Li, Camélia Constantin, C. Mouza","doi":"10.1145/3085504.3091114","DOIUrl":"https://doi.org/10.1145/3085504.3091114","url":null,"abstract":"Several distributed frameworks have recently emerged to perform computations on large-scale graphs. However some recent studies have highlighted that vertex-partitioning approaches, e.g. Giraph, failed to achieve workload-balanced partitioning for skewed graphs, typically having a heavy-tail degree distribution. While edge-partitioning approaches such as PowerGraph and GraphX provide beter balancing and performances for graph computation, they supply a generic framework, independent from the computation. This demonstration presents SGVCut to display our edge partitions designed for random walks-based computation, which is the foundation of many graph algorithms, on skewed graphs. The demonstration scenario introduces SGVCut interface and illustrates the benefits of our approach compare to other partitioning strategies for different settings and algorithms.","PeriodicalId":431308,"journal":{"name":"Proceedings of the 29th International Conference on Scientific and Statistical Database Management","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126443489","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dot-Product Join: Scalable In-Database Linear Algebra for Big Model Analytics","authors":"Chengjie Qin, Florin Rusu","doi":"10.1145/3085504.3085512","DOIUrl":"https://doi.org/10.1145/3085504.3085512","url":null,"abstract":"Big Model analytics tackles the training of massive models that go beyond the available memory of a single computing device, e.g., CPU or GPU. It generalizes Big Data analytics which is targeted at how to train memory-resident models over out-of-memory training data. In this paper, we propose an in-database solution for Big Model analytics. We identify dot-product as the primary operation for training generalized linear models and introduce the first array-relation dot-product join database operator between a set of sparse arrays and a dense relation. This is a constrained formulation of the extensively studied sparse matrix vector multiplication (SpMV) kernel. The paramount challenge in designing the dot-product join operator is how to optimally schedule access to the dense relation based on the non-contiguous entries in the sparse arrays. We propose a practical solution characterized by two technical contributions---dynamic batch processing and array reordering. We devise three heuristics -- LSH, Radix, and K-center -- for array reordering and analyze them thoroughly. We execute extensive experiments over synthetic and real data that confirm the minimal overhead the operator incurs when sufficient memory is available and the graceful degradation it suffers as memory becomes scarce. Moreover, dot-product join achieves an order of magnitude reduction in execution time over alternative solutions.","PeriodicalId":431308,"journal":{"name":"Proceedings of the 29th International Conference on Scientific and Statistical Database Management","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132699799","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Incremental Temporal Pattern Mining Using Efficient Batch-Free Stream Clustering","authors":"Yifeng Lu, Marwan Hassani, T. Seidl","doi":"10.1145/3085504.3085511","DOIUrl":"https://doi.org/10.1145/3085504.3085511","url":null,"abstract":"This paper address the problem of temporal pattern mining from multiple data streams containing temporal events. Temporal events are considered as real world events aligned with comprehensive starting and ending timing information rather than simple integer timestamps. Predefined relations, such as \"before\" and \"after\", describe the heterogeneous relationships hidden in temporal data with limited diversity. In this work, the relationships among events are learned dynamically from the temporal information. Each event is treated as an object with a label and numerical attributes. An online-offline model is used as the primary structure for analyzing the evolving multiple streams. Different distance functions on temporal events and sequences can be applied depending on the application scenario. A prefix tree is introduced for a fast incremental pattern update. Events in the real world usually persist for some period. It is more natural to model events as intervals with temporal information rather than as points on the timeline. Based on the representation proposed in this work, our approach can also be extended to handle interval data. Experiments show how the method, with richer information and more accurate results than the state-of-the-art, processes both point-based and interval-based event streams efficiently.","PeriodicalId":431308,"journal":{"name":"Proceedings of the 29th International Conference on Scientific and Statistical Database Management","volume":"71 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121810688","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Probabilistic k-Nearest Neighbor Monitoring of Moving Gaussians","authors":"Kostas Patroumpas, Christos Koutras","doi":"10.1145/3085504.3085525","DOIUrl":"https://doi.org/10.1145/3085504.3085525","url":null,"abstract":"We consider a centralized server that receives streaming updates from numerous moving objects regarding their current whereabouts. However, each object always relays its location cloaked into a broader uncertainty region under a Bivariate Gaussian model of varying densities. We wish to monitor a large number of continuous queries, each seeking k objects nearest to its own focal point with likelihood above a given threshold, e.g., \"which of my friends are currently the k = 3 closest to our preferred cafe with probability over 75%\". Since an exhaustive evaluation would be prohibitive, we develop heuristics based on spatial and probabilistic properties of the uncertainty model, and promptly issue approximate, yet reliable answers with confidence margins. We conducted a comprehensive empirical study to assess the performance and response quality of the proposed methodology, confirming that it can efficiently cope with large numbers of moving Gaussian objects under fluctuating uncertainty conditions, while also offering timely response with tolerable error to multiple queries of varying specifications.","PeriodicalId":431308,"journal":{"name":"Proceedings of the 29th International Conference on Scientific and Statistical Database Management","volume":"502 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129995851","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Skluma: A Statistical Learning Pipeline for Taming Unkempt Data Repositories","authors":"Paul Beckman, Tyler J. Skluzacek, K. Chard, Ian T Foster","doi":"10.1145/3085504.3091116","DOIUrl":"https://doi.org/10.1145/3085504.3091116","url":null,"abstract":"Scientists' capacity to make use of existing data is predicated on their ability to find and understand those data. While significant progress has been made with respect to data publication, and indeed one can point to a number of well organized and highly utilized data repositories, there remain many such repositories in which archived data are poorly described and thus impossible to use. We present Skluma---an automated system designed to process vast amounts of data and extract deeply embedded metadata, latent topics, relationships between data, and contextual metadata derived from related documents. We show that Skluma can be used to organize and index a large climate data collection that totals more than 500GB of data in over a half-million files.","PeriodicalId":431308,"journal":{"name":"Proceedings of the 29th International Conference on Scientific and Statistical Database Management","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122867642","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Incremental Discovery of Inclusion Dependencies","authors":"Nuhad Shaabani, C. Meinel","doi":"10.1145/3085504.3085506","DOIUrl":"https://doi.org/10.1145/3085504.3085506","url":null,"abstract":"Inclusion dependencies form one of the most fundamental classes of integrity constraints. Their importance in classical data management is reinforced by modern applications such as data profiling, data cleaning, entity resolution and schema matching. Their discovery in an unknown dataset is at the core of any data analysis effort. Therefore, several research approaches have focused on their efficient discovery in a given, static dataset. However, none of these approaches are appropriate for applications on dynamic datasets, such as transactional datasets, scientific applications, and social network. In these cases, discovery techniques should be able to efficiently update the inclusion dependencies after an update in the dataset, without reprocessing the entire dataset. We present the first approach for incrementally updating the unary inclusion dependencies. In particular, our approach is based on the concept of attribute clustering from which the unary inclusion dependencies are efficiently derivable. We incrementally update the clusters after each update of the dataset. Updating the clusters does not need to access the dataset because of special data structures designed to efficiently support the updating process. We perform an exhaustive analysis of our approach by applying it to large datasets with several hundred attributes and more than 116,200,000 million tuples. The results show that the incremental discovery significantly reduces the runtime needed by the static discovery. This reduction in the runtime is up to 99.9996 % for both the insert and the delete.","PeriodicalId":431308,"journal":{"name":"Proceedings of the 29th International Conference on Scientific and Statistical Database Management","volume":"47 59","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120942156","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Class-based Conditional MaxRS Query in Spatial Data Streams","authors":"Mir Imtiaz Mostafiz, S. Mahmud, Muhammed Mas-ud Hussain, Mohammed Eunus Ali, Goce Trajcevski","doi":"10.1145/3085504.3085517","DOIUrl":"https://doi.org/10.1145/3085504.3085517","url":null,"abstract":"We address the problem of maintaining the correct answer-sets to the Conditional Maximizing Range-Sum (C-MaxRS) query in spatial data streams. Given a set of (possibly weighted) 2D point objects, the traditional MaxRS problem determines an optimal placement for an axes-parallel rectangle r so that the number -- or, the weighted sum -- of objects in its interior is maximized. In many practical settings, the objects from a particular set -- e.g., restaurants -- can be of distinct types -- e.g., fast-food, Asian, etc. The C-MaxRS problem deals with maximizing the overall sum, given class-based existential constraints, i.e., a lower bound on the count of objects of interests from particular classes. We first propose an efficient algorithm to the static C-MaxRS query, and extend the solution to handle dynamic (data streams) settings. Our experiments over datasets of up to 100,000 objects show that the proposed solutions provide significant efficiency benefits.","PeriodicalId":431308,"journal":{"name":"Proceedings of the 29th International Conference on Scientific and Statistical Database Management","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124655929","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Computing Influence of a Product through Uncertain Reverse Skyline","authors":"Md. Saiful Islam, W. Rahayu, Chengfei Liu, Tarique Anwar, Bela Stantic","doi":"10.1145/3085504.3085508","DOIUrl":"https://doi.org/10.1145/3085504.3085508","url":null,"abstract":"Understanding the influence of a product is crucially important for making informed business decisions. This paper introduces a new type of skyline queries, called uncertain reverse skyline, for measuring the influence of a probabilistic product in uncertain data settings. More specifically, given a dataset of probabilistic products P and a set of customers C, an uncertain reverse skyline of a probabilistic product q retrieves all customers c ∈ C which include q as one of their preferred products. We present efficient pruning ideas and techniques for processing the uncertain reverse skyline query of a probabilistic product using R-Tree data index. We also present an efficient parallel approach to compute the uncertain reverse skyline and influence score of a probabilistic product. Our approach significantly outperforms the baseline approach derived from the existing literature. The efficiency of our approach is demonstrated by conducting experiments with both real and synthetic datasets.","PeriodicalId":431308,"journal":{"name":"Proceedings of the 29th International Conference on Scientific and Statistical Database Management","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114910188","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Measuring Fairness in Ranked Outputs","authors":"Ke Yang, Julia Stoyanovich","doi":"10.1145/3085504.3085526","DOIUrl":"https://doi.org/10.1145/3085504.3085526","url":null,"abstract":"Ranking and scoring are ubiquitous. We consider the setting in which an institution, called a ranker, evaluates a set of individuals based on demographic, behavioral or other characteristics. The final output is a ranking that represents the relative quality of the individuals. While automatic and therefore seemingly objective, rankers can, and often do, discriminate against individuals and systematically disadvantage members of protected groups. This warrants a careful study of the fairness of a ranking scheme, to enable data science for social good applications, among others. In this paper we propose fairness measures for ranked outputs. We develop a data generation procedure that allows us to systematically control the degree of unfairness in the output, and study the behavior of our measures on these datasets. We then apply our proposed measures to several real datasets, and detect cases of bias. Finally, we show preliminary results of incorporating our ranked fairness measures into an optimization framework, and show potential for improving fairness of ranked outputs while maintaining accuracy. The code implementing all parts of this work is publicly available at https://github.com/DataResponsibly/FairRank.","PeriodicalId":431308,"journal":{"name":"Proceedings of the 29th International Conference on Scientific and Statistical Database Management","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131566604","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}