{"title":"Fides: Towards a Platform for Responsible Data Science","authors":"Julia Stoyanovich, Bill Howe, S. Abiteboul, G. Miklau, Arnaud Sahuguet, G. Weikum","doi":"10.1145/3085504.3085530","DOIUrl":"https://doi.org/10.1145/3085504.3085530","url":null,"abstract":"Issues of responsible data analysis and use are coming to the forefront of the discourse in data science research and practice, with most significant efforts to date on the part of the data mining, machine learning, and security and privacy communities. In these fields, the research has been focused on analyzing the fairness, accountability and transparency (FAT) properties of specific algorithms and their outputs. Although these issues are most apparent in the social sciences where fairness is interpreted in terms of the distribution of resources across protected groups, management of bias in source data affects a variety of fields. Consider climate change studies that require representative data from geographically diverse regions, or supply chain analyses that require data that represents the diversity of products and customers. Any domain that involves sparse or sampled data has exposure to potential bias. In this vision paper, we argue that FAT properties must be considered as database system issues, further upstream in the data science lifecycle: bias in source data goes unnoticed, and bias may be introduced during pre-processing (fairness), spurious correlations lead to reproducibility problems (accountability), and assumptions made during pre-processing have invisible but significant effects on decisions (transparency). As machine learning methods continue to be applied broadly by non-experts, the potential for misuse increases. We see a need for a data sharing and collaborative analytics platform with features to encourage (and in some cases, enforce) best practices at all stages of the data science lifecycle. We describe features of such a platform, which we term Fides, in the context of urban analytics, outlining a systems research agenda in responsible data science.","PeriodicalId":431308,"journal":{"name":"Proceedings of the 29th International Conference on Scientific and Statistical Database Management","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126327468","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SGVCut: A Vertex-Cut Partitioning Tool for Random Walks-based Computations over Social Network graphs","authors":"Yifan Li, Camélia Constantin, C. Mouza","doi":"10.1145/3085504.3091114","DOIUrl":"https://doi.org/10.1145/3085504.3091114","url":null,"abstract":"Several distributed frameworks have recently emerged to perform computations on large-scale graphs. However some recent studies have highlighted that vertex-partitioning approaches, e.g. Giraph, failed to achieve workload-balanced partitioning for skewed graphs, typically having a heavy-tail degree distribution. While edge-partitioning approaches such as PowerGraph and GraphX provide beter balancing and performances for graph computation, they supply a generic framework, independent from the computation. This demonstration presents SGVCut to display our edge partitions designed for random walks-based computation, which is the foundation of many graph algorithms, on skewed graphs. The demonstration scenario introduces SGVCut interface and illustrates the benefits of our approach compare to other partitioning strategies for different settings and algorithms.","PeriodicalId":431308,"journal":{"name":"Proceedings of the 29th International Conference on Scientific and Statistical Database Management","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126443489","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dot-Product Join: Scalable In-Database Linear Algebra for Big Model Analytics","authors":"Chengjie Qin, Florin Rusu","doi":"10.1145/3085504.3085512","DOIUrl":"https://doi.org/10.1145/3085504.3085512","url":null,"abstract":"Big Model analytics tackles the training of massive models that go beyond the available memory of a single computing device, e.g., CPU or GPU. It generalizes Big Data analytics which is targeted at how to train memory-resident models over out-of-memory training data. In this paper, we propose an in-database solution for Big Model analytics. We identify dot-product as the primary operation for training generalized linear models and introduce the first array-relation dot-product join database operator between a set of sparse arrays and a dense relation. This is a constrained formulation of the extensively studied sparse matrix vector multiplication (SpMV) kernel. The paramount challenge in designing the dot-product join operator is how to optimally schedule access to the dense relation based on the non-contiguous entries in the sparse arrays. We propose a practical solution characterized by two technical contributions---dynamic batch processing and array reordering. We devise three heuristics -- LSH, Radix, and K-center -- for array reordering and analyze them thoroughly. We execute extensive experiments over synthetic and real data that confirm the minimal overhead the operator incurs when sufficient memory is available and the graceful degradation it suffers as memory becomes scarce. Moreover, dot-product join achieves an order of magnitude reduction in execution time over alternative solutions.","PeriodicalId":431308,"journal":{"name":"Proceedings of the 29th International Conference on Scientific and Statistical Database Management","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132699799","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Incremental Temporal Pattern Mining Using Efficient Batch-Free Stream Clustering","authors":"Yifeng Lu, Marwan Hassani, T. Seidl","doi":"10.1145/3085504.3085511","DOIUrl":"https://doi.org/10.1145/3085504.3085511","url":null,"abstract":"This paper address the problem of temporal pattern mining from multiple data streams containing temporal events. Temporal events are considered as real world events aligned with comprehensive starting and ending timing information rather than simple integer timestamps. Predefined relations, such as \"before\" and \"after\", describe the heterogeneous relationships hidden in temporal data with limited diversity. In this work, the relationships among events are learned dynamically from the temporal information. Each event is treated as an object with a label and numerical attributes. An online-offline model is used as the primary structure for analyzing the evolving multiple streams. Different distance functions on temporal events and sequences can be applied depending on the application scenario. A prefix tree is introduced for a fast incremental pattern update. Events in the real world usually persist for some period. It is more natural to model events as intervals with temporal information rather than as points on the timeline. Based on the representation proposed in this work, our approach can also be extended to handle interval data. Experiments show how the method, with richer information and more accurate results than the state-of-the-art, processes both point-based and interval-based event streams efficiently.","PeriodicalId":431308,"journal":{"name":"Proceedings of the 29th International Conference on Scientific and Statistical Database Management","volume":"71 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121810688","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Probabilistic k-Nearest Neighbor Monitoring of Moving Gaussians","authors":"Kostas Patroumpas, Christos Koutras","doi":"10.1145/3085504.3085525","DOIUrl":"https://doi.org/10.1145/3085504.3085525","url":null,"abstract":"We consider a centralized server that receives streaming updates from numerous moving objects regarding their current whereabouts. However, each object always relays its location cloaked into a broader uncertainty region under a Bivariate Gaussian model of varying densities. We wish to monitor a large number of continuous queries, each seeking k objects nearest to its own focal point with likelihood above a given threshold, e.g., \"which of my friends are currently the k = 3 closest to our preferred cafe with probability over 75%\". Since an exhaustive evaluation would be prohibitive, we develop heuristics based on spatial and probabilistic properties of the uncertainty model, and promptly issue approximate, yet reliable answers with confidence margins. We conducted a comprehensive empirical study to assess the performance and response quality of the proposed methodology, confirming that it can efficiently cope with large numbers of moving Gaussian objects under fluctuating uncertainty conditions, while also offering timely response with tolerable error to multiple queries of varying specifications.","PeriodicalId":431308,"journal":{"name":"Proceedings of the 29th International Conference on Scientific and Statistical Database Management","volume":"502 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129995851","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Skluma: A Statistical Learning Pipeline for Taming Unkempt Data Repositories","authors":"Paul Beckman, Tyler J. Skluzacek, K. Chard, Ian T Foster","doi":"10.1145/3085504.3091116","DOIUrl":"https://doi.org/10.1145/3085504.3091116","url":null,"abstract":"Scientists' capacity to make use of existing data is predicated on their ability to find and understand those data. While significant progress has been made with respect to data publication, and indeed one can point to a number of well organized and highly utilized data repositories, there remain many such repositories in which archived data are poorly described and thus impossible to use. We present Skluma---an automated system designed to process vast amounts of data and extract deeply embedded metadata, latent topics, relationships between data, and contextual metadata derived from related documents. We show that Skluma can be used to organize and index a large climate data collection that totals more than 500GB of data in over a half-million files.","PeriodicalId":431308,"journal":{"name":"Proceedings of the 29th International Conference on Scientific and Statistical Database Management","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122867642","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Incremental Discovery of Inclusion Dependencies","authors":"Nuhad Shaabani, C. Meinel","doi":"10.1145/3085504.3085506","DOIUrl":"https://doi.org/10.1145/3085504.3085506","url":null,"abstract":"Inclusion dependencies form one of the most fundamental classes of integrity constraints. Their importance in classical data management is reinforced by modern applications such as data profiling, data cleaning, entity resolution and schema matching. Their discovery in an unknown dataset is at the core of any data analysis effort. Therefore, several research approaches have focused on their efficient discovery in a given, static dataset. However, none of these approaches are appropriate for applications on dynamic datasets, such as transactional datasets, scientific applications, and social network. In these cases, discovery techniques should be able to efficiently update the inclusion dependencies after an update in the dataset, without reprocessing the entire dataset. We present the first approach for incrementally updating the unary inclusion dependencies. In particular, our approach is based on the concept of attribute clustering from which the unary inclusion dependencies are efficiently derivable. We incrementally update the clusters after each update of the dataset. Updating the clusters does not need to access the dataset because of special data structures designed to efficiently support the updating process. We perform an exhaustive analysis of our approach by applying it to large datasets with several hundred attributes and more than 116,200,000 million tuples. The results show that the incremental discovery significantly reduces the runtime needed by the static discovery. This reduction in the runtime is up to 99.9996 % for both the insert and the delete.","PeriodicalId":431308,"journal":{"name":"Proceedings of the 29th International Conference on Scientific and Statistical Database Management","volume":"47 59","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120942156","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Class-based Conditional MaxRS Query in Spatial Data Streams","authors":"Mir Imtiaz Mostafiz, S. Mahmud, Muhammed Mas-ud Hussain, Mohammed Eunus Ali, Goce Trajcevski","doi":"10.1145/3085504.3085517","DOIUrl":"https://doi.org/10.1145/3085504.3085517","url":null,"abstract":"We address the problem of maintaining the correct answer-sets to the Conditional Maximizing Range-Sum (C-MaxRS) query in spatial data streams. Given a set of (possibly weighted) 2D point objects, the traditional MaxRS problem determines an optimal placement for an axes-parallel rectangle r so that the number -- or, the weighted sum -- of objects in its interior is maximized. In many practical settings, the objects from a particular set -- e.g., restaurants -- can be of distinct types -- e.g., fast-food, Asian, etc. The C-MaxRS problem deals with maximizing the overall sum, given class-based existential constraints, i.e., a lower bound on the count of objects of interests from particular classes. We first propose an efficient algorithm to the static C-MaxRS query, and extend the solution to handle dynamic (data streams) settings. Our experiments over datasets of up to 100,000 objects show that the proposed solutions provide significant efficiency benefits.","PeriodicalId":431308,"journal":{"name":"Proceedings of the 29th International Conference on Scientific and Statistical Database Management","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124655929","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Computing Influence of a Product through Uncertain Reverse Skyline","authors":"Md. Saiful Islam, W. Rahayu, Chengfei Liu, Tarique Anwar, Bela Stantic","doi":"10.1145/3085504.3085508","DOIUrl":"https://doi.org/10.1145/3085504.3085508","url":null,"abstract":"Understanding the influence of a product is crucially important for making informed business decisions. This paper introduces a new type of skyline queries, called uncertain reverse skyline, for measuring the influence of a probabilistic product in uncertain data settings. More specifically, given a dataset of probabilistic products P and a set of customers C, an uncertain reverse skyline of a probabilistic product q retrieves all customers c ∈ C which include q as one of their preferred products. We present efficient pruning ideas and techniques for processing the uncertain reverse skyline query of a probabilistic product using R-Tree data index. We also present an efficient parallel approach to compute the uncertain reverse skyline and influence score of a probabilistic product. Our approach significantly outperforms the baseline approach derived from the existing literature. The efficiency of our approach is demonstrated by conducting experiments with both real and synthetic datasets.","PeriodicalId":431308,"journal":{"name":"Proceedings of the 29th International Conference on Scientific and Statistical Database Management","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114910188","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Measuring Fairness in Ranked Outputs","authors":"Ke Yang, Julia Stoyanovich","doi":"10.1145/3085504.3085526","DOIUrl":"https://doi.org/10.1145/3085504.3085526","url":null,"abstract":"Ranking and scoring are ubiquitous. We consider the setting in which an institution, called a ranker, evaluates a set of individuals based on demographic, behavioral or other characteristics. The final output is a ranking that represents the relative quality of the individuals. While automatic and therefore seemingly objective, rankers can, and often do, discriminate against individuals and systematically disadvantage members of protected groups. This warrants a careful study of the fairness of a ranking scheme, to enable data science for social good applications, among others. In this paper we propose fairness measures for ranked outputs. We develop a data generation procedure that allows us to systematically control the degree of unfairness in the output, and study the behavior of our measures on these datasets. We then apply our proposed measures to several real datasets, and detect cases of bias. Finally, we show preliminary results of incorporating our ranked fairness measures into an optimization framework, and show potential for improving fairness of ranked outputs while maintaining accuracy. The code implementing all parts of this work is publicly available at https://github.com/DataResponsibly/FairRank.","PeriodicalId":431308,"journal":{"name":"Proceedings of the 29th International Conference on Scientific and Statistical Database Management","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131566604","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}