{"title":"Federated database system for scientific data","authors":"Sangchul Kim, Bongki Moon","doi":"10.1145/3221269.3222332","DOIUrl":"https://doi.org/10.1145/3221269.3222332","url":null,"abstract":"Much like traditional databases, scientific data are managed in multiple separate databases by different sources and organizations. When such distributed data are analyzed together for more comprehensive understanding and prediction, the data must be accessed via multiple simultaneous connections or collected in a single location. The inevitable consequence is, however, that a significant overhead is incurred due to differences in schemas, data transformation, and extraneous cost for storing intermediate data. This demo presents SDF, Scientific Database in Federation, which facilitates data sharing and exchange in order to support complex analytics with minimal integration overhead. SDF is currently implemented in SciDB using user-defined operators, providing two connection models, master-to-master and cluster-to-master, for a shared-nothing architecture.","PeriodicalId":365491,"journal":{"name":"Proceedings of the 30th International Conference on Scientific and Statistical Database Management","volume":"86 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127536050","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Crossing an OCEAN of queries: analyzing SQL query logs with OCEANLog","authors":"A. Wahl, Gregor Endler, Peter K. Schwab, Sebastian Herbst, Julian Rith, R. Lenz","doi":"10.1145/3221269.3223025","DOIUrl":"https://doi.org/10.1145/3221269.3223025","url":null,"abstract":"SQL queries encapsulate the knowledge of their authors about the usage of the queried data sources. This knowledge also contains aspects that cannot be inferred by analyzing the contents of the queried data sources alone. Due to the complexity of analytical SQL queries, specialized mechanisms are necessary to enable the user-friendly formulation of meta-queries over an existing query log. Currently existing approaches do not sufficiently consider syntactic and semantic aspects of queries along with contextual information. During our demonstration, conference participants learn how to use the latest release of OCEANLog, a framework for analyzing SQL query logs. Our demonstration encompasses several scenarios. Participants can explore an existing query log using domain-specific graph traversal expressions, set up continuous subscriptions for changes in the graph, create time-based visualizations of query results, configure an OCEANLog instance and learn how to decide which specific graph database to use. We also provide them with access to the native meta-query mechanisms of a DBMS to further emphasize the benefits of our graph-based approach.","PeriodicalId":365491,"journal":{"name":"Proceedings of the 30th International Conference on Scientific and Statistical Database Management","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130759974","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Publishing spatial histograms under differential privacy","authors":"S. Ghane, L. Kulik, K. Ramamohanarao","doi":"10.1145/3221269.3223039","DOIUrl":"https://doi.org/10.1145/3221269.3223039","url":null,"abstract":"Studying trajectories of individuals has received growing interest. The aggregated movement behaviour of people provides important insights about their habits, interests, and lifestyles. Understanding and utilizing trajectory data is a crucial part of many applications such as location based services, urban planning, and traffic monitoring systems. Spatial histograms and spatial range queries are key components in such applications to efficiently store and answer queries on trajectory data. A spatial histogram maintains the sequentiality of location points in a trajectory by a strong sequential dependency among histogram cells. This dependency is an essential property in answering spatial range queries. However, the trajectories of individuals are unique and even aggregating them in spatial histograms cannot completely ensure an individual's privacy. A key technique to ensure privacy for data publishing is ϵ-differential privacy as it provides a strong guarantee on an individual's provided data. Our work is the first that guarantees ϵ-differential privacy for spatial histograms on trajectories, while ensuring the sequentiality of trajectory data, i.e., its consistency. Consistency is key for any database and our proposed mechanism, PriSH, synthesizes a spatial histogram and ensures the consistency of the published histogram with respect to the strong dependency constraint. In extensive experiments on real and synthetic datasets, we show that (1) PriSH is highly scalable with the dataset size and granularity of the space decomposition, (2) the distribution of aggregate trajectory information in the synthesized histogram accurately preserves the distribution of the original histogram, and (3) the output has high accuracy in answering arbitrary spatial range queries.","PeriodicalId":365491,"journal":{"name":"Proceedings of the 30th International Conference on Scientific and Statistical Database Management","volume":"70 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123112111","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A unified framework of density-based clustering for semi-supervised classification","authors":"J. C. Gertrudes, A. Zimek, J. Sander, R. Campello","doi":"10.1145/3221269.3223037","DOIUrl":"https://doi.org/10.1145/3221269.3223037","url":null,"abstract":"Semi-supervised classification is drawing increasing attention in the era of big data, as the gap between the abundance of cheap, automatically collected unlabeled data and the scarcity of labeled data that are laborious and expensive to obtain is dramatically increasing. In this paper, we introduce a unified framework for semi-supervised classification based on building-blocks from density-based clustering. This framework is not only efficient and effective, but it is also statistically sound. Experimental results on a large collection of datasets show the advantages of the proposed framework.","PeriodicalId":365491,"journal":{"name":"Proceedings of the 30th International Conference on Scientific and Statistical Database Management","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131206704","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Metadata-driven error detection","authors":"L. Visengeriyeva, Ziawasch Abedjan","doi":"10.1145/3221269.3223028","DOIUrl":"https://doi.org/10.1145/3221269.3223028","url":null,"abstract":"Scientific data often originates from multiple sources and human agents. The integration of data from different sources must also resolve data quality problems that might occur because of inconsistency or different quality assurance levels of the sources. To identify various data quality problems in a dataset, it is necessary to use several error detection methods. Existing error detection solutions are usually tailored towards one specific type of data error, such as rule violations or outliers, requiring the application of multiple strategies. Using all possible error detection methods is also not satisfactory, as some systems might perform poorly on a particular dataset by producing a large number of false positives and missing some results. However, it is not trivial to assess the effectiveness of each strategy upfront. We propose two new holistic approaches for effectively combining off-the-shelf error detection systems. Our approaches are learning-based and incorporate metadata extracted from the dataset at hand. We empirically show, using four real-world datasets, that our method of combining error-detecting strategies achieves an average F1 score 15% higher than multiple heuristics-based baselines.","PeriodicalId":365491,"journal":{"name":"Proceedings of the 30th International Conference on Scientific and Statistical Database Management","volume":"173 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127657571","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Visual querying of large multilayer graphs","authors":"Erick Cuenca, A. Sallaberry, D. Ienco, P. Poncelet","doi":"10.1145/3221269.3223027","DOIUrl":"https://doi.org/10.1145/3221269.3223027","url":null,"abstract":"Many real-world data can be represented by a network with a set of nodes linked to each other by multiple relations. Such a rich graph is called a multilayer graph. In this demo, we present a tool for Visual Querying of Large Multilayer Graphs that allows users to visually draw the query, retrieve result patterns and finally navigate and browse the results considering the original multilayer graph database. Our approach does not only provide a graphical user interface for the graph engine but the query processing is fully integrated.","PeriodicalId":365491,"journal":{"name":"Proceedings of the 30th International Conference on Scientific and Statistical Database Management","volume":"344 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132317964","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PathGraph","authors":"Dario Colazzo, Vincenzo Mecca, Maurizio Nolé, C. Sartiani","doi":"10.1145/3221269.3222331","DOIUrl":"https://doi.org/10.1145/3221269.3222331","url":null,"abstract":"With the widespread diffusion of social networks and the dawn of data-intensive scientific applications, graphs became one of the foundations for modern data management applications. A key role in graph querying and analysis is played by Regular Path Queries, their extensions, and, in particular, GXPath. In this demo we will present PathGraph, a distributed GXPath query processor, and its web-based graphical interface.","PeriodicalId":365491,"journal":{"name":"Proceedings of the 30th International Conference on Scientific and Statistical Database Management","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124342482","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient anti-community detection in complex networks","authors":"Sebastian Lackner, Andreas Spitz, M. Weidemüller, Michael Gertz","doi":"10.1145/3221269.3221289","DOIUrl":"https://doi.org/10.1145/3221269.3221289","url":null,"abstract":"Modeling the relations between the components of complex systems as networks of vertices and edges is a commonly used method in many scientific disciplines that serves to obtain a deeper understanding of the systems themselves. In particular, the detection of densely connected communities in these networks is frequently used to identify functionally related components, such as social circles in networks of personal relations or interactions between agents in biological networks. Traditionally, communities are considered to have a high density of internal connections, combined with a low density of external edges between different communities. However, not all naturally occurring communities in complex networks are characterized by this notion of structural equivalence, such as groups of energy states with shared quantum numbers in networks of spectral line transitions. In this paper, we focus on this inverse task of detecting anti-communities that are characterized by an exceptionally low density of internal connections and a high density of external connections. While anti-communities have been discussed in the literature for anecdotal applications or as a modification of traditional community detection, no rigorous investigation of algorithms for the problem has been presented. To this end, we introduce and discuss a broad range of possible approaches and evaluate them with regard to efficiency and effectiveness on a range of real-world and synthetic networks. Furthermore, we show that the presence of a community and anti-community structure are not mutually exclusive, and that even networks with a strong traditional community structure may also contain anti-communities.","PeriodicalId":365491,"journal":{"name":"Proceedings of the 30th International Conference on Scientific and Statistical Database Management","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130614548","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ERMrest: a web service for collaborative data management","authors":"K. Czajkowski, C. Kesselman, R. Schuler, H. Tangmunarunkit","doi":"10.1145/3221269.3222333","DOIUrl":"https://doi.org/10.1145/3221269.3222333","url":null,"abstract":"The foundation of data oriented scientific collaboration is the ability for participants to find, access and reuse data created during the course of an investigation, what has been referred to as the FAIR principles. In this paper, we describe ERMrest, a collaborative data management service that promotes data oriented collaboration by enabling FAIR data management throughout the data life cycle. ERMrest is a RESTful web service that promotes discovery and reuse by organizing diverse data assets into a dynamic entity relationship model. We present details on the design and implementation of ERMrest, data on its performance and its use by a range of collaborations to accelerate and enhance their scientific output.","PeriodicalId":365491,"journal":{"name":"Proceedings of the 30th International Conference on Scientific and Statistical Database Management","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130851217","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Proceedings of the 30th International Conference on Scientific and Statistical Database Management","authors":"","doi":"10.1145/3221269","DOIUrl":"https://doi.org/10.1145/3221269","url":null,"abstract":"","PeriodicalId":365491,"journal":{"name":"Proceedings of the 30th International Conference on Scientific and Statistical Database Management","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125935858","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}