Proceedings of the 29th International Conference on Scientific and Statistical Database Management: Latest Articles

Bi-Level Online Aggregation on Raw Data
Yu Cheng, Weijie Zhao, Florin Rusu
{"title":"Bi-Level Online Aggregation on Raw Data","authors":"Yu Cheng, Weijie Zhao, Florin Rusu","doi":"10.1145/3085504.3085514","DOIUrl":"https://doi.org/10.1145/3085504.3085514","url":null,"abstract":"In-situ processing has been proposed as a novel data exploration solution in many domains generating massive amounts of raw data, e.g., astronomy, since it provides immediate SQL querying over raw files. The performance of in-situ processing across a query workload is, however, limited by the speed of full scan, tokenizing, and parsing of the entire data. Online aggregation (OLA) has been introduced as an efficient method for data exploration that identifies uninteresting patterns faster by continuously estimating the result of a computation during the actual processing---the computation can be stopped as early as the estimate is accurate enough to be deemed uninteresting. However, existing OLA solutions have a high upfront cost of randomly shuffling and/or sampling the data. In this paper, we present OLA-RAW, a bi-level sampling scheme for parallel online aggregation over raw data. Sampling in OLA-RAW is query-driven and performed exclusively in-situ during the runtime query execution, without data reorganization. This is realized by a novel resource-aware bi-level sampling algorithm that processes data in random chunks concurrently and determines adaptively the number of sampled tuples inside a chunk. In order to avoid the cost of repetitive conversion from raw data, OLA-RAW builds and maintains a memory-resident bi-level sample synopsis incrementally. We implement OLA-RAW inside a modern in-situ data processing system and evaluate its performance across several real and synthetic datasets and file formats. Our results show that OLA-RAW chooses the sampling plan that minimizes the execution time and guarantees the required accuracy for each query in a given workload. 
The end result is a focused data exploration process that avoids unnecessary work and discards uninteresting data.","PeriodicalId":431308,"journal":{"name":"Proceedings of the 29th International Conference on Scientific and Statistical Database Management","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116917566","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 13
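The online aggregation idea summarized in the abstract above (estimate continuously, stop once the estimate is accurate enough) can be illustrated with a minimal sketch. This is not the OLA-RAW bi-level sampling algorithm: for simplicity it shuffles an in-memory list, which is exactly the upfront cost OLA-RAW is designed to avoid. The function name `online_mean` and its parameters (`rel_error`, `z`, `min_samples`) are hypothetical.

```python
import math
import random


def online_mean(values, rel_error=0.05, z=1.96, min_samples=30, seed=0):
    """Estimate the mean of `values` from a random-order scan, stopping early.

    Returns (estimate, half_width, samples_used). The scan stops once the
    normal-approximation confidence half-width is within `rel_error` of the
    running estimate.
    """
    rng = random.Random(seed)
    order = list(range(len(values)))
    rng.shuffle(order)  # simplification: stands in for random sampling of the data

    n, mean, m2 = 0, 0.0, 0.0
    half_width = float("inf")
    for idx in order:
        x = values[idx]
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)  # Welford's running sum of squared deviations
        if n >= min_samples:
            std_err = math.sqrt(m2 / (n - 1) / n)
            half_width = z * std_err
            if half_width <= rel_error * abs(mean):
                break  # estimate is accurate enough; skip the rest of the data
    return mean, half_width, n


if __name__ == "__main__":
    data = [10.0 + (i % 7) for i in range(10_000)]
    estimate, half_width, used = online_mean(data)
    print(f"estimate={estimate:.3f} +/- {half_width:.3f} after {used} of {len(data)} values")
```

The point of the sketch is only the stopping rule: most of the input is never touched once the confidence interval is tight enough.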
DualDB: An Efficient LSM-based Publish/Subscribe Storage System
Mohiuddin Abdul Qader, Vagelis Hristidis
{"title":"DualDB: An Efficient LSM-based Publish/Subscribe Storage System","authors":"Mohiuddin Abdul Qader, Vagelis Hristidis","doi":"10.1145/3085504.3085528","DOIUrl":"https://doi.org/10.1145/3085504.3085528","url":null,"abstract":"Publish/Subscribe systems allow subscribers to monitor for events of interest generated by publishers. Current publish/subscribe query systems are efficient when the subscriptions (queries) are relatively static -- for instance, the set of followers in Twitter -- or can fit in memory. However, an increasing number of applications in this era of Big Data and Internet of Things (IoT) are based on a highly dynamic query paradigm, where continuous queries are in the millions and are created and expire at a rate comparable to, or even higher than, that of the data (event) entries. For instance, moving objects like airplanes, cars or sensors may continuously generate measurement data like air pressure or traffic, which are consumed by other moving objects. In this paper we propose and compare a novel publish/subscribe storage architecture, DualDB, based on the popular NoSQL Log-Structured Merge Tree (LSM) storage paradigm, to support high-throughput and dynamic publish/subscribe systems. Our method naturally supports queries on both past and future data, and generates instant notifications, which are desirable properties missing from many previous systems. We implemented and experimentally evaluated our methods on the popular LSM-based LevelDB system, using real datasets. 
Our results show that we can achieve significantly higher throughput compared to state-of-the-art baselines.","PeriodicalId":431308,"journal":{"name":"Proceedings of the 29th International Conference on Scientific and Statistical Database Management","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128178520","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
SHRec: Scalable Holistic Recommendation
Ahmed M. Aly, M. Hammad, Amr Ahmed
{"title":"SHRec: Scalable Holistic Recommendation","authors":"Ahmed M. Aly, M. Hammad, Amr Ahmed","doi":"10.1145/3085504.3085523","DOIUrl":"https://doi.org/10.1145/3085504.3085523","url":null,"abstract":"The problem of recommending items to users is of high practical importance. For instance, many web services try to find relevant recommendations for the users, e.g., finding relevant movies, social-media friends, restaurants, shopping items, etc. The expansion of the Web and the ever-growing number of people who use web services render the problem of recommendation challenging. Locality Sensitive Hashing (LSH, for short) is the best-known scalable technique for nearest-neighbor search in high-dimensional data, and hence LSH is widely used in most industrial recommendation systems. This paper presents an implementation of LSH using Google's MapReduce engine. We apply LSH to a real case study at Google, where we recommend for each web-host a set of outlinks based on the outlink similarity amongst the web-hosts. We identify some performance limitations of LSH that occur due to specific properties in the data, and that become significant when the scale of the data is large. Furthermore, we present SHRec, a novel technique for scalable recommendation that addresses these performance limitations. Based on real deployment of both SHRec and LSH on Google's infrastructure, and using real data of the crawled Web at Google, where a sample host-level graph of 1.5 Billion web-hosts is extracted, we demonstrate that SHRec is more scalable than LSH. 
In particular, we show that SHRec is one order of magnitude faster than LSH while achieving better recommendation quality.","PeriodicalId":431308,"journal":{"name":"Proceedings of the 29th International Conference on Scientific and Statistical Database Management","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131921692","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
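For readers unfamiliar with the LSH baseline mentioned in the abstract above, a small single-machine sketch may help. It assumes MinHash signatures with banding, a common LSH family for set (Jaccard) similarity. The abstract does not say which LSH family the authors used, and this is neither their MapReduce implementation nor SHRec. All names (`minhash_signature`, `candidate_pairs`, the example hosts) are hypothetical.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def _hash(seed, item):
    """Deterministic 64-bit hash of `item`, parameterized by an integer seed."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes=32):
    """MinHash signature of a non-empty set: one minimum per seeded hash function."""
    return tuple(min(_hash(seed, it) for it in items) for seed in range(num_hashes))


def candidate_pairs(sets_by_key, num_hashes=32, bands=8):
    """Return pairs of keys whose signatures agree on at least one band."""
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for key, items in sets_by_key.items():
        sig = minhash_signature(items, num_hashes)
        for b in range(bands):
            # Keys sharing an identical band slice land in the same bucket.
            buckets[(b, sig[b * rows:(b + 1) * rows])].add(key)
    pairs = set()
    for keys in buckets.values():
        pairs.update(combinations(sorted(keys), 2))
    return pairs


if __name__ == "__main__":
    outlinks = {
        "a.example": {"x.example", "y.example", "z.example"},
        "b.example": {"x.example", "y.example", "z.example"},
        "c.example": {"p.example", "q.example"},
    }
    print(candidate_pairs(outlinks))
```

Only hosts that share a bucket are compared further, which is what lets the technique avoid an all-pairs comparison. Skewed data can make some buckets very large, which is one plausible reading of the performance limitations the abstract alludes to.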
Data Series Similarity Using Correlation-Aware Measures
Katsiaryna Mirylenka, Michele Dallachiesa, Themis Palpanas
{"title":"Data Series Similarity Using Correlation-Aware Measures","authors":"Katsiaryna Mirylenka, Michele Dallachiesa, Themis Palpanas","doi":"10.1145/3085504.3085515","DOIUrl":"https://doi.org/10.1145/3085504.3085515","url":null,"abstract":"The increased availability of unprecedented amounts of sequential data (generated by Internet-of-Things, as well as scientific applications) has led in the past few years to a renewed interest and attention to the field of data series processing and analysis. Data series collections are processed and analyzed using a large variety of techniques, most of which are based on the computation of some distance function. In this study, we revisit this basic operation of data series distance calculation. We observe that the popular distance measures are oblivious to the correlations inherent in neighboring values in a data series. Therefore, we evaluate the plausibility and benefit of incorporating into the distance function measures of correlation, which enable us to capture the associations among neighboring values in the sequence. We propose four such measures, inspired by statistical and probabilistic approaches, which can effectively model these correlations. We analytically and experimentally demonstrate the benefits of the new measures using the 1NN classification task, and discuss the lessons learned. 
Finally, we propose future research directions for enabling the proposed measures to be used in practice.","PeriodicalId":431308,"journal":{"name":"Proceedings of the 29th International Conference on Scientific and Statistical Database Management","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116290897","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 13
Discovering Partial Periodic Itemsets in Temporal Databases
R. U. Kiran, Haichuan Shang, Masashi Toyoda, M. Kitsuregawa
{"title":"Discovering Partial Periodic Itemsets in Temporal Databases","authors":"R. U. Kiran, Haichuan Shang, Masashi Toyoda, M. Kitsuregawa","doi":"10.1145/3085504.3085535","DOIUrl":"https://doi.org/10.1145/3085504.3085535","url":null,"abstract":"A temporal database is a collection of transactions, ordered by their timestamps. Discovering partial periodic itemsets in temporal databases has numerous applications. However, to the best of our knowledge, no work has considered finding these itemsets in temporal databases, even though this type of data is very common in real life. Discovering partial periodic itemsets in temporal databases is challenging. It requires defining (i) an appropriate measure to assess the periodic interestingness of itemsets, and (ii) an algorithm to efficiently find all partial periodic itemsets. While a pattern-growth algorithm can be employed for the second sub-task, the first sub-task has not been addressed. Moreover, how these two tasks are combined has significant implications. In this paper, we address this challenge. We introduce a model to find partial periodic itemsets in temporal databases. A new measure, called periodic-frequency, has been proposed to determine the periodic interestingness of itemsets by taking into account their number of cyclic repetitions in the entire data. Moreover, the paper introduces a pattern-growth algorithm to discover all partial periodic itemsets. 
Experimental results demonstrate that our model is efficient.","PeriodicalId":431308,"journal":{"name":"Proceedings of the 29th International Conference on Scientific and Statistical Database Management","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126075460","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 24
How the Passengers Flow in Complex Metro Networks?
Guandong Sun, Yun Xiong, Yangyong Zhu
{"title":"How the Passengers Flow in Complex Metro Networks?","authors":"Guandong Sun, Yun Xiong, Yangyong Zhu","doi":"10.1145/3085504.3085527","DOIUrl":"https://doi.org/10.1145/3085504.3085527","url":null,"abstract":"The understanding of passenger flow assignment in metro networks is critical for public transit management. However, the route chosen by a passenger is difficult to obtain directly from transaction records, which include only each trip's tap-in and tap-out timestamps and stations. In this paper, a two-stage framework for calculating passenger flow assignment in complex metro networks is proposed, named PaFA (Passenger Flow Assignment), by using smart card data. First, we design an acceleration search process to obtain all routes for each O-D pair and select the candidate routes under rules. Then, inspired by topic models, we observe that similar latent relationships can also be found among O-D pairs, candidate routes, and passengers' travel times. Along this line, we obtain the distribution of passenger flow in different candidate routes. Finally, a comprehensive evaluation with real-world data is conducted. The results demonstrate the enhanced performance of the proposed method.","PeriodicalId":431308,"journal":{"name":"Proceedings of the 29th International Conference on Scientific and Statistical Database Management","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126832280","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
Active Learning with Density-Initialized Decision Tree for Record Matching
Chenxiao Dou, Daniel W. Sun, Guoqiang Li, R. Wong
{"title":"Active Learning with Density-Initialized Decision Tree for Record Matching","authors":"Chenxiao Dou, Daniel W. Sun, Guoqiang Li, R. Wong","doi":"10.1145/3085504.3085518","DOIUrl":"https://doi.org/10.1145/3085504.3085518","url":null,"abstract":"One of the fundamental problems in data management and data integration fields is Record Matching, which refers to identifying records that relate to the same entities across different data sources. In recent literature, active learning has been demonstrated to be effective for record matching. One of the key steps of active learning is to build a proper initial classifier, with which active learning algorithms can quickly locate informative examples for training accurate models. However, in this process, example labelling for model training is usually expensive. Even worse, if a weak initial classifier is used, the labelling cost can be significantly increased. In this paper, we propose an unsupervised algorithm to determine the initial classifier. The process of classifier initialization requires no labelling cost. Then, building on our proposed algorithm, we present an active sampling method for selecting informative examples. The experiments show that our approach achieves competitive learning performance with much less labelling cost than other approaches of active learning.","PeriodicalId":431308,"journal":{"name":"Proceedings of the 29th International Conference on Scientific and Statistical Database Management","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125697547","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
Tiling Strategies for Distributed Point Cloud Databases
J. Szalai-Gindl, L. Dobos, I. Csabai
{"title":"Tiling Strategies for Distributed Point Cloud Databases","authors":"J. Szalai-Gindl, L. Dobos, I. Csabai","doi":"10.1145/3085504.3085537","DOIUrl":"https://doi.org/10.1145/3085504.3085537","url":null,"abstract":"Many large point clouds -- such as cosmological N-body simulations, intersections of road networks etc. -- are strongly clustered on a hierarchy of scales. In shared nothing distributed environments, optimized tiling of data is crucial to minimize cross-server communication and balance IO and processing load. We propose histogram-based tiling algorithms, a hierarchical tiling and a spectral clustering algorithm, that can be incorporated into the data extraction or transformation phase of a typical Extraction--Transformation--Loading (ETL) procedure. We define measures to characterize the performance of these tiling techniques with respect to typical spatial search operations, and evaluate the algorithms based on these measures using hierarchically clustered data sets.","PeriodicalId":431308,"journal":{"name":"Proceedings of the 29th International Conference on Scientific and Statistical Database Management","volume":"125 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128258336","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
PLI: Augmenting Live Databases with Custom Clustered Indexes
J. Wagner, A. Rasin, Dai Hai Ton That, T. Malik
{"title":"PLI: Augmenting Live Databases with Custom Clustered Indexes","authors":"J. Wagner, A. Rasin, Dai Hai Ton That, T. Malik","doi":"10.1145/3085504.3085582","DOIUrl":"https://doi.org/10.1145/3085504.3085582","url":null,"abstract":"RDBMSes only support one clustered index per database table that can speed up query processing. Database applications that continually ingest large amounts of data experience anything from slow query response times to long downtimes, as the clustered index ordering must be strictly maintained. In this paper, we show that application slowdown or downtime, however, can often be avoided if database systems expose the physical location of attributes that are completely or approximately clustered. Towards this, we propose PLI, a physical location index, constructed by determining the physical ordering of an attribute and creating approximately sorted buckets that map physical ordering with attribute values in a live database. To use a PLI, incoming SQL queries are simply rewritten with physical ordering information for that particular database. Experiments show queries with the PLI index significantly outperform queries using native unclustered (secondary) indexes, while the index itself requires much lower maintenance overhead when compared to native clustered indexes.","PeriodicalId":431308,"journal":{"name":"Proceedings of the 29th International Conference on Scientific and Statistical Database Management","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132800101","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
Detecting Global Hyperparaboloid Correlated Clusters Based on Hough Transform
Daniyal Kazempour, Markus Mauder, Peer Kröger, T. Seidl
{"title":"Detecting Global Hyperparaboloid Correlated Clusters Based on Hough Transform","authors":"Daniyal Kazempour, Markus Mauder, Peer Kröger, T. Seidl","doi":"10.1145/3085504.3085536","DOIUrl":"https://doi.org/10.1145/3085504.3085536","url":null,"abstract":"Correlation clustering detects complex and intricate relationships in high-dimensional data by identifying groups of data points, each characterized by a different correlation among a (sub)set of features. Current correlation clustering methods generally limit themselves to linear correlations only. In this paper, we introduce a method for detecting global non-linear correlated clusters focusing on quadratic relations. We introduce a novel Hough transform for the detection of hyperparaboloids and apply it to the detection of hyperparaboloid correlated clusters in arbitrary high-dimensional data spaces. Non-linear correlation clustering like our method can reveal valuable insights which are not covered by current linear versions. Our empirical results on synthetic and real world data reveal that the proposed method is robust against noise, jitter and irregular densities.","PeriodicalId":431308,"journal":{"name":"Proceedings of the 29th International Conference on Scientific and Statistical Database Management","volume":"60 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131012423","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6