2020 IEEE 36th International Conference on Data Engineering (ICDE): Latest Publications

PocketView: A Concise and Informative Data Summarizer

2020 IEEE 36th International Conference on Data Engineering (ICDE) Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00159

Yihai Xi, Ning Wang, Shuang Hao, Wenyang Yang, Li Li

Citations: 3

Distributed Streaming Set Similarity Join

2020 IEEE 36th International Conference on Data Engineering (ICDE) Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00055

Jianye Yang, W. Zhang, Xiang Wang, Ying Zhang, Xuemin Lin

Abstract: With the prevalence of Internet access and user generated content, a large number of documents/records, such as news and web pages, have been continuously generated in an unprecedented manner. In this paper, we study the problem of efficient stream set similarity join over distributed systems, which has broad applications in data cleaning and data integration tasks, such as on-line near-duplicate detection. In contrast to prefix-based distribution strategy which is widely adopted in offline distributed processing, we propose a simple yet efficient length-based distribution framework which dispatches incoming records by their length. A load-aware length partition method is developed to find a balanced partition by effectively estimating local join cost to achieve good load balance. Our length-based scheme is surprisingly superior to its competitors since it has no replication, small communication cost, and high throughput. We further observe that the join results from the current incoming record can be utilized to guide the index construction, which in turn can facilitate the join processing of future records. Inspired by this observation, we propose a novel bundle-based join algorithm by grouping similar records on-the-fly to reduce filtering cost. A by-product of this algorithm is an efficient verification technique, which verifies a batch of records by utilizing their token differences to share verification costs, rather than verifying them individually. Extensive experiments conducted on Storm, a popular distributed stream processing system, suggest that our methods can achieve up to one order of magnitude throughput improvement over baselines.

Pages: 565-576
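
Note: one common way to exploit record length in set similarity joins is the standard length filter for Jaccard similarity: two sets r and s can reach a similarity of at least t only if t·|r| ≤ |s| ≤ |r|/t. The Python sketch below illustrates how records could be dispatched to length partitions and probed under that filter. It is an illustration under stated assumptions, not the authors' algorithm: the partition boundaries, the class name `LengthPartitionedJoin`, and the single-process layout are hypothetical, and the paper's load-aware partitioning and bundle-based join are not reproduced.

```python
import math
from bisect import bisect_right


def jaccard(a, b):
    """Jaccard similarity of two token sets (1.0 if both are empty)."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def length_bounds(n, threshold):
    """Set sizes that can still reach Jaccard >= threshold against a set
    of size n: threshold * n <= size <= n / threshold."""
    eps = 1e-9  # guards against floating-point rounding at the boundaries
    return math.ceil(threshold * n - eps), math.floor(n / threshold + eps)


def partition_of(length, boundaries):
    """Index of the length partition; boundaries are hypothetical split points."""
    return bisect_right(boundaries, length)


class LengthPartitionedJoin:
    """Stores each record once, in the partition that owns its length."""

    def __init__(self, boundaries, threshold):
        self.boundaries = boundaries
        self.threshold = threshold
        self.partitions = [[] for _ in range(len(boundaries) + 1)]

    def insert_and_join(self, record):
        """Return earlier records similar to `record`, then index `record`."""
        tokens = frozenset(record)
        lo, hi = length_bounds(len(tokens), self.threshold)
        first = partition_of(lo, self.boundaries)
        last = partition_of(hi, self.boundaries)
        matches = []
        # Only partitions whose length range overlaps [lo, hi] are probed.
        for p in range(first, last + 1):
            for other in self.partitions[p]:
                if lo <= len(other) <= hi and jaccard(tokens, other) >= self.threshold:
                    matches.append(other)
        self.partitions[partition_of(len(tokens), self.boundaries)].append(tokens)
        return matches


# Example: with split points 4 and 8, lengths 0-3, 4-7, and 8+ form three partitions.
join = LengthPartitionedJoin(boundaries=[4, 8], threshold=0.8)
join.insert_and_join("abcde")
similar = join.insert_and_join("abcdef")  # probes only the 4-7 partition
```

In this sketch a record is never replicated; the cost is that a probe may touch more than one partition when the admissible length range straddles a split point.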

Citations: 10

Efficient Query Processing with Optimistically Compressed Hash Tables & Strings in the USSR

2020 IEEE 36th International Conference on Data Engineering (ICDE) Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00033

Tim Gubner, Viktor Leis, P. Boncz

Abstract: Modern query engines rely heavily on hash tables for query processing. Overall query performance and memory footprint is often determined by how hash tables and the tuples within them are represented. In this work, we propose three complementary techniques to improve this representation: Domain-Guided Prefix Suppression bit-packs keys and values tightly to reduce hash table record width. Optimistic Splitting decomposes values (and operations on them) into (operations on) frequently-accessed and infrequently-accessed value slices. By removing the infrequently-accessed value slices from the hash table record, it improves cache locality. The Unique Strings Self-aligned Region (USSR) accelerates handling frequently-occurring strings, which are very common in real-world data sets, by creating an on-the-fly dictionary of the most frequent strings. This allows executing many string operations with integer logic and reduces memory pressure. We integrated these techniques into Vectorwise. On the TPC-H benchmark, our approach reduces peak memory consumption by 2–4× and improves performance by up to 1.5×. On a real-world BI workload, we measured a 2× improvement in performance and in micro-benchmarks we observed speedups of up to 25×.

Pages: 301-312
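
Note: the idea of bit-packing keys according to their value domain can be sketched in a few lines of Python. The example assumes that a minimum and maximum are known for every column, subtracts the minimum, and stores each value in only as many bits as the remaining range needs. The function names (`column_bits`, `record_bits`, `pack`, `unpack`) and the sample domains are hypothetical, and the sketch is not the Vectorwise implementation described in the paper.

```python
def column_bits(lo, hi):
    """Bits needed for values in [lo, hi] once lo is subtracted.
    A constant column (lo == hi) needs zero bits."""
    return (hi - lo).bit_length()


def record_bits(domains):
    """Total width in bits of one packed record."""
    return sum(column_bits(lo, hi) for lo, hi in domains)


def pack(row, domains):
    """Pack one row of integers into a single integer word."""
    word = 0
    for value, (lo, hi) in zip(row, domains):
        if not lo <= value <= hi:
            raise ValueError(f"{value} is outside the domain [{lo}, {hi}]")
        # Shift left to make room, then store the offset from the minimum.
        word = (word << column_bits(lo, hi)) | (value - lo)
    return word


def unpack(word, domains):
    """Inverse of pack: recover the original row from the packed word."""
    values = []
    for lo, hi in reversed(domains):
        bits = column_bits(lo, hi)
        values.append((word & ((1 << bits) - 1)) + lo)
        word >>= bits
    return tuple(reversed(values))


# Hypothetical domains: a year, a flag, and a bounded quantity.
domains = [(1992, 1998), (0, 1), (100, 355)]
packed = pack((1995, 1, 200), domains)  # fits in 3 + 1 + 8 = 12 bits
restored = unpack(packed, domains)
```

Three plain 32-bit integers would take 96 bits per record here, so the narrower packed record is where the reduction in hash table record width comes from in this toy setting.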

Citations: 2

DynaMast: Adaptive Dynamic Mastering for Replicated Systems

2020 IEEE 36th International Conference on Data Engineering (ICDE) Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00123

Michael Abebe, Brad Glasbergen, Khuzaima S. Daudjee

Citations: 13

Kronos: Lightweight Knowledge-based Event Analysis in Cyber-Physical Data Streams

2020 IEEE 36th International Conference on Data Engineering (ICDE) Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00165

M. Namaki, Xin Zhang, Sukhjinder Singh, Arman Ahmed, Armina Foroutan, Yinghui Wu, A. Srivastava, Anton Kocheturov

Abstract: We demonstrate Kronos, a framework and system that automatically extracts highly dynamic knowledge for complex event analysis in Cyber-Physical systems. Kronos captures events with anomaly-based event model, and integrates various events by correlating with their temporal associations in realtime, from heterogeneous, continuous cyber-physical measurement data streams. It maintains a lightweight highly dynamic knowledge base, enabled by online, window-based ensemble learning and incremental association analysis for event detection and linkage, respectively. These algorithms incur time costs determined by available memory, independent of the size of streams. Exploiting the highly dynamic knowledge, Kronos supports a rich set of stream event analytical queries including event search (keywords and query-by-example), provenance queries ("which measurements or features are responsible for detected events?"), and root cause analysis. We demonstrate how the GUI of Kronos interacts with users to support both continuous and ad-hoc queries online and enables situational awareness in Cyber-power systems, communication, and traffic networks.

Pages: 1766-1769

Citations: 2

Differentially Private Online Task Assignment in Spatial Crowdsourcing: A Tree-based Approach

2020 IEEE 36th International Conference on Data Engineering (ICDE) Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00051

Qian Tao, Yongxin Tong, Zimu Zhou, Yexuan Shi, Lei Chen, Ke Xu

Abstract: With spatial crowdsourcing applications such as Uber and Waze deeply penetrated into everyday life, there is a growing concern to protect user privacy in spatial crowdsourcing. Particularly, locations of workers and tasks should be properly processed via certain privacy mechanism before reporting to the untrusted spatial crowdsourcing server for task assignment. Privacy mechanisms typically permute the location information, which tends to make task assignment ineffective. Prior studies only provide guarantees on privacy protection without assuring the effectiveness of task assignment. In this paper, we investigate privacy protection for online task assignment with the objective of minimizing the total distance, an important task assignment formulation in spatial crowdsourcing. We design a novel privacy mechanism based on Hierarchically Well-Separated Trees (HSTs). We prove that the mechanism is ε-Geo-Indistinguishable and show that there is a task assignment algorithm with a competitive ratio of $O\left(\frac{1}{\varepsilon^{4}} \log N \log^{2} k\right)$, where ε is the privacy budget, N is the number of predefined points on the HST, and k is the matching size. Extensive experiments on synthetic and real datasets show that online task assignment under our privacy mechanism is notably more effective in terms of total distance than under prior differentially private mechanisms.

Pages: 517-528
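
Note: for readers unfamiliar with the privacy notion named in the abstract above, the block below writes out the standard definition of ε-geo-indistinguishability, followed by the competitive ratio as the abstract states it. The symbols M (the mechanism), x and x' (true locations), Z (a set of reported outputs), and d (a distance on locations) are notation chosen here for illustration; the abstract does not say which distance d the paper uses.

```latex
% Standard definition: M is \varepsilon-geo-indistinguishable if
\Pr[M(x) \in Z] \;\le\; e^{\varepsilon\, d(x, x')} \cdot \Pr[M(x') \in Z]
\quad \text{for all locations } x, x' \text{ and all output sets } Z.

% Competitive ratio stated in the abstract
% (N: predefined points on the HST, k: matching size):
O\!\left(\frac{1}{\varepsilon^{4}} \log N \log^{2} k\right)
```

Under this definition a smaller ε makes nearby locations harder to tell apart, and the 1/ε⁴ factor in the stated ratio shows the corresponding price in assignment quality.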

Citations: 41

Array-based Data Management for Genomics

2020 IEEE 36th International Conference on Data Engineering (ICDE) Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00017

Olha Horlova, Abdulrahman Kaitoua, S. Ceri

Abstract: With the huge growth of genomic data, exposing multiple heterogeneous features of genomic regions for millions of individuals, we increasingly need to support domain-specific query languages and knowledge extraction operations, capable of aggregating and comparing trillions of regions arbitrarily positioned on the human genome. While row-based models for regions can be effectively used as a basis for cloud-based implementations, in previous work we have shown that the array-based model is effective in supporting the class of region-preserving operations, i.e. operations which do not create any new region but rather compose existing ones. In this paper, we remove the above constraint, and describe an array-based implementation which applies to unrestricted region operations, as required by the Genometric Query Language. Specifically, we define a wide spectrum of operations over datasets which are represented using arrays, and we show that the array-based implementation scales well upon Spark, also thanks to a data representation which is effectively used for supporting machine learning. Our benchmark, which uses an independent, pre-existing collection of queries, shows that in many cases the novel array-based implementation significantly improves the performance of the row-based implementation.

Pages: 109-120

Citations: 3

Efficient Entity Resolution on Heterogeneous Records (Extended abstract)

2020 IEEE 36th International Conference on Data Engineering (ICDE) Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.9238348

Yiming Lin, Hongzhi Wang, Jianzhong Li, Hong Gao

Citations: 1

Enabling Efficient Random Access to Hierarchically-Compressed Data

2020 IEEE 36th International Conference on Data Engineering (ICDE) Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00097

Feng Zhang, Jidong Zhai, Xipeng Shen, O. Mutlu, Xiaoyong Du

Citations: 13

HomoPAI: A Secure Collaborative Machine Learning Platform based on Homomorphic Encryption

2020 IEEE 36th International Conference on Data Engineering (ICDE) Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00152

Qifei Li, Zhicong Huang, Wen-jie Lu, Cheng Hong, Hunter Qu, Hui He, Weizhe Zhang

Citations: 11