2016 IEEE 32nd International Conference on Data Engineering (ICDE)最新文献_第2页

Optimizing secure classification performance with privacy-aware feature selection 通过隐私感知特征选择优化安全分类性能

2016 IEEE 32nd International Conference on Data Engineering (ICDE) Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498242

Erman Pattuk, Murat Kantarcioglu, Huseyin Ulusoy, B. Malin

{"title":"Optimizing secure classification performance with privacy-aware feature selection","authors":"Erman Pattuk, Murat Kantarcioglu, Huseyin Ulusoy, B. Malin","doi":"10.1109/ICDE.2016.7498242","DOIUrl":"https://doi.org/10.1109/ICDE.2016.7498242","url":null,"abstract":"Recent advances in personalized medicine point towards a future where clinical decision making will be dependent upon the individual characteristics of the patient, e.g., their age, race, genomic variation, and lifestyle. Already, there are numerous commercial entities working towards the provision of software to support such decisions as cloud-based services. However, deployment of such services in such settings raises important challenges for privacy. A recent attack shows that disclosing personalized drug dosage recommendations, combined with several pieces of demographic knowledge, can be leveraged to infer single nucleotide polymorphism variants of a patient. One manner to prevent such inference is to apply secure multi-party computation (SMC) techniques that hide all patient data, so that no information, including the clinical recommendation, is disclosed during the decision making process. Yet, SMC is a computationally cumbersome process and disclosing some information may be necessary for various compliance purposes. Additionally, certain information (e.g., demographic information) may already be publicly available. In this work, we provide a novel approach to selectively disclose certain information before the SMC process to significantly improve personalized decision making performance while preserving desired levels of privacy. To achieve this goal, we introduce mechanisms to quickly compute the loss in privacy due to information disclosure while considering its performance impact on SMC execution phase. Our empirical analysis show that we can achieve up to three orders of magnitude improvement compared to pure SMC solutions with only a slight increase in privacy risks.","PeriodicalId":6883,"journal":{"name":"2016 IEEE 32nd International Conference on Data Engineering (ICDE)","volume":"19 1","pages":"217-228"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76763927","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 5

Efficient handling of concept drift and concept evolution over Stream Data 有效地处理流数据上的概念漂移和概念演变

2016 IEEE 32nd International Conference on Data Engineering (ICDE) Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498264

Ahsanul Haque, L. Khan, M. Baron, B. Thuraisingham, C. Aggarwal

{"title":"Efficient handling of concept drift and concept evolution over Stream Data","authors":"Ahsanul Haque, L. Khan, M. Baron, B. Thuraisingham, C. Aggarwal","doi":"10.1109/ICDE.2016.7498264","DOIUrl":"https://doi.org/10.1109/ICDE.2016.7498264","url":null,"abstract":"To decide if an update to a data stream classifier is necessary, existing sliding window based techniques monitor classifier performance on recent instances. If there is a significant change in classifier performance, these approaches determine a chunk boundary, and update the classifier. However, monitoring classifier performance is costly due to scarcity of labeled data. In our previous work, we presented a semi-supervised framework SAND, which uses change detection on classifier confidence to detect a concept drift. Unlike most approaches, it requires only a limited amount of labeled data to detect chunk boundaries and to update the classifier. However, SAND is expensive in terms of execution time due to exhaustive invocation of the change detection module. In this paper, we present an efficient framework, which is based on the same principle as SAND, but exploits dynamic programming and executes the change detection module selectively. Moreover, we provide theoretical justification of the confidence calculation, and show effect of a concept drift on subsequent confidence scores. Experiment results show efficiency of the proposed framework in terms of both accuracy and execution time.","PeriodicalId":6883,"journal":{"name":"2016 IEEE 32nd International Conference on Data Engineering (ICDE)","volume":"38 1","pages":"481-492"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78117274","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 82

Collaborative analytics for data silos 数据孤岛的协作分析

2016 IEEE 32nd International Conference on Data Engineering (ICDE) Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498286

Jinkyu Kim, Heonseok Ha, Byung-Gon Chun, Sungroh Yoon, S. Cha

{"title":"Collaborative analytics for data silos","authors":"Jinkyu Kim, Heonseok Ha, Byung-Gon Chun, Sungroh Yoon, S. Cha","doi":"10.1109/ICDE.2016.7498286","DOIUrl":"https://doi.org/10.1109/ICDE.2016.7498286","url":null,"abstract":"As a great deal of data has been accumulated in various disciplines, the need for the integrative analysis of separate but relevant data sources is becoming more important. Combining data sources can provide global insight that is otherwise difficult to obtain from individual sources. Because of privacy, regulations, and other issues, many large-scale data repositories remain closed off from the outside, raising what has been termed the data silo issue. The huge volume of today's big data often leads to computational challenges, adding another layer of complexity to the solution. In this paper, we propose a novel method called collaborative analytics by ensemble learning (CABEL), which attempts to resolve the main hurdles regarding the silo issue: accuracy, privacy, and computational efficiency. CABEL represents the data stored in each silo as a compact aggregate of samples called the silo signature. The compact representation provides computational efficiency and privacy preservation but makes it challenging to produce accurate analytics. To resolve this challenge, we formulate the problem of attribute domain sampling and reconstruction, and propose a solution called the Chebyshev subset. To model collaborative efforts to analyze semantically linked but structurally disconnected databases, CABEL utilizes a new ensemble learning technique termed the weighted bagging of base classifiers. We demonstrate the effectiveness of CABEL by testing with a nationwide health-insurance data set containing approximately 4,182,000,000 records collected from the entire population of an Organisation for Economic Co-operation and Development (OECD) country in 2012. In our binary classification tests, CABEL achieved median recall, precision, and F-measure values of 89%, 64%, and 76%, respectively, although only 0.001-0.00001% of the original data was used for model construction, while maintaining data privacy and computational efficiency.","PeriodicalId":6883,"journal":{"name":"2016 IEEE 32nd International Conference on Data Engineering (ICDE)","volume":"152 1","pages":"743-754"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74904219","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 12

Anonymizing collections of tree-structured data 对树状结构数据集合进行匿名化

2016 IEEE 32nd International Conference on Data Engineering (ICDE) Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498404

Olga Gkountouna, Manolis Terrovitis

引用次数: 17

DataXFormer: A robust transformation discovery system DataXFormer:一个健壮的转换发现系统

2016 IEEE 32nd International Conference on Data Engineering (ICDE) Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498319

Ziawasch Abedjan, J. Morcos, I. Ilyas, M. Ouzzani, Paolo Papotti, M. Stonebraker

引用次数: 59

Load balancing and skew resilience for parallel joins 并行连接的负载平衡和倾斜弹性

2016 IEEE 32nd International Conference on Data Engineering (ICDE) Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498250

Aleksandar Vitorovic, Mohammed Elseidy, Christoph E. Koch

引用次数: 25

Flow-Join: Adaptive skew handling for distributed joins over high-speed networks Flow-Join:高速网络上分布式连接的自适应倾斜处理

2016 IEEE 32nd International Conference on Data Engineering (ICDE) Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498324

Wolf Rödiger, S. Idicula, A. Kemper, Thomas Neumann

{"title":"Flow-Join: Adaptive skew handling for distributed joins over high-speed networks","authors":"Wolf Rödiger, S. Idicula, A. Kemper, Thomas Neumann","doi":"10.1109/ICDE.2016.7498324","DOIUrl":"https://doi.org/10.1109/ICDE.2016.7498324","url":null,"abstract":"Modern InfiniBand interconnects offer link speeds of several gigabytes per second and a remote direct memory access (RDMA) paradigm for zero-copy network communication. Both are crucial for parallel database systems to achieve scalable distributed query processing where adding a server to the cluster increases performance. However, the scalability of distributed joins is threatened by unexpected data characteristics: Skew can cause a severe load imbalance such that a single server has to process a much larger part of the input than its fair share and by this slows down the entire distributed query. We introduce Flow-Join, a novel distributed join algorithm that handles attribute value skew with minimal overhead. Flow-Join detects heavy hitters at runtime using small approximate histograms and adapts the redistribution scheme to resolve load imbalances before they impact the join performance. Previous approaches often involve expensive analysis phases, which slow down distributed join processing for non-skewed workloads. This is especially the case for modern high-speed interconnects, which are too fast to hide the extra computation. Other skew handling approaches require detailed statistics, which are often not available or overly inaccurate for intermediate results. In contrast, Flow-Join uses our novel lightweight skew handling scheme to execute at the full network speed of more than 6 GB/s for InfiniBand 4×FDR, joining a skewed input at 11.5 billion tuples/s with 32 servers. This is 6.8× faster than a standard distributed hash join using the same hardware. At the same time, Flow-Join does not compromise the join performance for non-skewed workloads.","PeriodicalId":6883,"journal":{"name":"2016 IEEE 32nd International Conference on Data Engineering (ICDE)","volume":"14 1","pages":"1194-1205"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89689275","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 65

Discovering interpretable geo-social communities for user behavior prediction 为用户行为预测发现可解释的地理社会社区

2016 IEEE 32nd International Conference on Data Engineering (ICDE) Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498303

Hongzhi Yin, Zhiting Hu, Xiaofang Zhou, Hao Wang, Kai Zheng, Nguyen Quoc Viet Hung, S. Sadiq

{"title":"Discovering interpretable geo-social communities for user behavior prediction","authors":"Hongzhi Yin, Zhiting Hu, Xiaofang Zhou, Hao Wang, Kai Zheng, Nguyen Quoc Viet Hung, S. Sadiq","doi":"10.1109/ICDE.2016.7498303","DOIUrl":"https://doi.org/10.1109/ICDE.2016.7498303","url":null,"abstract":"Social community detection is a growing field of interest in the area of social network applications, and many approaches have been developed, including graph partitioning, latent space model, block model and spectral clustering. Most existing work purely focuses on network structure information which is, however, often sparse, noisy and lack of interpretability. To improve the accuracy and interpretability of community discovery, we propose to infer users' social communities by incorporating their spatiotemporal data and semantic information. Technically, we propose a unified probabilistic generative model, User-Community-Geo-Topic (UCGT), to simulate the generative process of communities as a result of network proximities, spatiotemporal co-occurrences and semantic similarity. With a well-designed multi-component model structure and a parallel inference implementation to leverage the power of multicores and clusters, our UCGT model is expressive while remaining efficient and scalable to growing large-scale geo-social networking data. We deploy UCGT to two application scenarios of user behavior predictions: check-in prediction and social interaction prediction. Extensive experiments on two large-scale geo-social networking datasets show that UCGT achieves better performance than existing state-of-the-art comparison methods.","PeriodicalId":6883,"journal":{"name":"2016 IEEE 32nd International Conference on Data Engineering (ICDE)","volume":"2 1","pages":"942-953"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91534110","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 117

A novel, low-latency algorithm for multiple Group-By query optimization 一种新颖的、低延迟的多Group-By查询优化算法

2016 IEEE 32nd International Conference on Data Engineering (ICDE) Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498249

Duy-Hung Phan, P. Michiardi

引用次数: 5

ORLF: A flexible framework for online record linkage and fusion ORLF:一个灵活的在线记录链接和融合框架

2016 IEEE 32nd International Conference on Data Engineering (ICDE) Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498349

E. Rezig, Eduard Constantin Dragut, M. Ouzzani, A. Elmagarmid, Walid G. Aref

引用次数: 6