2016 IEEE 32nd International Conference on Data Engineering (ICDE): latest papers, page 6

Input selection for fast feature engineering

2016 IEEE 32nd International Conference on Data Engineering (ICDE) Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498272

Michael R. Anderson, Michael J. Cafarella

Abstract: The application of machine learning to large datasets has become a vital component of many important and sophisticated software systems built today. Such trained systems are often based on supervised learning tasks that require features, signals extracted from the data that distill complicated raw data objects into a small number of salient values. A trained system's success depends substantially on the quality of its features. Unfortunately, feature engineering (the process of writing code that takes raw data objects as input and outputs feature vectors suitable for a machine learning algorithm) is a tedious, time-consuming experience. Because “big data” inputs are so diverse, feature engineering is often a trial-and-error process requiring many small, iterative code changes. Because the inputs are so large, each code change can involve a time-consuming data processing task (over each page in a Web crawl, for example). We introduce Zombie, a data-centric system that accelerates feature engineering through intelligent input selection, optimizing the “inner loop” of the feature engineering process. Our system yields feature evaluation speedups of up to 8× in some cases and reduces engineer wait times from 8 to 5 hours in others.

Pages: 577-588

Citations: 39

Authentication of function queries

2016 IEEE 32nd International Conference on Data Engineering (ICDE) Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498252

Guolei Yang, Ying Cai, Zhenbi Hu

Citations: 11

Recommendations meet web browsing: enhancing collaborative filtering using internet browsing logs

2016 IEEE 32nd International Conference on Data Engineering (ICDE) Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498327

Royi Ronen, E. Yom-Tov, G. Lavee

Citations: 16

Moolle: Fan-out control for scalable distributed data stores

2016 IEEE 32nd International Conference on Data Engineering (ICDE) Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498325

Sun-Yeong Cho, A. Carter, J. Ehrlich, J. A. Jan

Abstract: Many Online Social Networks horizontally partition data across data stores. This allows the addition of server nodes to increase capacity and throughput. For single-key lookup queries such as computing a member's 1st degree connections, clients need to generate only one request to one data store. However, for multi-key lookup queries such as computing a 2nd degree network, clients need to generate multiple requests to multiple data stores. The number of requests needed to fulfill multi-key lookup queries grows in relation to the number of partitions. Increasing the number of server nodes in order to increase capacity also increases the number of requests between the client and data stores. This may increase query latency because of network congestion, tail latency, and CPU bounding. Replication-based partitioning strategies can reduce the number of requests in multi-key lookup queries. However, reducing the number of requests in a query can degrade the performance of certain queries where processing, computing, and filtering can be done by the data stores. A better system would provide the capability of controlling the number of requests in a query. This paper presents Moolle, a system for controlling the number of requests in queries to scalable distributed data stores. Moolle has been implemented in the LinkedIn distributed graph service that serves hundreds of thousands of social graph traversal queries per second. We believe that Moolle can be applied to other distributed systems that handle distributed data processing with a high volume of variable-sized requests.

Pages: 1206-1217

Citations: 4
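The Moolle abstract above notes that a multi-key lookup needs one request per data store it touches, so fan-out grows with the number of partitions. The sketch below only illustrates that relationship; it is not Moolle's mechanism. The hash-based placement rule and the names `partition_of`, `plan_requests`, and `fan_out` are assumptions made for illustration.

```python
import zlib
from collections import defaultdict


def partition_of(key: str, num_partitions: int) -> int:
    # Hypothetical placement rule: a stable CRC32 hash modulo the partition count.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def plan_requests(keys, num_partitions):
    """Group keys by partition; the client sends one request per group."""
    batches = defaultdict(list)
    for key in keys:
        batches[partition_of(key, num_partitions)].append(key)
    return dict(batches)


def fan_out(keys, num_partitions):
    """Number of requests a multi-key lookup generates under this placement."""
    return len(plan_requests(keys, num_partitions))


if __name__ == "__main__":
    second_degree = [f"member:{i}" for i in range(200)]
    for partitions in (1, 4, 16, 64):
        print(partitions, fan_out(second_degree, partitions))
```

A single-key lookup always yields a fan-out of 1, while a 200-key lookup can touch up to `min(200, num_partitions)` stores, which is the growth the abstract describes.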

TRANSFORMERS: Robust spatial joins on non-uniform data distributions

2016 IEEE 32nd International Conference on Data Engineering (ICDE) Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498280

Mirjana Pavlovic, T. Heinis, F. Tauheed, Panagiotis Karras, A. Ailamaki

Abstract: Spatial joins are becoming increasingly ubiquitous in many applications, particularly in the scientific domain. While several approaches have been proposed for joining spatial datasets, each of them has a strength for a particular type of density ratio among the joined datasets. More generally, no single proposed method can efficiently join two spatial datasets in a robust manner with respect to their data distributions. Some approaches do well for datasets with contrasting densities while others do better with similar densities. None of them does well when the datasets have locally divergent data distributions. In this paper we develop TRANSFORMERS, an efficient and robust spatial join approach that is indifferent to such variations of distribution among the joined data. TRANSFORMERS achieves this feat by departing from the state-of-the-art through adapting the join strategy and data layout to local density variations among the joined data. It employs a join method based on data-oriented partitioning when joining areas of substantially different local densities, whereas it uses big partitions (as in space-oriented partitioning) when the densities are similar, while seamlessly switching between these two strategies at runtime. We experimentally demonstrate that TRANSFORMERS outperforms state-of-the-art approaches by a factor of between 2 and 8.

Pages: 673-684

Citations: 8

TOPIC: Toward perfect Influence Graph Summarization

2016 IEEE 32nd International Conference on Data Engineering (ICDE) Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498314

Lei Shi, Sibai Sun, Yuan Xuan, Yue Su, Hanghang Tong, Shuai Ma, Yang Chen

Abstract: Summarizing large influence graphs is crucial for many graph visualization and mining tasks. Classical graph clustering and compression algorithms focus on summarizing the nodes by their structural-level or attribute-level similarities, but usually are not designed to characterize the flow-level pattern which is the centerpiece of influence graphs. On the other hand, social influence analysis has been intensively studied, but little has been done on the summarization problem without an explicit focus on social networks. Building on the recent study of Influence Graph Summarization (IGS), this paper presents a new perspective on the underlying flow-based heuristic. It establishes a direct linkage between the optimal summarization and the classic eigenvector centrality of the graph nodes. Such a theoretical linkage has important implications for numerous aspects of the pursuit of a perfect influence graph summarization. In particular, it enables us to develop a suite of algorithms that can: 1) achieve a near-optimal IGS objective, 2) support dynamic summarizations balancing the IGS objective and the stability of transition in navigating the summarization, and 3) scale to million-node graphs with a near-linear computational complexity. Both quantitative experiments on real-world citation networks and user studies on the task analysis experience demonstrate the effectiveness of the proposed summarization algorithms.

Pages: 1074-1085

Citations: 7
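The TOPIC abstract above links optimal summarization to the classic eigenvector centrality of graph nodes. As a reminder of what that quantity is, here is a minimal power-iteration sketch. It is not the paper's algorithm. It assumes a small, connected, undirected graph given as a dense 0/1 adjacency matrix, whereas influence graphs are typically directed, and the function name and parameters are illustrative.

```python
def eigenvector_centrality(adj, iterations=200, tol=1e-10):
    """Dominant eigenvector of a symmetric adjacency matrix by power iteration."""
    n = len(adj)
    x = [1.0 / n] * n
    for _ in range(iterations):
        # Multiply by (A + I) rather than A: the shift keeps the same
        # eigenvectors but avoids oscillation on bipartite graphs.
        y = [x[i] + sum(adj[i][j] * x[j] for j in range(n)) for i in range(n)]
        norm = sum(v * v for v in y) ** 0.5
        y = [v / norm for v in y]
        if max(abs(a - b) for a, b in zip(x, y)) < tol:
            return y
        x = y
    return x


if __name__ == "__main__":
    # Star graph: node 0 is connected to nodes 1, 2, and 3.
    star = [
        [0, 1, 1, 1],
        [1, 0, 0, 0],
        [1, 0, 0, 0],
        [1, 0, 0, 0],
    ]
    print(eigenvector_centrality(star))
```

On the star graph the hub receives the highest score and the three leaves receive equal scores, which matches the intuition that a node is central when its neighbors are central.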

Self-Adaptive Linear Hashing for solid state drives

2016 IEEE 32nd International Conference on Data Engineering (ICDE) Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498260

Chengcheng Yang, Peiquan Jin, Lihua Yue, Dezhi Zhang

Abstract: Flash memory based solid state drives (SSDs) have emerged as a new alternative to replace magnetic disks due to their high performance and low power consumption. However, random writes on SSDs are much slower than SSD reads. Therefore, traditional index structures, which are designed based on the symmetrical I/O property of magnetic disks, cannot fully exploit the high performance of SSDs. In this paper, we propose an SSD-optimized linear hashing index called Self-Adaptive Linear Hashing (SAL-Hashing) to reduce small random writes to SSDs that are caused by index operations. The contributions of our work are manifold. First, we propose to organize buckets into groups and sets to facilitate coarse-grained writes and lazy-split so as to avoid intermediate writes on the hash structure. A group consists of a fixed number of buckets and a set consists of a number of groups. Second, we attach a log region to each set, and amortize the cost of reads and writes by committing updates to the log region in batch. Third, in order to reduce search cost, each log region is equipped with Bloom filters to index update logs. We devise a cost-based online algorithm to adaptively merge the log region with the corresponding set when the set becomes search-intensive. Finally, in order to exploit the internal package-level parallelism of SSDs, we apply coarse-grained writes for merge or split operations to achieve a high bandwidth. Our experimental results suggest that our proposal is self-adaptive according to the change of access patterns, and outperforms several competitors under various workloads on two commodity SSDs.

Pages: 433-444

Citations: 8
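SAL-Hashing, as described in the abstract above, builds on linear hashing. For orientation, the sketch below shows only textbook in-memory linear hashing, where one bucket is split at a time as the load grows. It implements none of the paper's contributions (groups, sets, log regions, Bloom filters, lazy-split), and the class name, parameters, and load threshold are assumptions chosen for illustration.

```python
class LinearHashTable:
    """Textbook linear hashing: buckets are split one at a time, in order."""

    def __init__(self, initial_buckets=2, max_load=2.0):
        self.n0 = initial_buckets      # bucket count at level 0
        self.level = 0                 # completed doubling rounds
        self.split = 0                 # next bucket to split in this round
        self.max_load = max_load       # average keys per bucket before a split
        self.buckets = [[] for _ in range(initial_buckets)]
        self.count = 0

    def _address(self, key):
        h = hash(key)
        addr = h % (self.n0 << self.level)
        if addr < self.split:
            # This bucket was already split in the current round,
            # so use the next-level hash function instead.
            addr = h % (self.n0 << (self.level + 1))
        return addr

    def insert(self, key):
        self.buckets[self._address(key)].append(key)
        self.count += 1
        if self.count / len(self.buckets) > self.max_load:
            self._split_one()

    def _split_one(self):
        old = self.buckets[self.split]
        self.buckets[self.split] = []
        self.buckets.append([])        # image bucket for the one being split
        self.split += 1
        if self.split == self.n0 << self.level:
            # Every bucket of this round has been split: start the next round.
            self.level += 1
            self.split = 0
        for key in old:
            self.buckets[self._address(key)].append(key)

    def contains(self, key):
        return key in self.buckets[self._address(key)]


if __name__ == "__main__":
    table = LinearHashTable()
    for k in range(100):
        table.insert(k)
    print(len(table.buckets), table.level, table.split)
```

Each split rewrites two buckets immediately, which on an SSD would mean small random writes; the abstract's lazy-split and log regions are aimed at avoiding exactly that cost.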

Cruncher: Distributed in-memory processing for location-based services

2016 IEEE 32nd International Conference on Data Engineering (ICDE) Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498356

A. S. Abdelhamid, Mingjie Tang, Ahmed M. Aly, Ahmed R. Mahmood, Thamir M. Qadah, Walid G. Aref, Saleh M. Basalamah

Abstract: Advances in location-based services (LBS) demand high-throughput processing of both static and streaming data. Recently, many systems have been introduced to support distributed main-memory processing to maximize the query throughput. However, these systems are not optimized for spatial data processing. In this demonstration, we showcase Cruncher, a distributed main-memory spatial data warehouse and streaming system. Cruncher extends Spark with adaptive query processing techniques for spatial data. Cruncher uses dynamic batch processing to distribute the queries and the data streams over commodity hardware according to an adaptive partitioning scheme. The batching technique also groups and orders the overlapping spatial queries to enable inter-query optimization. Both the data streams and the offline data share the same partitioning strategy that allows for data co-locality optimization. Furthermore, Cruncher uses an adaptive caching strategy to maintain the frequently-used location data in main memory. Cruncher maintains operational statistics to optimize query processing, data partitioning, and caching at runtime. We demonstrate two LBS applications over Cruncher using real datasets from OpenStreetMap and two synthetic data streams. We demonstrate that Cruncher achieves order(s) of magnitude throughput improvement over Spark when processing spatial data.

Pages: 1406-1409

Citations: 10

SEED: A system for entity exploration and debugging in large-scale knowledge graphs

2016 IEEE 32nd International Conference on Data Engineering (ICDE) Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498342

Jun Chen, Yueguo Chen, Xiaoyong Du, Xiangling Zhang, Xuan Zhou

Citations: 9

Mining social ties beyond homophily

2016 IEEE 32nd International Conference on Data Engineering (ICDE) Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498259

Hongwei Liang, Ke Wang, Feida Zhu

Citations: 4