2016 45th International Conference on Parallel Processing (ICPP): Latest Publications

Parallel Tree Traversal for Nearest Neighbor Query on the GPU
2016 45th International Conference on Parallel Processing (ICPP) Pub Date : 2016-08-01 DOI: 10.1109/ICPP.2016.20
Moohyeon Nam, Jinwoong Kim, Beomseok Nam
The similarity search problem arises in many application domains, including computer graphics, information retrieval, statistics, computational biology, and scientific data processing. Several recent studies have accelerated k-nearest neighbor (kNN) queries using GPUs, but most develop brute-force exhaustive scanning algorithms that leverage the large number of GPU cores, and none of the prior works employs GPUs for an n-ary tree-structured index. Multi-dimensional hierarchical index trees such as R-trees are known to be inherently ill-suited for GPUs because of their irregular tree traversal and memory access patterns. Traversing hierarchical tree structures in an irregular manner makes it difficult to exploit parallelism, since GPUs are tailored for deterministic memory accesses. In this work, we develop a data-parallel tree traversal algorithm, Parallel Scan and Backtrack (PSB), for kNN query processing on the GPU. PSB traverses a multi-dimensional tree-structured index while avoiding warp divergence. To take advantage of accessing contiguous memory blocks, PSB performs linear scanning of sibling leaf nodes, which increases the opportunity to optimize the parallel SIMD algorithm. We evaluate the performance of PSB against the classic branch-and-bound kNN query processing algorithm. Our experiments with real datasets show that PSB outperforms the branch-and-bound algorithm by a large margin.
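The linear leaf-scan at the heart of PSB can be illustrated with a small data-parallel sketch. NumPy's vectorized operations stand in for the GPU's SIMD lanes here; the function name and leaf layout are illustrative, not the authors' implementation:

```python
import numpy as np

def knn_leaf_scan(query, leaves, k):
    """Data-parallel kNN over contiguously stored leaf entries.

    `leaves` is an (n_points, dims) array holding all leaf entries in
    one contiguous block, mimicking PSB's linear scan of sibling
    leaves; every distance is computed in lockstep (no divergence).
    """
    # One vectorized pass over all leaf points: the SIMD-friendly part.
    dists = np.linalg.norm(leaves - query, axis=1)
    # Indices of the k smallest distances, sorted nearest-first.
    idx = np.argpartition(dists, k)[:k]
    return idx[np.argsort(dists[idx])]

points = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.5, 0.0]])
print(knn_leaf_scan(np.array([0.0, 0.0]), points, 2))  # -> [0 3]
```

A real GPU kernel would assign one thread per leaf entry and reduce the candidate lists per warp, but the contiguous-scan access pattern is the same.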
Citations: 9
Tetris Write: Exploring More Write Parallelism Considering PCM Asymmetries
2016 45th International Conference on Parallel Processing (ICPP) Pub Date : 2016-08-01 DOI: 10.1109/ICPP.2016.25
Zheng Li, F. Wang, D. Feng, Yu Hua, Wei Tong, Jingning Liu, Xiang Liu
Noise on the power lines limits the charge pump's ability to supply a large instantaneous current to PCM cells, which restricts the number of bits that can be written concurrently, i.e., the size of the write unit. When PCM is used as main memory, the mismatch between the cache line size and the write unit size can produce many consecutively executed write units, which greatly degrades system performance. Existing PCM write schemes, however, assume the worst-case power and time for the written data and ignore the actual current consumption: they assume that all data bits change and that the current budget of each data unit is fully consumed. Write performance is thus blocked by pessimistic estimates; current is often over-supplied but not used effectively, wasting energy. As a result, write parallelism is limited, restricting overall system performance. To address this problem, this paper proposes a novel PCM write scheme named Tetris Write, which exploits more write parallelism and reduces the critical number of write units in a PCM chip. The key idea behind Tetris Write is to monitor the number of '1's and '0's changed in each data unit and to schedule the order of the data units' write-1 and write-0 operations, considering not only the time and power asymmetries but also the count asymmetry between RESET and SET operations, so as to allow more concurrent bit-writes and make the best use of the power supply. Tetris Write schedules the dominant, long write-1s first and tries to steal the gaps they leave to fit in the short write-0s. Results from 4-core PARSEC benchmarks show that, on average, Tetris Write achieves a 65% read latency reduction, a 40% write latency reduction, a 46% running time reduction, and a 2X IPC improvement over the baseline. In addition, compared with the state-of-the-art Flip-N-Write, 2-Stage-Write, and Three-Stage-Write schemes (whose IPC improvements are 1.4X, 1.6X, and 1.8X, respectively), Tetris Write gains 26%, 15%, and 10% more read latency reduction, 15%, 7%, and 5% more write latency reduction, and 22%, 12%, and 7% more running time reduction.
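The power-budget packing idea can be caricatured with a toy greedy scheduler. Everything below — the per-bit power costs, the one-unit-per-slot granularity, the function name — is a made-up simplification for illustration, not the paper's scheduling algorithm:

```python
def tetris_schedule(units, power_budget, w1_power=3, w0_power=1):
    """Toy sketch of Tetris-style scheduling: issue the power-hungry
    write-1-heavy units first, then pack further units into whatever
    power headroom remains in the same round.

    `units` maps unit id -> (num_write1_bits, num_write0_bits); the
    per-bit power costs are invented constants, not PCM datasheet values.
    """
    rounds = []
    pending = dict(units)
    while pending:
        budget, batch = power_budget, []
        # Greedily schedule units, most write-1s (dominant cost) first.
        for uid in sorted(pending, key=lambda u: -pending[u][0]):
            w1, w0 = pending[uid]
            cost = w1 * w1_power + w0 * w0_power
            if cost <= budget:            # whole unit fits this round
                budget -= cost
                batch.append(uid)
        for uid in batch:
            del pending[uid]
        if not batch:                     # budget cannot fit any unit
            raise ValueError("power budget too small for a single unit")
        rounds.append(batch)
    return rounds

# Three data units; a budget of 16 lets two be written concurrently.
print(tetris_schedule({"A": (2, 1), "B": (1, 2), "C": (3, 0)}, 16))  # -> [['C', 'A'], ['B']]
```

The actual scheme additionally interleaves write-0 pulses into the time slack left by long write-1 pulses, which this power-only toy does not model.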
Citations: 9
Improving RAID Performance Using an Endurable SSD Cache
2016 45th International Conference on Parallel Processing (ICPP) Pub Date : 2016-08-01 DOI: 10.1109/ICPP.2016.52
Chu Li, D. Feng, Yu Hua, F. Wang
Parity-based RAID storage systems are widely deployed in production environments. However, they suffer from poor random write performance due to the parity update overhead, i.e., the small write problem. With increasing density and decreasing price, SSD-based caching offers promising opportunities for improving RAID storage I/O performance. As a cache device, however, an SSD wears out quickly under frequent writes, causing high costs and reliability problems. In this paper, we propose an efficient cache management scheme that Keeps Data and Deltas (KDD) in the SSD. KDD dynamically partitions the cache space into a Data Zone (DAZ) and a Delta Zone (DEZ). The DAZ stores data that are first admitted into the SSD. On write hits, KDD writes the data to RAID storage without updating the parity blocks; meanwhile, the deltas between the old version of the data and the currently accessed data are compactly stored in the DEZ. In addition, KDD organizes the metadata partition on the SSD as a circular log to make the cache persistent with low overhead. We evaluate the performance of KDD via both simulations and prototype implementations. The results show that KDD effectively reduces the small write penalty while significantly improving the lifetime of the SSD-based cache.
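The two-zone data-plus-delta idea can be sketched in a few lines. The class below is a toy model under assumed semantics (XOR deltas, dictionary zones); the real KDD manages fixed-size SSD regions, compaction, and a persistent metadata log:

```python
def make_delta(old: bytes, new: bytes) -> bytes:
    """XOR delta between the cached old block and newly written data."""
    return bytes(a ^ b for a, b in zip(old, new))

def apply_delta(old: bytes, delta: bytes) -> bytes:
    return bytes(a ^ d for a, d in zip(old, delta))

class KDDCache:
    """Toy sketch of Keeping-Data-and-Deltas: the Data Zone (DAZ) holds
    first-admitted blocks; the Delta Zone (DEZ) holds compact deltas
    produced on write hits. Names and structure are illustrative."""
    def __init__(self):
        self.daz = {}   # block id -> data as first admitted
        self.dez = {}   # block id -> latest delta against DAZ copy

    def read(self, blk):
        data = self.daz[blk]
        return apply_delta(data, self.dez[blk]) if blk in self.dez else data

    def write(self, blk, new):
        if blk in self.daz:               # write hit: store delta only
            self.dez[blk] = make_delta(self.daz[blk], new)
        else:                             # first admission goes to DAZ
            self.daz[blk] = new

cache = KDDCache()
cache.write(7, b"old data")
cache.write(7, b"new data")
print(cache.read(7))  # -> b'new data'
```

Storing the delta instead of overwriting the DAZ copy is what lets the scheme defer the RAID parity update while keeping SSD write traffic small.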
Citations: 17
Efficient 2-Body Statistics Computation on GPUs: Parallelization & Beyond
2016 45th International Conference on Parallel Processing (ICPP) Pub Date : 2016-08-01 DOI: 10.1109/ICPP.2016.50
Napath Pitaksirianan, Zhila Nouri, Yi-Cheng Tu
Various types of two-body statistics (2-BS) are essential components of data analysis in many scientific and computing domains. Because of their quadratic time complexity, using modern parallel hardware has become an obvious direction for research and practice in 2-BS computation. This paper presents our recent work on designing and optimizing parallel algorithms for 2-BS computation on Graphics Processing Units (GPUs). First, we classify 2-body applications into three groups based on their data output patterns. We then introduce a straightforward parallel algorithm under the CUDA framework, split into two stages: pairwise distance function computation and output writing. Next, we present modifications to the basic algorithm that integrate various techniques at each stage. Our algorithm design focuses on the effective use of hardware and software features unique to GPU platforms. Experiments on modern GPU hardware show that our GPU algorithms outperform the best known CPU program by at least an order of magnitude in various applications. Furthermore, our implementation achieves a very high level of GPU resource utilization, indicating near-optimal performance. This work builds a solid foundation toward realizing our vision of a framework that can automatically generate optimized code for any new 2-BS problem.
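A canonical 2-BS example is the two-point distance histogram; the sketch below is a compact CPU reference of the two stages named above (distance computation, then output writing), the kind of baseline a GPU kernel would be validated against. It is O(n²), like the algorithms in the paper; the function name and bin choices are illustrative:

```python
import numpy as np

def two_point_histogram(points, bin_edges):
    """Histogram of all pairwise distances, a classic 2-body statistic.

    Stage 1: evaluate the pairwise distance function over all pairs.
    Stage 2: write output -- here, accumulate distances into bins.
    """
    n = len(points)
    # All-pairs differences via broadcasting: (n, n, dims).
    diff = points[:, None, :] - points[None, :, :]
    d = np.sqrt((diff ** 2).sum(-1))
    # Keep each unordered pair exactly once (upper triangle, no diagonal).
    iu = np.triu_indices(n, k=1)
    hist, _ = np.histogram(d[iu], bins=bin_edges)
    return hist

pts = np.array([[0.0, 0.0], [3.0, 4.0], [0.0, 1.0]])
print(two_point_histogram(pts, [0.0, 2.0, 6.0]))  # -> [1 2]
```

On a GPU the same computation is typically tiled through shared memory, with the output stage using atomic adds into per-block histograms; the output pattern is exactly what the paper's classification captures.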
Citations: 0
Exploiting Real-Time Traffic Light Scheduling with Taxi Traces
2016 45th International Conference on Parallel Processing (ICPP) Pub Date : 2016-08-01 DOI: 10.1109/ICPP.2016.43
Zongjian He, Daqiang Zhang, Jiannong Cao, Xuefeng Liu, Xiaopeng Fan, Chengzhong Xu
Traffic lights in urban areas significantly influence the efficiency and effectiveness of transportation. Real-time traffic light scheduling information is fundamentally important for many intelligent transportation applications, such as shortest-time navigation and green driving advisory. However, existing traffic light schedule identification systems either require dedicated infrastructure or depend on specialized traffic traces, which hinders their adoption and real-world deployment. Instead, we propose to identify real-time traffic light schedules by analyzing taxi traces, which are widely available from taxi companies. The key idea is to exploit the periodicity in traffic patterns, which is directly shaped by traffic lights. We also develop algorithms to identify red/green light durations and signal change times. We evaluate our solution using over one billion taxi records from Shenzhen, China. The evaluation results validate the effectiveness of our system.
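The periodicity-extraction idea can be demonstrated with a minimal autocorrelation sketch: bin vehicle pass timestamps near an intersection into a time series and pick the lag with the strongest self-similarity as the cycle length. The binning, lag range, and synthetic trace below are all illustrative assumptions; the paper's algorithms are considerably more elaborate:

```python
import numpy as np

def estimate_cycle(pass_times, bin_s=1.0, min_period=20, max_period=120):
    """Estimate a traffic light's cycle length (seconds) from vehicle
    pass timestamps via autocorrelation of the binned arrival series."""
    t = np.asarray(pass_times, dtype=float)
    n_bins = int(t.max() / bin_s) + 1
    series, _ = np.histogram(t, bins=n_bins, range=(0, n_bins * bin_s))
    series = series - series.mean()          # remove the DC component
    lags = np.arange(min_period, max_period)
    ac = [np.dot(series[:-lag], series[lag:]) for lag in lags]
    return int(lags[int(np.argmax(ac))] * bin_s)

# Synthetic trace: cars pass in short bursts every 60 s (green phases).
times = [c + 60 * k for k in range(20) for c in (0, 2, 4, 6)]
print(estimate_cycle(times))  # -> 60
```

Real taxi traces are far sparser and noisier per intersection, which is why the scale of the dataset (a billion records) matters for this approach.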
Citations: 3
TECH: A Thermal-Aware and Cost Efficient Mechanism for Colocation Demand Response
2016 45th International Conference on Parallel Processing (ICPP) Pub Date : 2016-08-01 DOI: 10.1109/ICPP.2016.60
Ziqi Zhao, Fan Wu, Shaolei Ren, Xiaofeng Gao, Guihai Chen, Yong Cui
Data centers are promising participants in emergency demand response (EDR) programs, in which power grids incentivize large energy consumers to reduce energy consumption during emergencies to avoid potentially huge financial losses. However, in multi-tenant colocation data centers, tenants manage their own servers and often sign fixed energy contracts with the data center operator, and thus have no incentive to contribute to EDR. To solve this problem, several studies have investigated market-based mechanisms that incentivize tenants to reduce their server energy consumption for EDR. Nonetheless, these purely market-based studies are severely limited in one or both of the following key aspects. (1) Lack of coordination with the cooling system: being thermally unaware, the existing mechanisms leave the supplied cooling air temperature at an unnecessarily low level to avoid server overheating, resulting in cooling energy inefficiency. (2) Violation of cost efficiency: a mechanism must be implemented cost-efficiently so that operators do not lose financial interest, which many of the existing mechanisms fail to ensure. This work proposes a novel thermal-aware and cost-efficient mechanism, called TECH, which coordinates tenants' energy reduction in concert with cooling system control to enable colocation EDR in a cost-efficient way.
Citations: 10
Fault Tolerant Support Vector Machines
2016 45th International Conference on Parallel Processing (ICPP) Pub Date : 2016-08-01 DOI: 10.1109/ICPP.2016.75
Sameh M. Shohdy, Abhinav Vishnu, G. Agrawal
Support Vector Machines (SVM) is a popular machine learning algorithm used for building classifiers and models. Parallel implementations of SVM that run on large-scale supercomputers are becoming commonplace. However, these supercomputers, designed under data-movement constraints, frequently experience faults in compute devices, and many device faults manifest as permanent process/node failures. In this paper, we present several approaches for designing fault-tolerant SVM algorithms. First, we present an in-depth analysis to identify the critical data structures and build baseline algorithms that simply checkpoint these data structures periodically. Next, we propose a novel algorithm that requires no inter-node data movement for checkpointing and only O(n²/p²) recovery time, a small fraction of the expected O(n³/p) time complexity of SVM. We implement these algorithms and evaluate them on a large-scale cluster. Our evaluation indicates that the overall data movement for checkpointing in the baseline algorithm can be up to 100x the dataset size, while the proposed algorithm requires no communication for checkpointing. In addition, it saves up to 20x space while providing better recovery time (an average 5.5x speedup on 256 cores) than the baseline algorithm across different numbers of checkpoints. The experiments also show that, in the case of failure, our communication-avoiding algorithm outperforms the Spark MLlib SVM implementation by an average of 6.4x on 256 cores.
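The baseline checkpoint-and-roll-back pattern can be sketched in miniature. The SVM-specific structures (alpha vector, gradient cache) are abstracted into a single number here, and the fault model is a simple roll-back trigger; everything is illustrative, not the paper's algorithm:

```python
def train_with_checkpoints(grad_steps, ckpt_every=2, fail_at=None):
    """Periodically checkpoint the critical optimizer state so that
    after a fault, training resumes from the last checkpoint instead
    of restarting from scratch."""
    w = 0.0
    ckpt = (0, 0.0)                    # (step, state) at last checkpoint
    step = 0
    while step < len(grad_steps):
        if fail_at is not None and step == fail_at:
            step, w = ckpt             # fault: roll back to checkpoint
            fail_at = None             # fault handled once
            continue
        w += grad_steps[step]          # one "training" update
        step += 1
        if step % ckpt_every == 0:
            ckpt = (step, w)
    return w

updates = [0.5, 0.25, 0.125, 0.0625]
# Same final state with and without a mid-run failure at step 3.
assert train_with_checkpoints(updates) == train_with_checkpoints(updates, fail_at=3)
print(train_with_checkpoints(updates, fail_at=3))  # -> 0.9375
```

The paper's contribution is precisely avoiding the inter-node data movement this baseline implies: each node keeps what it needs to rebuild its share locally, which is why recovery drops to O(n²/p²).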
Citations: 4
SWAP-Assembler 2: Optimization of De Novo Genome Assembler at Extreme Scale
2016 45th International Conference on Parallel Processing (ICPP) Pub Date : 2016-08-01 DOI: 10.1109/ICPP.2016.29
Jintao Meng, Sangmin Seo, P. Balaji, Yanjie Wei, Bingqiang Wang, Shengzhong Feng
In this paper, we analyze and optimize the most time-consuming steps of SWAP-Assembler, a parallel genome assembler, so that it scales to a large number of cores for huge genomes with sequencing data ranging from terabytes to petabytes. Performance analysis shows that the most time-consuming steps are input parallelization, k-mer graph construction, and graph simplification (edge merging). For input parallelization, the input data is divided into virtual fragments of nearly equal size, with the start and end positions of each fragment automatically aligned to the beginnings of reads. In k-mer graph construction, to improve communication efficiency, the message size is kept constant between any two processes by increasing the number of nucleotides processed per round proportionally to the number of processes. Memory usage also decreases because only a small part of the input data is processed in each round. For graph simplification, the communication protocol reduces the number of communication loops from four to two and decreases idle communication time. The optimized assembler is denoted SWAP-Assembler 2 (SWAP2). In our experiments on the supercomputer Mira with a 4-terabyte dataset from the 1000 Genomes Project (the largest dataset ever used for assembly), SWAP2 scales to 131,072 cores with an efficiency of 40%. We also compared our work with the HipMer assembler and the original SWAP-Assembler. On the 300-gigabyte Yanhuang dataset, SWAP2 shows a 3X speedup and 4X better scalability compared with HipMer and is 45 times faster than SWAP-Assembler. The SWAP2 software is available at https://sourceforge.net/projects/swapassembler.
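The input-parallelization step — cutting the input into near-equal fragments and sliding each cut to a read boundary — is easy to sketch. The FASTA-style '>' marker and function name below are assumptions for illustration; the real assembler handles FASTQ and further edge cases:

```python
def split_at_records(data: bytes, n_fragments: int, marker: bytes = b">"):
    """Cut `data` into nearly equal fragments, then move each cut
    forward to the next record marker so every fragment starts at the
    beginning of a read (no read is split across fragments)."""
    cuts = [0]
    for i in range(1, n_fragments):
        pos = i * len(data) // n_fragments     # ideal equal-size cut
        pos = data.find(marker, pos)           # align to next record start
        if pos == -1:
            pos = len(data)
        cuts.append(pos)
    cuts.append(len(data))
    return [data[a:b] for a, b in zip(cuts, cuts[1:])]

reads = b">r1\nACGT\n>r2\nTTGA\n>r3\nGGCC\n"
frags = split_at_records(reads, 2)
print(frags)  # each fragment starts at a '>' record boundary
```

Because every process can compute its own cut from the file length alone and then seek locally to the boundary, fragment assignment needs no coordination — the property that makes this step scale.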
Citations: 17
Exploring Variation-Aware Fault-Tolerant Cache under Near-Threshold Computing
2016 45th International Conference on Parallel Processing (ICPP) Pub Date : 2016-08-01 DOI: 10.1109/ICPP.2016.24
Jing Wang, Yanjun Liu, Wei-gong Zhang, Kezhong Lu, Keni Qiu, Xin Fu, Tao Li
Near-threshold voltage computing enables transistor voltage scaling to continue along Moore's Law projections and dramatically improves power and energy efficiency. However, reducing the supply voltage to near-threshold levels significantly increases the susceptibility of on-chip caches to process variations, leading to high error rates. Most existing fault-tolerant schemes significantly sacrifice cache capacity and performance. In this paper, we propose a novel fault-tolerant cache architecture for near-threshold computing that is suitable for high-error-rate memories. We first propose a variation-aware skewed-associative cache, and then redirect faulty blocks to error-free blocks on top of it to realize the fault-tolerant cache design. Unlike previous cache reconfiguration schemes for fault tolerance, our design does not need to sacrifice or disable any fault-free blocks to form a completely functional set: we use all error-free blocks and waste the least cache capacity. More importantly, since aging can also cause cell failures, our skewed cache takes the aggregate impact of process variation and aging into consideration. Last but not least, our skewed cache design avoids complex remapping from faulty blocks to error-free blocks and minimizes hardware overhead. Our evaluation results show that the variation-aware fault-tolerant cache design tolerates high error rates well, and its effectiveness in reducing the cache miss rate and improving performance becomes even more pronounced as the supply voltage scales down to the near-threshold region.
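Why a skewed-associative cache is a natural substrate for fault redirection can be seen in a few lines: each way uses a different index hash, so a block that lands on a faulty SRAM slot in one way usually has a fault-free alternative in another way. The hash mixing below is invented for illustration, not the paper's function:

```python
def skewed_ways(addr, n_sets, n_ways):
    """Skewed-associative indexing: a different hash per way, so a
    block maps to a *different* set in each way and two blocks that
    conflict in one way rarely conflict in another."""
    return [((addr >> w) ^ (addr * (2 * w + 3))) % n_sets for w in range(n_ways)]

def place(addr, faulty, n_sets=8, n_ways=2):
    """Redirect around faulty SRAM blocks: try each way's (set, way)
    candidate slot and use the first fault-free one, so no whole set
    has to be disabled (a toy version of the redirection idea)."""
    for way, s in enumerate(skewed_ways(addr, n_sets, n_ways)):
        if (s, way) not in faulty:
            return (s, way)
    return None   # every candidate slot is faulty

addr = 0x1A
candidates = [(s, w) for w, s in enumerate(skewed_ways(addr, 8, 2))]
# Marking the first candidate slot faulty redirects to the second.
print(place(addr, faulty={candidates[0]}))
```

In a conventional set-associative cache, all ways of a set share one index, so a cluster of faults in that set forces capacity loss; skewing decorrelates the candidates and lets every fault-free block stay usable.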
Citations: 2
CoARC: Co-operative, Aggressive Recovery and Caching for Failures in Erasure Coded Hadoop
2016 45th International Conference on Parallel Processing (ICPP) Pub Date : 2016-08-01 DOI: 10.1109/ICPP.2016.40
P. Subedi, Ping Huang, Tong Liu, Joseph Moore, S. Skelton, Xubin He
Cloud file systems like Hadoop have become the norm for handling big data because of their easy scaling and distributed storage layout. However, these systems are susceptible to failures, and data must be recovered when a failure is detected. During temporary failures, MapReduce jobs or file system clients perform degraded reads to satisfy read requests. We argue that the lack of sharing of recovered data during degraded reads, together with the recovery of only the requested data block, places a heavy strain on the system's network resources and increases job execution time. To this end, we propose CoARC (Co-operative, Aggressive Recovery and Caching), a new data-recovery mechanism for data that is unavailable during degraded reads in distributed file systems. The main idea is to recover not only the requested data block but also the other temporarily unavailable blocks in the same stripe, and to cache them on a separate data node. We also propose an LRF (Least Recently Failed) cache replacement algorithm for such recovery caches. We show that CoARC significantly reduces network usage and job runtime in erasure-coded Hadoop.
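The "aggressive" part of the idea rests on a property of erasure codes worth making concrete: once the surviving blocks of a stripe have been fetched over the network, rebuilding an unavailable block is pure local computation, so recovering additional blocks in the same stripe costs almost nothing extra. The sketch below uses single XOR parity (RAID-5-like) as a simplification of the erasure codes in the paper:

```python
def xor_decode(stripe, missing):
    """Rebuild the one missing block of a single-parity stripe by
    XORing the surviving blocks. With the survivors already in hand,
    this step involves no further network traffic -- CoARC's rationale
    for recovering and caching every unavailable block in the stripe,
    not just the one a client asked for."""
    out = bytes(len(next(b for b in stripe if b is not None)))
    for i, blk in enumerate(stripe):
        if i != missing and blk is not None:
            out = bytes(a ^ b for a, b in zip(out, blk))
    return out

d0, d1 = b"\x01\x02", b"\x10\x20"
parity = bytes(a ^ b for a, b in zip(d0, d1))
stripe = [None, d1, parity]            # block 0 temporarily unavailable
print(xor_decode(stripe, 0))           # -> b'\x01\x02'
```

Production erasure codes (e.g. Reed-Solomon) tolerate multiple erasures per stripe, which strengthens the argument: one round of survivor fetches can repair several unavailable blocks at once, and the LRF policy then decides how long to keep them cached.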
Citations: 5