{"title":"A Heterogeneity-Aware Region-Level Data Layout for Hybrid Parallel File Systems","authors":"Shuibing He, Xian-He Sun, Yang Wang, Antonios Kougkas, Adnan Haider","doi":"10.1109/ICPP.2015.43","DOIUrl":"https://doi.org/10.1109/ICPP.2015.43","url":null,"abstract":"Parallel file systems (PFS) are commonly used in high-end computing systems. With the emergence of solid state drives (SSD), hybrid PFSs, which consist of both HDD and SSD servers, provide a practical I/O system solution for data-intensive applications. However, most existing PFS layout schemes are inefficient for hybrid PFSs due to their lack of awareness of the performance differences between heterogeneous servers and the workload changes between different parts of a file. This lack of recognition can result in severe I/O performance degradation. In this study, we propose a heterogeneity-aware region-level (HARL) data layout scheme to improve the data distribution of a hybrid PFS. HARL first divides a file into fine-grained, varying sized regions according to the changes of an application's I/O workload, then chooses appropriate file stripe sizes on heterogeneous servers based on the server performance for each file region. Experimental results of representative benchmarks show that HARL can greatly improve the I/O system performance.","PeriodicalId":423007,"journal":{"name":"2015 44th International Conference on Parallel Processing","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128355501","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
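As a rough illustration of the stripe-size idea in the HARL abstract above, the following Python sketch splits one stripe round of a file region across servers in proportion to an assumed per-server bandwidth. The proportional rule, the function name `proportional_stripes`, the 4 KiB alignment, and the bandwidth figures are all assumptions made for illustration; the abstract does not describe HARL's actual cost model.

```python
def proportional_stripes(total_bytes, bandwidths, align=4096):
    """Split total_bytes across servers in proportion to integer bandwidths.

    Hypothetical helper: each share is rounded down to a multiple of
    `align`, and the rounding leftover goes to the fastest server.
    """
    if not bandwidths or any(bw <= 0 for bw in bandwidths):
        raise ValueError("bandwidths must be a non-empty list of positive integers")
    total_bw = sum(bandwidths)
    # Proportional share per server, rounded down to the alignment unit.
    stripes = [(total_bytes * bw // total_bw) // align * align for bw in bandwidths]
    # Give the bytes lost to rounding to the first fastest server.
    fastest = max(range(len(bandwidths)), key=lambda i: bandwidths[i])
    stripes[fastest] += total_bytes - sum(stripes)
    return stripes


# Two HDD servers and two SSD servers with assumed bandwidths in MB/s.
print(proportional_stripes(1048576, [100, 100, 400, 400]))
# → [102400, 102400, 425984, 417792]
```

In this sketch the faster servers receive larger stripes, so all servers would finish their share of a request at roughly the same time under the assumed bandwidths.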
{"title":"A Penalty Aware Memory Allocation Scheme for Key-Value Cache","authors":"Jianqiang Ou, Marc Patton, M. D. Moore, Yuehai Xu, Song Jiang","doi":"10.1109/ICPP.2015.62","DOIUrl":"https://doi.org/10.1109/ICPP.2015.62","url":null,"abstract":"Key-value caches, represented by Memcached, play a critical role in data centers. Their efficacy can significantly impact users' perceived service time and back-end systems' workloads. A central issue in the in-memory cache's management is memory allocation, or how the limited space is distributed for storing key-value items of various sizes. When a cache is full, the allocation issue is how to conduct replacement operations on items of different sizes. To effectively address the issue, a practitioner must simultaneously consider three factors, which are access locality, item size, and miss penalty. Existing designs consider only one or both of the first two factors, and pay little attention to miss penalty. This inadequacy can substantially compromise utilization of cache space and request service time. In this paper we propose a Penalty Aware Memory Allocation scheme (PAMA) that takes all three factors into account. While the three different factors cannot be directly compared to each other in a quantitative manner, PAMA uses their impacts on service time to determine where a unit of memory space should be (de)allocated. The impacts are quantified as the decrease (or increase) of service time if a unit of space is allocated (or deallocated). PAMA efficiently tracks access patterns and memory use, and speculatively evaluates the impacts to enable penalty-aware memory allocation for KV caches. Our evaluation with real-world Memcached workload traces demonstrates that PAMA can significantly reduce request service time compared to other representative KV cache management schemes.","PeriodicalId":423007,"journal":{"name":"2015 44th International Conference on Parallel Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128429219","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"REED: A Reliable Energy-Efficient RAID","authors":"Shu Yin, Xuewu Li, Kenli Li, Jianzhong Huang, X. Ruan, Xiaomin Zhu, Wei Cao, X. Qin","doi":"10.1109/ICPP.2015.74","DOIUrl":"https://doi.org/10.1109/ICPP.2015.74","url":null,"abstract":"Recent studies indicate that the energy cost and carbon footprint of data centers have become exorbitant. It is a demanding and challenging task to reduce energy consumption in large-scale storage systems in modern data centers. Most energy conservation techniques inevitably have adverse impacts on parallel disk systems. To address the reliability issues of energy-efficient parallel disks, we propose a reliable energy-efficient RAID system called REED, which aims at improving both energy efficiency and reliability of RAID systems by seamlessly integrating HDDs and SSDs. At the heart of REED is a high-performance cache mechanism powered by SSDs, which serve popular data. Under light workload conditions, REED spins down HDDs into the low-power mode, thereby offering energy conservation. Importantly, during I/O access turbulence (i.e., when the I/O load changes dynamically and frequently), REED helps reduce the number of disk power-state transitions by keeping HDDs in the low-power mode while serving requests with SSDs. We build a model to quantitatively show that REED is capable of improving the reliability of energy-efficient RAIDs. We implement the REED prototype in a real-world RAID-0 system. Our experimental results demonstrate that REED improves the energy-efficiency of conventional RAID-0 by up to 73% while maintaining good reliability.","PeriodicalId":423007,"journal":{"name":"2015 44th International Conference on Parallel Processing","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130678704","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Responsive Knapsack-Based Algorithm for Resource Provisioning and Scheduling of Scientific Workflows in Clouds","authors":"M. A. Rodriguez, R. Buyya","doi":"10.1109/ICPP.2015.93","DOIUrl":"https://doi.org/10.1109/ICPP.2015.93","url":null,"abstract":"Scientific workflows are used to process vast amounts of data and to conduct large-scale experiments and simulations. They are time-consuming and resource-intensive applications that benefit from running in distributed platforms. In particular, scientific workflows can greatly leverage the ease-of-access, affordability, and scalability offered by cloud computing. To achieve this, innovative and efficient ways of orchestrating the workflow tasks and managing the compute resources in a cost-conscious manner need to be developed. We propose an adaptive resource provisioning and scheduling algorithm for scientific workflows deployed in Infrastructure as a Service clouds. Our algorithm was designed to address challenges specific to clouds such as the pay-as-you-go model, the performance variation of resources and the on-demand access to unlimited, heterogeneous virtual machines. It is capable of responding to the dynamics of the cloud infrastructure and is successful in generating efficient solutions that meet a user-defined deadline and minimise the overall cost of the used infrastructure. Our simulation experiments demonstrate that it performs better than other state-of-the-art algorithms.","PeriodicalId":423007,"journal":{"name":"2015 44th International Conference on Parallel Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131925582","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bit Flipping Errors in High Performance Linpack at Exascale and Beyond","authors":"Erlin Yao, Guangming Tan","doi":"10.1109/ICPP.2015.51","DOIUrl":"https://doi.org/10.1109/ICPP.2015.51","url":null,"abstract":"For the High Performance Linpack (HPL) benchmark at the coming Exascale and beyond, silent errors like bit flipping in memory are expected to become inevitable. However, since bit flipping errors are difficult to detect and locate, their impact on the numerical correctness of HPL has not been evaluated thoroughly and quantitatively, even though HPL at Exascale is especially susceptible to them. In this paper, we present an initial quantitative analysis of the impact of bit flipping errors on the numerical correctness of HPL. To validate the numerical correctness of the computed solution, HPL performs a residual check after the approximate solution is obtained. We show that if a single bit is flipped in any element of the original data matrix, and the flipped position is not the leading position of the exponent, the residual check in HPL will almost surely pass at the scale of Exaflops and beyond. Experiments on modified HPL in single precision at small scales have verified the theoretical results in double precision at Exascale. The results obtained in this paper can provide a better understanding of the impact of bit flipping errors on the numerical correctness of scientific computing applications.","PeriodicalId":423007,"journal":{"name":"2015 44th International Conference on Parallel Processing","volume":"129 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131416342","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
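To make the bit positions mentioned in the abstract above concrete, this Python sketch flips one chosen bit in the IEEE 754 binary64 encoding of a value. The helper name `flip_bit` and the sample value are illustrative assumptions; the sketch only shows how the size of the perturbation depends on the bit position and does not reproduce the paper's residual analysis.

```python
import struct


def flip_bit(x: float, bit: int) -> float:
    """Return x with one bit of its IEEE 754 binary64 encoding flipped.

    Bit 0 is the least significant mantissa bit, bits 52..62 hold the
    exponent (62 is its leading bit), and bit 63 is the sign.
    """
    if not 0 <= bit < 64:
        raise ValueError("bit must be in the range 0..63")
    # Reinterpret the double as a 64-bit unsigned integer.
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    # Toggle the requested bit and reinterpret the result as a double.
    (flipped,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))
    return flipped


print(flip_bit(1.0, 0))   # → 1.0000000000000002 (lowest mantissa bit)
print(flip_bit(1.0, 52))  # → 0.5 (lowest exponent bit)
print(flip_bit(1.0, 62))  # → inf (leading exponent bit)
print(flip_bit(1.0, 63))  # → -1.0 (sign bit)
```

For the value 1.0, a flip in the leading exponent bit changes the magnitude far more than a flip anywhere in the mantissa, which is why that position is singled out in the abstract.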
{"title":"Optimal Node Selection for Data Regeneration in Heterogeneous Distributed Storage Systems","authors":"Qingyuan Gong, Jiaqi Wang, Dongsheng Wei, Jin Wang, Xin Wang","doi":"10.1109/ICPP.2015.48","DOIUrl":"https://doi.org/10.1109/ICPP.2015.48","url":null,"abstract":"Distributed storage systems introduce redundancy to protect data from node failures. After a storage node fails, the lost data should be regenerated at a replacement storage node as soon as possible to maintain the same level of redundancy. Minimizing such a regeneration time is critical to the reliability of distributed storage systems. Existing work aims to reduce the regeneration time by either minimizing the regenerating traffic or adjusting the regenerating traffic patterns, whereas the nodes participating in the regeneration are generally assumed to be given beforehand. However, real-world distributed storage systems usually exhibit heterogeneous link capacities, and the regeneration time is highly related to the selection of the participating nodes. In this paper, we consider the minimization of the regeneration time by selecting the participating nodes in heterogeneous networks. We propose optimal node selection algorithms for two cases: 1) the newcomer is not given, 2) both the newcomer and the providers are not given. Analysis shows that the optimal regeneration time can be achieved in each case. We then consider the effect of a flexible amount of data blocks from each provider on the regeneration time, and apply this observation to enhance our schemes. Experimental results show that our node selection schemes can significantly reduce the regeneration time, especially in practical networks with heterogeneous link capacities, compared with the scheme based on random node selection.","PeriodicalId":423007,"journal":{"name":"2015 44th International Conference on Parallel Processing","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131494629","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient Use of Hardware Transactional Memory for Parallel Mesh Generation","authors":"Tetsu Kobayashi, Shigeyuki Sato, H. Iwasaki","doi":"10.1109/ICPP.2015.69","DOIUrl":"https://doi.org/10.1109/ICPP.2015.69","url":null,"abstract":"Efficient transactional executions are desirable for parallel implementations of algorithms with graph refinements. Hardware transactional memory (HTM) is promising for easy yet efficient transactional executions. Long HTM transactions, however, abort with high probability because of hardware limitations. Unfortunately, Delaunay mesh refinement (DMR), which is an algorithm with graph refinements for mesh generation, causes long transactions. Its parallel implementation naively based on HTM therefore leads to poor performance. To utilize HTM efficiently for parallel implementation of DMR, we present an approach to shortening transactions. Our HTM-based implementations of DMR achieved significantly higher throughput and better scalability than a naive HTM-based one and lock-based ones. On a quad-core Haswell processor, the absolute speedup of one of our implementations was up to 2.64 with 16 threads.","PeriodicalId":423007,"journal":{"name":"2015 44th International Conference on Parallel Processing","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133122618","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reflex-Tree: A Biologically Inspired Parallel Architecture for Future Smart Cities","authors":"Jason Kane, Bo Tang, Zhen Chen, Jun Yan, Tao Wei, Haibo He, Qing Yang","doi":"10.1109/ICPP.2015.45","DOIUrl":"https://doi.org/10.1109/ICPP.2015.45","url":null,"abstract":"We introduce a new parallel computing and communication architecture, Reflex-Tree, with massive sensing, data processing, and control functions suitable for future smart cities. The central feature of the proposed Reflex-Tree architecture is inspired by a fundamental element of the human nervous system: reflex arcs, the neuromuscular reactions and instinctive motions of a part of the body in response to urgent situations. At the bottom level of the Reflex-Tree (layer 4), novel sensing devices are proposed that are controlled by low power processing elements. These \"leaf\" nodes are then connected to new classification engines based on machine learning techniques, including support vector machines (SVM), to form the third layer. The next layer up consists of servers that provide accurate control decisions via multi-layer adaptive learning and spatial-temporal association, before they are connected to the top level cloud where complex system behavior analysis is performed. Our multi-layered architecture mimics human neural circuits to achieve the high levels of parallelization and scalability required for efficient city-wide monitoring and feedback. To demonstrate the utility of our architecture, we present the design, implementation, and experimental evaluation of a prototype Reflex-Tree. City power supply network and gas pipeline management scenarios are used to drive our prototype as case studies. We show the effectiveness for several levels of the architecture and discuss the feasibility of implementation.","PeriodicalId":423007,"journal":{"name":"2015 44th International Conference on Parallel Processing","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123350581","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Connecting the Dots: Reconstructing Network Behavior with Individual and Lossy Logs","authors":"Jiliang Wang, Xiaolong Zheng, Xufei Mao, Zhichao Cao, Daibo Liu, Yunhao Liu","doi":"10.1109/ICPP.2015.26","DOIUrl":"https://doi.org/10.1109/ICPP.2015.26","url":null,"abstract":"In distributed networks such as wireless ad hoc networks, local and lossy logs are often available on individual nodes. We propose REFILL, which analyzes lossy and unsynchronized logs collected from individual nodes and reconstructs the network behaviors. We design an inference engine based on protocol semantics to abstract states on each node. Further, we leverage inherent and implicit event correlations in and between nodes to connect inference engines and analyze logs from different nodes. Based on unsynchronized and incomplete logs, REFILL can reconstruct network behavior, recover the network scenario and understand what has happened in the network. We show that the result of REFILL can be used to guide protocol design, network management, diagnosis, etc. We implement REFILL and apply it to a large-scale wireless sensor network project. REFILL provides detailed per-packet tracing information based on event flows. We show that REFILL can reveal and verify fundamental issues, like locating packet loss positions and root causes. Further, we present implications and demonstrate how to leverage REFILL to enhance network performance.","PeriodicalId":423007,"journal":{"name":"2015 44th International Conference on Parallel Processing","volume":"88 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122679391","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Accelerating I/O Performance of Big Data Analytics on HPC Clusters through RDMA-Based Key-Value Store","authors":"Nusrat S. Islam, D. Shankar, Xiaoyi Lu, Md. Wasi-ur-Rahman, D. Panda","doi":"10.1109/ICPP.2015.79","DOIUrl":"https://doi.org/10.1109/ICPP.2015.79","url":null,"abstract":"Hadoop Distributed File System (HDFS) is the underlying storage engine of many Big Data processing frameworks such as Hadoop MapReduce, HBase, Hive, and Spark. Even though HDFS is well-known for its scalability and reliability, its requirement for a large amount of local storage space makes HDFS deployment challenging on HPC clusters. Moreover, HPC clusters usually have large installations of parallel file systems such as Lustre. In this study, we propose a novel design to integrate HDFS with Lustre through a high performance key-value store. We design a burst buffer system using RDMA-based Memcached and present three schemes to integrate HDFS with Lustre through this buffer layer, considering different aspects of I/O, data-locality, and fault-tolerance. Our proposed schemes can ensure performance improvement for Big Data applications on HPC clusters. At the same time, they lead to reduced local storage requirement. Performance evaluations show that our design can improve the write performance of TestDFSIO by up to 2.6x over HDFS and 1.5x over Lustre. The gain in read throughput is up to 8x. Sort execution time is reduced by up to 28% over Lustre and 19% over HDFS. Our design can also significantly benefit I/O-intensive workloads compared to both HDFS and Lustre.","PeriodicalId":423007,"journal":{"name":"2015 44th International Conference on Parallel Processing","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124966517","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}