Jiarui Fang, H. Fu, He Zhang, Wei Wu, N. Dai, L. Gan, Guangwen Yang
{"title":"Optimizing Complex Spatially-Variant Coefficient Stencils for Seismic Modeling on GPU","authors":"Jiarui Fang, H. Fu, He Zhang, Wei Wu, N. Dai, L. Gan, Guangwen Yang","doi":"10.1109/ICPADS.2015.86","DOIUrl":"https://doi.org/10.1109/ICPADS.2015.86","url":null,"abstract":"The Explicit Time Evolution (ETE) method is an innovative Finite-Difference (FD) type method for simulating wave propagation in acoustic media with higher spatial and temporal accuracy. However, unlike FD, it is difficult to achieve an efficient GPU design for ETE because of the poor memory access patterns caused by the off-axis points and spatially-variant coefficients. In this paper, we present a set of new optimization strategies for ETE stencils according to the memory hierarchy of NVIDIA GPUs. To handle the problem caused by the complexity of the stencil shapes, we design a one-to-multi updating scheme for shared memory usage. To alleviate the performance loss resulting from the poor memory access pattern of reading spatially-variant coefficients, we propose a stencil decomposition method to reduce un-coalesced global memory access. Based on the state-of-the-art GPU architecture, combined with existing spatial and temporal stencil blocking schemes, we achieve 9.6x and 9.9x speedups compared with a well-tuned 12-core CPU version for 37-point and 73-point ETE stencils, respectively. Compared with a well-tuned MIC version, the best speedups for the two stencil types are 3.7x and 4.7x. 
Our designs lead to an ETE method that is 31.2x faster than the conventional CPU-FD method, making it a practical seismic imaging technology.","PeriodicalId":231517,"journal":{"name":"2015 IEEE 21st International Conference on Parallel and Distributed Systems (ICPADS)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-12-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129269762","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Luka Stanisic, E. Agullo, A. Buttari, A. Guermouche, Arnaud Legrand, Florent Lopez, B. Videau
{"title":"Fast and Accurate Simulation of Multithreaded Sparse Linear Algebra Solvers","authors":"Luka Stanisic, E. Agullo, A. Buttari, A. Guermouche, Arnaud Legrand, Florent Lopez, B. Videau","doi":"10.1109/ICPADS.2015.67","DOIUrl":"https://doi.org/10.1109/ICPADS.2015.67","url":null,"abstract":"The ever-growing complexity and scale of parallel architectures make it necessary to rewrite classical monolithic HPC scientific applications and libraries, as their portability and performance optimization only come at a prohibitive cost. There is thus a recent and general trend toward using a modular approach instead, where numerical algorithms are written at a high level, independently of the hardware architecture, as Directed Acyclic Graphs (DAGs) of tasks. A task-based runtime system then dynamically schedules the resulting DAG on the different computing resources, automatically taking care of data movement and taking into account the possible speed heterogeneity and variability. Evaluating the performance of such complex and dynamic systems is extremely challenging, especially for irregular codes. In this article, we explain how we crafted a faithful simulation, both in terms of performance and memory usage, of the behavior of qr_mumps, a fully-featured sparse linear algebra library, on multi-core architectures. In our approach, the target high-end machines are calibrated only once to derive sound performance models. 
These models can then be used at will to quickly predict and study in a reproducible way the performance of such irregular and resource-demanding applications using solely a commodity laptop.","PeriodicalId":231517,"journal":{"name":"2015 IEEE 21st International Conference on Parallel and Distributed Systems (ICPADS)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-12-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131222468","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Q. Zheng, Jia Li, B. Dong, R. Li, N. Shah, Feng Tian
{"title":"Multi-objective Optimization Algorithm Based on BBO for Virtual Machine Consolidation Problem","authors":"Q. Zheng, Jia Li, B. Dong, R. Li, N. Shah, Feng Tian","doi":"10.1109/ICPADS.2015.59","DOIUrl":"https://doi.org/10.1109/ICPADS.2015.59","url":null,"abstract":"Cloud computing is a promising technology with the ability to change how computing and storage resources are provisioned through virtual machines (VMs). VM consolidation is an efficient way to improve power efficiency and quality guarantees for on-demand services. However, it is an integer programming problem as well as an NP-hard problem, which makes it difficult to find optimal solutions within polynomial time. In this paper, the VM consolidation problem is formulated as a multi-objective optimization problem with three conflicting objectives: reducing power consumption, achieving good load balancing, and shortening VM migration time. We propose a multi-objective optimization algorithm based on biogeography-based optimization (BBO) for the VM consolidation problem, named MBBO/DE: Multi-objective Biogeography-Based Optimization algorithm hybridized with Differential Evolution. It uses a cosine migration model, differential strategies, and a Gaussian mutation model to improve the quality of habitats and the ability to find optimal solutions. Experiments have been conducted to evaluate the effectiveness of MBBO/DE using synthetic and real-world instances. 
Experimental results show that MBBO/DE obtains a better performance while simultaneously reducing power consumption and achieving good load balancing within a satisfactory time as compared to genetic algorithm (GA), differential evolution (DE), ant colony optimization (ACO) and BBO.","PeriodicalId":231517,"journal":{"name":"2015 IEEE 21st International Conference on Parallel and Distributed Systems (ICPADS)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-12-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114629258","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Service-Oriented Mobile Cloud Middleware Framework for Provisioning Mobile Sensing as a Service","authors":"Chii Chang, S. Srirama, M. Liyanage","doi":"10.1109/ICPADS.2015.24","DOIUrl":"https://doi.org/10.1109/ICPADS.2015.24","url":null,"abstract":"Emerging Mobile Phone Sensing (M-Sense) systems enable a flexible, large-scale wireless sensing capability and also reduce the need to establish Wireless Sensor Network infrastructure for collecting sensory information in Internet of Things applications. M-Sense has been applied in numerous scenarios including mobile-health systems, environmental monitoring, vehicular ad hoc networks, mobile social networks, and so on. The drawbacks of existing M-Sense systems in terms of privacy, trust, and inefficient participation in multiple sensing networks have motivated the next-generation sensing service provisioning approach. This paper introduces a generic service-oriented Mobile Host Sensing as a Service provisioning framework that allows a mobile device to provide sensing data to multiple parties based on mobile Web services. The proposed framework consists of a hybrid workflow-based control system, a dynamic Utility Cloud service, and a service provisioning scheduling model to enhance the quality of service provisioning. 
The prototype has been tested on real mobile devices and the details of the performance evaluation are presented.","PeriodicalId":231517,"journal":{"name":"2015 IEEE 21st International Conference on Parallel and Distributed Systems (ICPADS)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-12-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114655182","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Realizing Extremely Large-Scale Stencil Applications on GPU Supercomputers","authors":"Toshio Endo, Y. Takasaki, S. Matsuoka","doi":"10.1109/ICPADS.2015.84","DOIUrl":"https://doi.org/10.1109/ICPADS.2015.84","url":null,"abstract":"The problem of deepening memory hierarchy towards exascale is becoming serious for applications such as those based on stencil kernels, as it is difficult to satisfy both high memory bandwidth and capacity requirements simultaneously. This is evident even today, where problem sizes of stencil-based applications on GPU supercomputers are limited by the aggregate capacity of GPU device memory. Locality improvement techniques such as temporal blocking are known to preserve performance, but integrating such a technique into existing stencil applications results in substantially higher programming cost, especially for complex applications, and as a result these techniques are not typically used. We alleviate this problem with a run-time GPU-MPI process virtualization library we call HHRT that automates data movement across the memory hierarchy, and a systematic methodology to convert and optimize the code to accommodate temporal blocking. 
The proposed methodology has been shown to significantly ease the adaptation of real applications, such as a whole-city airflow simulator comprising more than 12,000 lines of code; with careful tuning, we successfully maintain up to 85% performance even with problems whose footprint is four times larger than GPU device memory capacity, and scale to hundreds of GPUs on the TSUBAME2.5 supercomputer.","PeriodicalId":231517,"journal":{"name":"2015 IEEE 21st International Conference on Parallel and Distributed Systems (ICPADS)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-12-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121250122","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dawei Zang, Zheng Cao, Xiaoli Liu, L. Wang, Zhan Wang, Ninghui Sun
{"title":"PROP: Using PCIe-Based RDMA to Accelerate Rack-Scale Communications in Data Centers","authors":"Dawei Zang, Zheng Cao, Xiaoli Liu, L. Wang, Zhan Wang, Ninghui Sun","doi":"10.1109/ICPADS.2015.65","DOIUrl":"https://doi.org/10.1109/ICPADS.2015.65","url":null,"abstract":"In order to reduce the bandwidth demands on the core layer network, data center operators usually assign tasks of the same job to servers that are located in the same rack, with the result that 80% of the traffic originating from servers remains in the same rack. As a result, providing sufficient network capacity inside racks becomes critical to the Quality-of-Service of current data center applications. In this paper, we propose PROP, a novel hybrid network architecture which leverages PCIe-based RDMA to reinforce rack-scale connectivity in data centers. In our design, intra-rack bulk data transfers are accelerated by a dedicated high-bandwidth PCIe-compliant network, complemented by the existing Ethernet network. In addition, we develop proprietary PCIe-based RDMA hardware that allows the servers in the same rack to exchange data in main memory without involving the operating system and the processors. We also implement a software stack to enable existing socket-based applications to transparently utilize the proposed dedicated network system. 
As a preliminary stage, this paper focuses on exploiting the unique design point and implements an FPGA-based prototype to validate the technical feasibility of the proposed architecture.","PeriodicalId":231517,"journal":{"name":"2015 IEEE 21st International Conference on Parallel and Distributed Systems (ICPADS)","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-12-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121427141","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CER-IOS: Internal Resource Utilization Optimized I/O Scheduling for Solid State Drives","authors":"Xuchao Xie, Dengping Wei, Qiong Li, Zhenlong Song, Liquan Xiao","doi":"10.1109/ICPADS.2015.50","DOIUrl":"https://doi.org/10.1109/ICPADS.2015.50","url":null,"abstract":"Modern Solid State Drives (SSDs) integrate more internal resources to get higher performance and capacity. Improving internal resource utilization by exploiting internal parallelism is important to enhance the performance of SSDs. Unfortunately, in practice the internal resource utilization of SSDs is limited at runtime because of access conflicts over internal resources. In this paper, we propose a Conflict Eliminated Requests Based I/O Scheduler (CER-IOS) to better utilize the internal parallelism of flash chips by scheduling I/O requests in a more fine-grained way. We introduce Conflict Eliminated Requests (CERs), in which parallelizable memory requests are grouped during the process of address translation in the Flash Translation Layer. To schedule conflicting requests, we propose a resource distribution scheme that prioritizes small CER sizes, ensuring that internal resources can always be distributed to valuable conflicting requests to further improve the efficiency of resource utilization. 
Our extensive experimental evaluation shows that CER-IOS significantly improves resource utilization at runtime and substantially reduces average I/O latency compared to state-of-the-art I/O schedulers implemented in operating systems.","PeriodicalId":231517,"journal":{"name":"2015 IEEE 21st International Conference on Parallel and Distributed Systems (ICPADS)","volume":"494 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-12-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123892481","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"OutSense: Out-of-Band Sensing with ZigBee Sensors for Channel Adaptation in Wireless LANs","authors":"Yanmin Zhu, Lubin Liu, Juan Li, Jiadi Yu, C. Long","doi":"10.1109/ICPADS.2015.20","DOIUrl":"https://doi.org/10.1109/ICPADS.2015.20","url":null,"abstract":"Wireless local area networks (WLANs) are pervasive but crowded nowadays. It is of great importance for access points (APs) to adapt to the changing traffic conditions. Existing approaches for channel selection largely rely on local channel assessment and adopt greedy selection strategies. They suffer from a major limitation: an AP fails to take the various traffic demands of clients into account. We have witnessed that wireless sensor networks are increasingly deployed everywhere. A ZigBee sensor operates on the 2.4 GHz radio spectrum, which overlaps the spectrum used by most WiFi APs. As a result, a ZigBee sensor is able to sense the traffic of different AP channels. Motivated by this important observation, we present the design, implementation and evaluation of OutSense, a system that enables APs to take the traffic volumes of clients into account. It makes use of channel utilization sensed by ZigBee sensors and allows an AP to select a channel with good performance. The salient feature of OutSense is that it exploits in-situ ZigBee sensors for APs to quickly adapt to short-term traffic variations (e.g., on the order of minutes). We have fully implemented OutSense on Telos B sensor nodes and off-the-shelf APs. 
Extensive experiments have been conducted and conclusive results demonstrate that OutSense effectively improves overall WLAN performance.","PeriodicalId":231517,"journal":{"name":"2015 IEEE 21st International Conference on Parallel and Distributed Systems (ICPADS)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-12-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124221537","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FlashLite: A High Performance Machine for Data Intensive Science","authors":"D. Abramson","doi":"10.1109/ICPADS.2015.17","DOIUrl":"https://doi.org/10.1109/ICPADS.2015.17","url":null,"abstract":"Data is predicted to transform the 21st century, fuelled by an exponential growth in the amount of data captured, generated and archived. Traditional high performance machines are optimized for numerical computing rather than IO performance or for supporting large memory applications. This paper discusses a new machine, called FlashLite, which addresses these challenges. The paper describes the motivation for the design, and discusses some driving application themes.","PeriodicalId":231517,"journal":{"name":"2015 IEEE 21st International Conference on Parallel and Distributed Systems (ICPADS)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-12-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122398306","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Kefeng Deng, Kaijun Ren, Shaowei Liu, Junqiang Song
{"title":"DAG Scheduling for Heterogeneous Systems Using Biogeography-Based Optimization","authors":"Kefeng Deng, Kaijun Ren, Shaowei Liu, Junqiang Song","doi":"10.1109/ICPADS.2015.94","DOIUrl":"https://doi.org/10.1109/ICPADS.2015.94","url":null,"abstract":"An efficient scheduling algorithm is critical for DAG-based applications to obtain high performance in heterogeneous computing systems. In comparison with heuristic-based algorithms, meta-heuristic based scheduling algorithms can produce better results by searching in a guided manner. Biogeography-based optimization (BBO) is a recently proposed optimization technique which has been shown to have fewer parameters, faster convergence, and better performance than existing meta-heuristics. In this article, we introduce this novel optimization technique into the field of DAG scheduling. To reduce scheduling overhead, the proposed algorithm only encodes task mapping while using a heuristic strategy to determine task ordering. Moreover, it uses heuristic-based algorithms as baseline algorithms to obtain better results. We evaluate the BBO-based scheduling algorithm using three real-world DAG-based applications under various parameter settings. The results show that the BBO-based scheduling algorithm outperforms the state-of-the-art meta-heuristic based algorithms.","PeriodicalId":231517,"journal":{"name":"2015 IEEE 21st International Conference on Parallel and Distributed Systems (ICPADS)","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-12-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122822091","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}