{"title":"Message from the HCW General Chair","authors":"U. Schwiegelshohn","doi":"10.1109/IPDPSW.2014.205","DOIUrl":"https://doi.org/10.1109/IPDPSW.2014.205","url":null,"abstract":"In these proceedings you will find the papers corresponding to the presentations given at the 23rd International Heterogeneity in Computing Workshop (HCW 2014), held on May 19 in Phoenix, Arizona. Recent technological progress has generated various alternatives for the design and use of computer systems. While previous decades saw the end of many interesting computer architectures, resulting in a reduction of heterogeneity, we now experience an increase in variety and heterogeneity at all levels of a computer system. We observe that heterogeneity has become an important feature of modern computer systems, providing many opportunities to increase system performance. But exploiting these opportunities is challenging and requires significant research effort. Therefore, the topic of the 23rd International Heterogeneity in Computing Workshop is at least as current as it was more than twenty years ago. In addition to designing applications that can use existing heterogeneity, we also face a more difficult resource-management challenge than in homogeneous systems. Finally, we must reliably determine the performance of components and subsystems in such heterogeneous systems. 
As these issues are all present in the program of this workshop, I am sure that the papers in these proceedings will provide new and interesting information as well as stimulate new research.","PeriodicalId":153864,"journal":{"name":"2014 IEEE International Parallel & Distributed Processing Symposium Workshops","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127003062","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Using Physical Stigmergy in Decentralized Optimization under Multiple Non-separable Constraints: Formal Methods and an Intelligent Lighting Example","authors":"Theodore P. Pavlic","doi":"10.1109/IPDPSW.2014.52","DOIUrl":"https://doi.org/10.1109/IPDPSW.2014.52","url":null,"abstract":"In this paper, a distributed asynchronous algorithm for intelligent lighting is presented that minimizes collective power use while meeting multiple user lighting constraints simultaneously, and that requires very little communication among the agents participating in the distributed computation. Consequently, the approach is arbitrarily scalable, adapts to exogenous disturbances, and is robust to failures of individual agents. This algorithm is an example of a decentralized primal-space algorithm for constrained non-linear optimization that achieves coordination between agents using stigmergic memory cues present in the physical system rather than explicit communication and synchronization. Not only does this work make use of stigmergy, a concept first used to describe decentralized decision making in eusocial insects, but details of the algorithm are inspired by classic social foraging theory and more recent results in eusocial-insect macronutrient regulation. The theoretical analysis in this paper guarantees that the decentralized, stigmergically coupled system converges to within a finite neighborhood of the optimal resource allocation. These results are validated using a hardware implementation of the algorithm in a small-scale intelligent lighting scenario. 
Other real-time distributed resource allocation applications, such as distributed power generation, are also amenable to these methods; more generally, this paper provides a proof of concept that physical variables in cyber-physical systems can be leveraged to reduce the communication burden of algorithms.","PeriodicalId":153864,"journal":{"name":"2014 IEEE International Parallel & Distributed Processing Symposium Workshops","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127788638","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Task Scheduling Algorithm Based on Replication for Maximizing Reliability on Heterogeneous Computing Systems","authors":"Shuli Wang, Kenli Li, Jing Mei, Kuan-Ching Li, Yan Wang","doi":"10.1109/IPDPSW.2014.175","DOIUrl":"https://doi.org/10.1109/IPDPSW.2014.175","url":null,"abstract":"Over the past several years, heterogeneous computing (HC) systems have become more competitive as commercial computing platforms than homogeneous systems. With the growing scale of HC systems, network failures become inevitable. To achieve high performance, communication reliability should be considered when designing reliability-aware task scheduling algorithms. In this paper, we propose a new algorithm called RMSR (Replication-based scheduling for Maximizing System Reliability), which incorporates task communication into system reliability. To maximize communication reliability, an improved algorithm is proposed that searches all optimal-reliability communication paths for the current tasks. During the task replication phase, the task reliability threshold is determined by users and each task has a dynamic number of replicas. Our comparative studies based on randomly generated graphs show that the RMSR algorithm outperforms existing scheduling algorithms in terms of system reliability. Several factors affecting the performance are analyzed in the paper.","PeriodicalId":153864,"journal":{"name":"2014 IEEE International Parallel & Distributed Processing Symposium Workshops","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121880498","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Analyzing Reliability of Virtual Machine Instances with Dynamic Pricing in the Public Cloud","authors":"Seung-Hwan Lim, Gautam Thakur, James L. Horey","doi":"10.1109/IPDPSW.2014.101","DOIUrl":"https://doi.org/10.1109/IPDPSW.2014.101","url":null,"abstract":"This study presents a reliability analysis of virtual machine instances in public cloud environments in the face of dynamic pricing. In contrast to traditional fixed pricing, dynamic pricing allows the price to fluctuate over arbitrary periods of time according to external factors such as supply and demand, excess capacity, etc. This pricing option introduces a new type of fault: virtual machine instances may be unexpectedly terminated due to a conflict between the original bid price and the currently offered price. This new class of fault under dynamic pricing may be more dominant than traditional faults in cloud computing environments, where resource availability with respect to traditional faults is often above 99.9%. To address and understand this new type of fault, we translated two classic reliability metrics, mean time between failures and availability, to the Amazon Web Services spot market using historical price data. We also validated our findings by submitting actual bids in the spot market. We found that, overall, our historical analysis and experimental validation lined up well. 
Based upon these experimental results, we also provided suggestions and techniques to maximize overall reliability of virtual machine instances under dynamic pricing.","PeriodicalId":153864,"journal":{"name":"2014 IEEE International Parallel & Distributed Processing Symposium Workshops","volume":"221 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122523054","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Transparent GPU Execution of NumPy Applications","authors":"Troels Blum, M. R. B. Kristensen, B. Vinter","doi":"10.1109/IPDPSW.2014.114","DOIUrl":"https://doi.org/10.1109/IPDPSW.2014.114","url":null,"abstract":"In this work, we present a back-end for the Python library NumPy that utilizes the GPU seamlessly. We use dynamic code generation to generate kernels, and data is moved transparently to and from the GPU. For the integration into NumPy, we use the Bohrium runtime system. Bohrium hooks into NumPy through the implicit data parallelization of array operations; this approach requires no annotations or other code modifications. The key motivation for our GPU computation back-end is to transform high-level Python/NumPy applications into low-level GPU-executable kernels, with the goal of obtaining high performance, high productivity, and high portability (HP3). We provide a performance study of the GPU back-end that includes four well-known benchmark applications, Black-Scholes, Successive Over-relaxation, Shallow Water, and N-body, implemented in pure Python/NumPy. We demonstrate an impressive 834-times speedup for the Black-Scholes application, and an average speedup of 124 times across the four benchmarks.","PeriodicalId":153864,"journal":{"name":"2014 IEEE International Parallel & Distributed Processing Symposium Workshops","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122594444","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bohrium: A Virtual Machine Approach to Portable Parallelism","authors":"M. R. B. Kristensen, S. Lund, Troels Blum, K. Skovhede, B. Vinter","doi":"10.1109/IPDPSW.2014.44","DOIUrl":"https://doi.org/10.1109/IPDPSW.2014.44","url":null,"abstract":"In this paper we introduce Bohrium, a runtime system for mapping vector operations onto a number of different hardware platforms, from simple multi-core systems to clusters and GPU-enabled systems. In order to make efficient choices, Bohrium is implemented as a virtual machine that makes runtime decisions, rather than as a statically compiled library, which is the more common approach. In principle, Bohrium can be used with any programming language, but for now the supported languages are limited to Python, C++, and the .NET framework (e.g., C# and F#). The primary success criteria are to maintain a complete abstraction from low-level details and to provide efficient code execution across different current and future processors. We evaluate the presented design through a setup that targets a multi-core CPU, an eight-node cluster, and a GPU, all preliminary prototypes. The evaluation includes three well-known benchmark applications, Black-Scholes, Shallow Water, and N-body, implemented in C++, Python, and C# respectively.","PeriodicalId":153864,"journal":{"name":"2014 IEEE International Parallel & Distributed Processing Symposium Workshops","volume":"357 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122806636","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SupMR: Circumventing Disk and Memory Bandwidth Bottlenecks for Scale-up MapReduce","authors":"Michael Sevilla, I. Nassi, Kleoni Ioannidou, S. Brandt, C. Maltzahn","doi":"10.1109/IPDPSW.2014.168","DOIUrl":"https://doi.org/10.1109/IPDPSW.2014.168","url":null,"abstract":"Reading input from primary storage (i.e. the ingest phase) and aggregating results (i.e. the merge phase) are important pre- and post-processing steps in large batch computations. Unfortunately, today's data sets are so large that the ingest and merge job phases are now performance bottlenecks. In this paper, we mitigate the ingest and merge bottlenecks by leveraging the scale-up MapReduce model. We introduce an ingest chunk pipeline and a merge optimization that increases CPU utilization (50-100%) and job phase speedups (1.16× - 3.13×) for the ingest and merge phases. Our techniques are based on well-known algorithms and scale-out MapReduce optimizations, but applying them to a scale-up computation framework to mitigate the ingest and merge bottlenecks is novel.","PeriodicalId":153864,"journal":{"name":"2014 IEEE International Parallel & Distributed Processing Symposium Workshops","volume":"70 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121682587","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SmartBricks: A Visual Environment to Design and Explore Novel Custom Domain-Specific Architectures","authors":"Anil Kumar Sistla, Xiaozhong Luo, Mukund Malladi, M. Reisner, Rajasekhar Ganduri, Gayatri Mehta","doi":"10.1109/IPDPSW.2014.22","DOIUrl":"https://doi.org/10.1109/IPDPSW.2014.22","url":null,"abstract":"Custom domain-specific architectures are very promising for creating designs that are highly optimized to the needs of a particular application domain. However, it is extremely difficult to find optimal tradeoffs when designing a new architecture, or even to fully understand the design space. Therefore, there is a great need for a design framework that allows designers to explore the design space efficiently and quickly identify efficient architectures for an application domain. In this paper, we describe SmartBricks, a highly visual design environment that we have developed for designing and exploring custom domain-specific architectures quickly and efficiently. This game-like design environment will be accessible to a broad community so that even non-engineers and non-scientists can contribute to building and exploring out-of-the-box architectural designs.","PeriodicalId":153864,"journal":{"name":"2014 IEEE International Parallel & Distributed Processing Symposium Workshops","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121821024","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MTAAP Introduction and Committees","authors":"L. DeRose","doi":"10.1109/IPDPSW.2014.225","DOIUrl":"https://doi.org/10.1109/IPDPSW.2014.225","url":null,"abstract":"Multithreading (MT) programming and execution models, as well as Many Integrated Core (MIC) and hybrid programming with accelerated architectures, are now part of the high-end and mainstream computing scene. This trend has been driven by the need to increase processor utilization and deal with the memory-processor speed gap. Recent and upcoming examples of architectures and processors that fit this profile are Cray's XK and XMT, NVIDIA Kepler, Intel Phi, IBM Cyclops, and several SMT processors from IBM (Power7), AMD, and Intel, as well as heterogeneous clusters with accelerators from AMD and NVIDIA. The underlying rationale for increasing processor utilization is a varying mix of new metrics that take performance improvements as well as better power and cost budgeting into account. Yet it remains a challenge to identify applications for these architectures and program them productively while obtaining substantial performance improvements.","PeriodicalId":153864,"journal":{"name":"2014 IEEE International Parallel & Distributed Processing Symposium Workshops","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124996062","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"WECPAR: List Ranking Algorithm and Relative Computational Power","authors":"H. M. El-Boghdadi","doi":"10.1109/IPDPSW.2014.78","DOIUrl":"https://doi.org/10.1109/IPDPSW.2014.78","url":null,"abstract":"Reconfigurable models have been shown to be very powerful, solving many problems faster than non-reconfigurable models. WECPAR W(M,N,k) is an M × N reconfigurable model that has point-to-point reconfigurable interconnection with k wires between neighboring processors. This paper studies several aspects of WECPAR. We first solve the list ranking problem on WECPAR. Some of the results obtained show that ranking one element in a list of N elements can be solved on W(N,N,N) WECPAR in O(1) time. Also, on W(N,N,k), ranking a list L(N) of N elements can be done in O((log N)⌈log_{k+1} N⌉) time. To transfer a large body of algorithms to work on WECPAR and to assess its relative computational power, several simulation algorithms are introduced between WECPAR and well-known models such as PRAM and RMBM. These simulations show that a PRIORITY CRCW PRAM of N processors and S shared memory locations can be simulated by a W(S, N, k) WECPAR in O(⌈log_{k+1} N⌉ + ⌈log_{k+1} S⌉) time. Also, we show that a PRIORITY CRCW Basic-RMBM(P,B), of P processors and B buses, can be simulated by a W(B, P+B, k) WECPAR in O(⌈log_{k+1}(P+B)⌉) time. 
This makes it possible to migrate a large number of algorithms to run directly on WECPAR, at the cost of only the simulation overhead.","PeriodicalId":153864,"journal":{"name":"2014 IEEE International Parallel & Distributed Processing Symposium Workshops","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125001533","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}