P. Igounet, E. Dufrechou, M. Pedemonte, P. Ezzatti
{"title":"A Study on Mixed Precision Techniques for a GPU-based SIP Solver","authors":"P. Igounet, E. Dufrechou, M. Pedemonte, P. Ezzatti","doi":"10.1109/WAMCA.2012.17","DOIUrl":"https://doi.org/10.1109/WAMCA.2012.17","url":null,"abstract":"This article presents the study and application of mixed precision techniques to accelerate a GPU-based implementation of the Strongly Implicit Procedure (SIP) to solve hepta-diagonal linear systems. In particular, two different options to incorporate mixed precision in the GPU implementation are discussed and one of them is implemented. The experimental evaluation of our proposal demonstrates that a runtime similar to a single precision implementation on GPU can be attained, but achieving a numerical accuracy comparable to double precision arithmetic.","PeriodicalId":288438,"journal":{"name":"2012 Third Workshop on Applications for Multi-Core Architecture","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131284119","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Alexandre Sardinha, Tiago A. O. Alves, L. A. J. Marzulo, Felipe M. G. França, Valmir C. Barbosa, Vítor Santos Costa
{"title":"Scheduling Cyclic Task Graphs with SCC-Map","authors":"Alexandre Sardinha, Tiago A. O. Alves, L. A. J. Marzulo, Felipe M. G. França, Valmir C. Barbosa, Vítor Santos Costa","doi":"10.1109/WAMCA.2012.8","DOIUrl":"https://doi.org/10.1109/WAMCA.2012.8","url":null,"abstract":"The Dataflow execution model has been shown to be a good way of exploiting TLP, making parallel programming easier. In this model, tasks must be mapped to processing elements (PEs) considering the trade-off between communication and parallelism. Previous work on scheduling dependency graphs has mostly focused on directed acyclic graphs, which are not suitable for dataflow (loops in the code become cycles in the graph). Thus, we present the SCC-Map: a novel static mapping algorithm that considers the importance of cycles during the mapping process. To validate our approach, we ran a set of benchmarks on our dataflow simulator varying the communication latency, the number of PEs in the system and the placement algorithm. Our results show that the benchmark programs run significantly faster when mapped with SCC-Map. Moreover, we observed that SCC-Map is more effective than the other mapping algorithms when communication latency is higher.","PeriodicalId":288438,"journal":{"name":"2012 Third Workshop on Applications for Multi-Core Architecture","volume":"75 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132391359","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A High-Level Implementation of STM Haskell with Write/Write Conflict Detection","authors":"A. D. Du Bois, M. Pilla, R. M. Duarte","doi":"10.1109/WAMCA.2012.9","DOIUrl":"https://doi.org/10.1109/WAMCA.2012.9","url":null,"abstract":"This paper describes a high-level implementation of Software Transactional Memory (STM) for the Haskell language. The library is implemented completely in Haskell and, as opposed to all other implementations of STM Haskell, it features early detection of write/write conflicts. Preliminary performance measurements using the Haskell STM benchmark show that the library performs much better than a TL2 implementation written in Haskell, and performs reasonably well compared to the current implementation of STM Haskell written in C.","PeriodicalId":288438,"journal":{"name":"2012 Third Workshop on Applications for Multi-Core Architecture","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116685934","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
I. M. Coelho, Matheus Nohra Haddad, L. S. Ochi, M. Souza, R. Farias
{"title":"A Hybrid CPU-GPU Local Search Heuristic for the Unrelated Parallel Machine Scheduling Problem","authors":"I. M. Coelho, Matheus Nohra Haddad, L. S. Ochi, M. Souza, R. Farias","doi":"10.1109/WAMCA.2012.16","DOIUrl":"https://doi.org/10.1109/WAMCA.2012.16","url":null,"abstract":"This work addresses the development of a hybrid CPU-GPU local search heuristic for the unrelated parallel machine scheduling problem. In this scheduling problem, setup times are sequence-dependent and also machine-dependent. The objective is to minimize the maximum completion time of the schedule, known as the makespan. Since the problem belongs to the NP-hard class, there is no known polynomial time algorithm to solve it, so metaheuristics and local search heuristics are usually developed to find good near-optimal solutions. In general, the local search is the most expensive part of the heuristic method, so our algorithm harnesses the tremendous computing power of the GPU to decrease the local search computational time. We use the local search based on swapping jobs in different machines, since it is able to find good near-optimal solutions, as reported in previous results in the literature. We show that the hybrid CPU-GPU local search achieves average speedups from 10 to 27 times in relation to the pure CPU local search.","PeriodicalId":288438,"journal":{"name":"2012 Third Workshop on Applications for Multi-Core Architecture","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121355163","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Novel Virtual Channel Implementation Technique for Multi-core On-chip Communication","authors":"Masoud Oveis Gharan, G. Khan","doi":"10.1109/WAMCA.2012.12","DOIUrl":"https://doi.org/10.1109/WAMCA.2012.12","url":null,"abstract":"In this paper, a new approach for implementing virtual channels (VC) for multi-core interconnection networks is presented. In this approach, the flits of different packets interleave in a channel with a single buffer of nominal depth by using a rotating flit-by-flit arbitration. The routing path of each flit is guaranteed because the flits belonging to the same packet are attached with an ID tag at each router so that they are differentiable at downstream routers. We present this on-chip communication of packets through sharing of channel and buffer, which is a novel method of virtual channel implementation. Furthermore, we demonstrate it by adding arbitrary virtual channels depending on the number of packet requests for a physical channel. In this way, NoC (Network-on-Chip) contention can be removed cheaply. Moreover, we discuss contention free communication where the depth of shared buffer does not affect the performance. A contention-free communication with small (one) buffer depth can create an efficient on-chip communication with high performance, small chip area and low power consumption.","PeriodicalId":288438,"journal":{"name":"2012 Third Workshop on Applications for Multi-Core Architecture","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122926703","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimal Virtual Channel Insertion for Contention Alleviation and Deadlock Avoidance in Custom NoCs","authors":"A. Tino, G. Khan","doi":"10.1109/WAMCA.2012.11","DOIUrl":"https://doi.org/10.1109/WAMCA.2012.11","url":null,"abstract":"Deadlock and contention can be avoided in an NoC architecture by employing virtual channels (VC). VC insertion can result in power and chip area increases with little performance improvements. We present a novel VC insertion technique for deadlock avoidance and contention relief in irregular NoC architectures that avoids significant power and area increase. Given a resource pool of VCs, deadlock/contention analytical models, and a systematic pre-evaluation technique, minimal VC resources are inserted resulting in higher performance. Several experiments are conducted on various SoC benchmark applications. The results of our technique indicate an average performance improvement of 21%, 32.4% decrease in power dissipation and 79.5% resource savings as compared to past techniques.","PeriodicalId":288438,"journal":{"name":"2012 Third Workshop on Applications for Multi-Core Architecture","volume":"96 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125987398","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
M. Doost, S. M. Sadjadi, J. R. Da Silva, M. Zamith, M. Joselli, E. Clua
{"title":"Architecture of Request Distributor for GPU Clusters","authors":"M. Doost, S. M. Sadjadi, J. R. Da Silva, M. Zamith, M. Joselli, E. Clua","doi":"10.1109/WAMCA.2012.15","DOIUrl":"https://doi.org/10.1109/WAMCA.2012.15","url":null,"abstract":"The advent of GPU computing has enabled the development of many strategies for accelerating different kinds of simulations. Furthermore, instead of processing an application on just one GPU, it is common to use a collection of GPUs. These GPUs can be located in the same machine, in the same network, or even across a wide area network. Unfortunately, the distribution and management of GPUs requires additional effort from the user, such as dealing with data transfer, connection and processing among GPUs. Request distributor for GPU clusters (RDGPUC) is a software architecture which allows companies, institutes and other users to share their GPU resources. By using this architecture, each cluster can have its own software to manage internal resources, and its owners only need to develop a small amount of code to interact with RDGPUC. This novel design brings flexibility to the system and allows everyone to share their resources without needing to change their GPU cluster tool. Another interesting feature of the system is that it allows users to submit requests from all kinds of devices and platforms. The administrator of this system is able to specify resource groups and special schedules for using resources. On the other hand, end-users can just use a simple interface to submit their requests on RDGPUC without knowing about the internal design and current status of the GPU clusters.","PeriodicalId":288438,"journal":{"name":"2012 Third Workshop on Applications for Multi-Core Architecture","volume":"99 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125970713","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
E. Padoin, D. A. G. de Oliveira, P. Velho, P. Navaux
{"title":"Time-to-Solution and Energy-to-Solution: A Comparison between ARM and Xeon","authors":"E. Padoin, D. A. G. de Oliveira, P. Velho, P. Navaux","doi":"10.1109/WAMCA.2012.10","DOIUrl":"https://doi.org/10.1109/WAMCA.2012.10","url":null,"abstract":"Most High Performance Computing (HPC) systems today are known as \"power hungry\" because they aim at computing speed regardless of energy consumption. Some scientific applications still demand more speed, and the community expects to reach exascale by the end of the decade. Nevertheless, to reach exascale we need to search for alternatives to cope with energy constraints. A promising step forward in this direction is the usage of low power processors such as ARM. ARM processors target low power consumption, in contrast with Xeon processors, which are conventional in HPC and aim at computing speed. This paper presents a comparison between ARM and Xeon to evaluate whether ARM is the future building block for HPC. We choose to use time-to-solution, peak power, and energy-to-solution to evaluate both processors from the user's perspective. The results indicate that although ARM has lower peak power, Xeon still has a better tradeoff from the user's point of view.","PeriodicalId":288438,"journal":{"name":"2012 Third Workshop on Applications for Multi-Core Architecture","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127734411","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Autotuning Wavefront Abstractions for Heterogeneous Architectures","authors":"S. Mohanty, M. Cole","doi":"10.1109/WAMCA.2012.14","DOIUrl":"https://doi.org/10.1109/WAMCA.2012.14","url":null,"abstract":"We present our autotuned heterogeneous parallel programming abstraction for the wavefront pattern. An exhaustive search of the tuning space indicates that correct setting of tuning factors can average 37x speedup over a sequential baseline. Our best automated machine learning based heuristic obtains 92% of this ideal speedup, averaged across our full range of wavefront examples.","PeriodicalId":288438,"journal":{"name":"2012 Third Workshop on Applications for Multi-Core Architecture","volume":"96 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115711474","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}