J. Vasconcellos, E. Cáceres, H. Mongelli, S. W. Song
{"title":"A Parallel Algorithm for Minimum Spanning Tree on GPU","authors":"J. Vasconcellos, E. Cáceres, H. Mongelli, S. W. Song","doi":"10.1109/SBAC-PADW.2017.20","DOIUrl":"https://doi.org/10.1109/SBAC-PADW.2017.20","url":null,"abstract":"Computing a minimum spanning tree (MST) of a graph is a fundamental problem in Graph Theory and arises as a subproblem in many applications. In this paper, we propose a parallel MST algorithm and implement it on a GPU (Graphics Processing Unit). One of the steps of previous parallel MST algorithms is a heavy use of parallel list ranking. Besides the fact that list ranking is present in several parallel libraries, it is very time-consuming. Using a different graph decomposition, called strut, we devised a new parallel MST algorithm that does not make use of the list ranking procedure. Based on the BSP/CGM model we proved that our algorithm is correct and it finds the MST after O(log p) iterations (communication and computation rounds). To show that our algorithm has a good performance onreal parallel machines, we have implemented it on GPU. The way that we have designed the parallel algorithm allowed us to exploit the computing power of the GPU. The efficiency of the algorithm was confirmed by our experimental results. The tests performed show that, for randomly constructed graphs, with vertex numbers varying from 10,000 to 30,000 and density between 0.02 and 0.2, the algorithm constructs an MST in a maximum of six iterations. 
When the graph is not very sparse, our implementation achieved a speedup of more than 50 (as high as 296 for some instances) over a sequential minimum spanning tree algorithm previously proposed in the literature.","PeriodicalId":325990,"journal":{"name":"2017 International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122308140","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
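The strut decomposition itself is not detailed in this record, but the O(log p) component-merging behavior the abstract describes matches the classic Borůvka pattern, in which every component selects its cheapest outgoing edge each round. A minimal sequential sketch of that pattern, purely as an illustration (this is not the paper's algorithm, and `boruvka_mst` is a hypothetical helper name):

```python
def boruvka_mst(num_vertices, edges):
    """Boruvka-style MST: edges is a list of (weight, u, v) tuples.
    Returns the total weight of the minimum spanning tree."""
    parent = list(range(num_vertices))  # union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    mst_weight, components = 0, num_vertices
    while components > 1:
        # Each component finds its cheapest outgoing edge.
        cheapest = {}
        for w, u, v in edges:
            ru, rv = find(u), find(v)
            if ru != rv:
                for r in (ru, rv):
                    if r not in cheapest or w < cheapest[r][0]:
                        cheapest[r] = (w, u, v)
        if not cheapest:
            break  # graph is disconnected
        # Merge along the selected edges (skipping already-merged pairs).
        for w, u, v in cheapest.values():
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv
                mst_weight += w
                components -= 1
    return mst_weight
```

Because the number of components at least halves per round, the iteration count is logarithmic, which is consistent with the small iteration counts (at most six) reported in the abstract.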
R. Machado, R. Almeida, Andre D. Jardim, A. Pernas, A. Yamin, G. H. Cavalheiro
{"title":"Comparing Performance of C Compilers Optimizations on Different Multicore Architectures","authors":"R. Machado, R. Almeida, Andre D. Jardim, A. Pernas, A. Yamin, G. H. Cavalheiro","doi":"10.1109/SBAC-PADW.2017.13","DOIUrl":"https://doi.org/10.1109/SBAC-PADW.2017.13","url":null,"abstract":"Multithread programming tools become popular for exploitation of high performance processing with the dissemination of multicore processors. In this context, it is also popular to exploit compiler optimization to improve the performance at execution time. In this work, we evaluate the performance achieved by the use of flags -O1, -O2, and -O3 of two C compilers (GCC and ICC) associated with five different APIs: Pthreads, C++11, OpenMP, Cilk Plus, and TBB. The experiments were performed on two distinct but compatible architectures (Intel Xeon and AMD Opteron). In our experiments, the use of optimization improves the performance independently from the API. We observe that the application scheduling performed by the programming interfaces providing an application level scheduler has more impact on the final performance than the optimizations.","PeriodicalId":325990,"journal":{"name":"2017 International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130242086","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Dataflow Implementation of Region Growing Method for Cracks Segmentation","authors":"L. A. J. Marzulo, A. Sena, G. Mota, O. Gomes","doi":"10.1109/SBAC-PADW.2017.22","DOIUrl":"https://doi.org/10.1109/SBAC-PADW.2017.22","url":null,"abstract":"Region growing is an image segmentation algorithm extremely useful for continuous regions extraction. It defines an initial set of seeds, according to a specific criteria, and iteratively aggregates similar neighbor pixels. The algorithm converges when no pixel aggregation is performed in a certain iteration. Within this research project, region growing is employed for the segmentation of cracks in images of ore particles acquired by scanning electron microscopy (SEM). The goal is to help scientists evaluate the efficiency of cracking methods that would improve metal exposure for extraction through heap leaching and bioleaching. However, this is a computational intensive application that could take hours to analyze even a small set of images, if executed sequentially. This paper presents and evaluates a dataflow parallel version of the region growing method for cracks segmentation. The solution employs the Sucuri dataflow library for Python to orchestrate the execution in a computer cluster. Since the application processes images of different sizes and complexity, Sucuri played an important role in balancing load between machines in a transparent way. 
Experimental results show speedups of up to 26.85 in a small cluster with 40 processing cores and 23.75 in a 36-core machine.","PeriodicalId":325990,"journal":{"name":"2017 International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW)","volume":"122 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126866253","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
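The region-growing step the abstract describes (seeds plus iterative aggregation of similar neighbors) can be sketched in a few lines. This is an illustrative single-threaded toy, not the paper's Sucuri-based implementation; `region_grow` and the 4-neighbor tolerance rule are assumptions for the example:

```python
from collections import deque

def region_grow(image, seeds, tol):
    """Grow one region per seed: aggregate 4-connected pixels whose
    value differs from the seed's value by at most `tol`.
    Returns a label grid (0 = unlabeled)."""
    h, w = len(image), len(image[0])
    label = [[0] * w for _ in range(h)]
    for region_id, (sr, sc) in enumerate(seeds, start=1):
        ref = image[sr][sc]
        queue = deque([(sr, sc)])
        while queue:
            r, c = queue.popleft()
            if (0 <= r < h and 0 <= c < w and label[r][c] == 0
                    and abs(image[r][c] - ref) <= tol):
                label[r][c] = region_id
                # Enqueue the 4-neighbors for the next aggregation step.
                queue.extend([(r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)])
    return label
```

Each image is independent, which is what makes the problem a natural fit for the dataflow farm the paper builds: one such grow job per image, dispatched by Sucuri.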
M. Serpa, E. Cruz, M. Diener, Arthur M. Krause, Albert Farrés, C. Rosas, J. Panetta, Mauricio Hanzich, P. Navaux
{"title":"Strategies to Improve the Performance of a Geophysics Model for Different Manycore Systems","authors":"M. Serpa, E. Cruz, M. Diener, Arthur M. Krause, Albert Farrés, C. Rosas, J. Panetta, Mauricio Hanzich, P. Navaux","doi":"10.1109/SBAC-PADW.2017.17","DOIUrl":"https://doi.org/10.1109/SBAC-PADW.2017.17","url":null,"abstract":"Many software mechanisms for geophysics exploration in Oil & Gas industries are based on wave propagation simulation. To perform such simulations, state-of-art HPC architectures are employed, generating results faster and with more accuracy at each generation. The software must evolve to support the new features of each design to keep performance scaling. Furthermore, it is important to understand the impact of each change applied to the software, in order to improve the performance as most as possible. In this paper, we propose several optimization strategies for a wave propagation model for five architectures: Intel Haswell, Intel Knights Corner, Intel Knights Landing, NVIDIA Kepler and NVIDIA Maxwell. We focus on improving the cache memory usage, vectorization, and locality in the memory hierarchy. We analyze the hardware impact of the optimizations, providing insights of how each strategy can improve the performance. 
The results show that NVIDIA Maxwell outperforms Intel Haswell, Intel Knights Corner, Intel Knights Landing, and NVIDIA Kepler by up to 17.9x.","PeriodicalId":325990,"journal":{"name":"2017 International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133145261","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient In-Situ Quantum Computing Simulation of Shor's and Grover's Algorithms","authors":"A. Avila, R. Reiser, A. Yamin, M. Pilla","doi":"10.1109/SBAC-PADW.2017.19","DOIUrl":"https://doi.org/10.1109/SBAC-PADW.2017.19","url":null,"abstract":"Exponential increase and global access to read/write memory states in quantum computing simulation limit both the number of qubits and quantum transformations that can be currently simulated. Although quantum computing simulation is parallel by nature, spatial and temporal complexity are major performance hazards, making this an important application for HPC. A new methodology employing reduction and decomposition optimizations has shown great results, but its GPU implementation could be further improved. In this work, we intend to do a new implementation for in-situ GPU simulation that better explores its resources without requiring further HPC hardware. Shors and Grovers algorithms are simulated and compared to the previous version and to LIQUi|s simulator, showing better results with relative speedups up to 15.5x and 765.76x respectively.","PeriodicalId":325990,"journal":{"name":"2017 International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW)","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114710028","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
M. Souza, T. T. Cota, Matheus M. Queiroz, H. Freitas
{"title":"Energy Consumption Improvement of Shared-Cache Multicore Clusters Based on Explicit Simultaneous Multithreading","authors":"M. Souza, T. T. Cota, Matheus M. Queiroz, H. Freitas","doi":"10.1109/SBAC-PADW.2017.9","DOIUrl":"https://doi.org/10.1109/SBAC-PADW.2017.9","url":null,"abstract":"The use of multicore clusters is one of the strategies used to achieve energy-efficient multicore architecture designs. Even though chips have multiple cores in these designs, cache constraints such as size, latency, concurrency, and scalability still apply. Multicore clusters must therefore implement alternative solutions to the shared cache access problem. Bigger or more frequently accessed caches consume more energy, which is a problem in explicit multithread concurrency. In this work, we simulate different multicore cluster architectures to identify the best configuration in terms of energy efficiency, concerning a varying number of cores, cache sizes and sharing strategies. We also observe the simultaneous and individual multithreading concurrency of two application groups. The results showed that for applications with regular tasks loads, the simultaneous multithreading approach was 43.6% better than the individual one, in terms of energy consumption. For irregular tasks loads, individual executions proved to be the best option, with an increase of up to 81.3% in energy efficiency. 
We also concluded that shared L2 caches were up to 13.4% more energy-efficient than private cache configurations.","PeriodicalId":325990,"journal":{"name":"2017 International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121753529","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallel Algorithm for Dynamic Community Detection","authors":"Hugo Resende, Á. Fazenda, M. G. Quiles","doi":"10.1109/SBAC-PADW.2017.18","DOIUrl":"https://doi.org/10.1109/SBAC-PADW.2017.18","url":null,"abstract":"Many real systems can be naturally modeled by complex networks. A complex network represents an abstraction of the system regarding its components and their respective interactions. Thus, by scrutinizing the network, interesting properties of the system can be revealed. Among them, the presence of communities, which consists of groups of densely connected nodes, is a significant one. For instance, a community might reveal patterns, such as the functional units of the system, or even groups correlated people in social networks. Albeit important, the community detection process is not a simple computational task, in special when the network is dynamic. Thus, several researchers have addressed this problem providing distinct methods, especially to deal with static networks. Recently, a new algorithm was introduced to solve this problem. The approach consists of modeling the network as a set of particles inspired by a N-body problem. Besides delivering similar results to state-of-the-art community detection algorithm, the proposed model is dynamic in nature; thus, it can be straightforwardly applied to time-varying complex networks. However, the Particle Model still has a major drawback. Its computational cost is quadratic per cycle, which restricts its application to mid-scale networks. To overcome this limitation, here, we present a novel parallel algorithm using many-core high-performance resources. Through the implementation of a new data structure, named distance matrix, was allowed a massive parallelization of the particles interactions. 
Simulation results show that our parallel approach, running on both traditional multicore CPUs and GPU-based hardware accelerators, can speed up the method, permitting its application to large-scale networks.","PeriodicalId":325990,"journal":{"name":"2017 International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW)","volume":"75 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133833120","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
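The quadratic per-cycle cost the abstract mentions comes from evaluating all pairwise particle interactions. A vectorized NumPy sketch of such an all-pairs distance kernel, as a generic illustration of why this step maps well to many-core hardware (the paper's actual distance-matrix structure and kernels are not detailed in this record):

```python
import numpy as np

def pairwise_sq_distances(pos):
    """All-pairs squared Euclidean distances for N particles.
    pos: (N, d) array of positions. Returns an (N, N) matrix.
    One vectorized O(N^2) evaluation -- embarrassingly parallel."""
    diff = pos[:, None, :] - pos[None, :, :]   # (N, N, d) displacement tensor
    return (diff ** 2).sum(axis=-1)
```

Every entry of the matrix is independent, so the same computation tiles naturally across CPU vector lanes or GPU threads, which is the parallelism the paper exploits.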
Felipe L. Teixeira, M. Pilla, A. R. D. Bois, D. Mossé
{"title":"Impact of Version Management for Transactional Memories on Phase-Change Memories","authors":"Felipe L. Teixeira, M. Pilla, A. R. D. Bois, D. Mossé","doi":"10.1109/SBAC-PADW.2017.24","DOIUrl":"https://doi.org/10.1109/SBAC-PADW.2017.24","url":null,"abstract":"Two of the major issues in current computer systems are energy consumption and how to explore concurrent systems in a correct and efficient way. Solutions for these hazards may be sought both in hardware and in software. Phase-Change Memory (PCM) is a memory technology intended to replace DRAMs (Dynamic Random Access Memories) as the main memory, providing reduced static power consumption. Their main problem is related to write operations that are slow and wear their material. Transactional Memories are synchronization methods developed to reduce the limitations of lock-based synchronization. Their main advantages are related to being high-level and allowing composition and reuse of code, besides the absence of deadlocks. The objective of this study is to analyze the impact of different versioning managers (VMs) for transactional memories in PCMs. The lazy versioning/lazy acquisition scheme for version management presented the lowest wear on the PCM in 3 of 7 benchmarks analyzed, and results similar to the alternative versioning for the other 4~benchmarks. 
These results are related to the number of aborts of the VMs: this VM presents a much smaller number of aborts than the others, up to 39 times fewer in the experiment with the Kmeans benchmark with 64 threads.","PeriodicalId":325990,"journal":{"name":"2017 International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124798445","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
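Lazy versioning's lower PCM wear follows from buffering transactional writes privately and flushing only the final value of each location at commit, so repeated or aborted writes never touch the PCM. A toy sketch of that idea (an illustration of lazy versioning in general, not the paper's system; the `Transaction` class and its methods are hypothetical):

```python
class Transaction:
    """Toy lazy-versioning transaction: writes go to a private
    buffer and reach main memory only at commit time."""

    def __init__(self, memory):
        self.memory = memory   # shared dict standing in for PCM
        self.writes = {}       # private write buffer

    def read(self, addr):
        # Read-your-own-writes, else fall through to memory.
        return self.writes.get(addr, self.memory.get(addr))

    def write(self, addr, value):
        self.writes[addr] = value  # no PCM write yet

    def commit(self):
        # Only the final value per location is written to PCM,
        # which is what reduces wear on write-limited memory.
        self.memory.update(self.writes)
        self.writes.clear()

    def abort(self):
        self.writes.clear()  # aborted writes never reach PCM
```

Under eager versioning, by contrast, every speculative write (including those of aborted transactions) would hit memory and then need undoing, which is why abort counts correlate with wear in the paper's results.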
F. L. Cabral, Carla Osthoff, Gabriel P. Costa, Diego N. Brandão, M. Kischinhevsky, S. L. G. D. Oliveira
{"title":"Tuning Up TVD HOPMOC Method on Intel MIC Xeon Phi Architectures with Intel Parallel Studio Tools","authors":"F. L. Cabral, Carla Osthoff, Gabriel P. Costa, Diego N. Brandão, M. Kischinhevsky, S. L. G. D. Oliveira","doi":"10.1109/SBAC-PADW.2017.12","DOIUrl":"https://doi.org/10.1109/SBAC-PADW.2017.12","url":null,"abstract":"This paper focuses on the parallelization of TVD Method scheme for numerical time integration of evolutionary differential equations. The Hopmoc method for numerical integration of differential equations was developed aiming at benefiting from both the concept of integration along characteristic lines as well as from the spatially decomposed Hopscotch method. The set of grid points is initially decomposed into two subsets during the implementation of the integration step. Then, two updates are performed, one explicit and one implicit, on each variable in the course of the iterative process. Each update requires an integration semi step. This is carried out along characteristic lines in a Semi-Lagrangian scheme based on the Modified Method of Characteristics. This work analises two strategies to implement the parallel version of TVD Hopmoc based on the analysis performed by Intel Tools such Parallel and Threading Advisor. 
A naive solution is replaced by a chunked-loop strategy in order to avoid fine-grained tasks inside the main loops.","PeriodicalId":325990,"journal":{"name":"2017 International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW)","volume":"70 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133923991","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
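The chunked-loop strategy amounts to coarsening task granularity: one task per contiguous block of grid points rather than one per point, so scheduling overhead is amortized over many updates. A generic Python sketch of that idea (the actual TVD Hopmoc code targets Xeon Phi with native threading, and `update_chunk`/`chunked_update` are hypothetical names standing in for a real semi-step):

```python
from concurrent.futures import ThreadPoolExecutor

def update_chunk(grid, start, stop):
    """Stand-in for one Hopmoc semi-step over a contiguous chunk."""
    for i in range(start, stop):
        grid[i] = grid[i] * 0.5  # placeholder update

def chunked_update(grid, num_chunks):
    """Submit one coarse task per chunk instead of one per grid point,
    amortizing task-creation overhead over many updates."""
    n = len(grid)
    bounds = [(k * n // num_chunks, (k + 1) * n // num_chunks)
              for k in range(num_chunks)]
    with ThreadPoolExecutor() as pool:  # exiting the block joins all tasks
        for start, stop in bounds:
            pool.submit(update_chunk, grid, start, stop)
    return grid
```

With one task per point, the per-task dispatch cost would dominate the tiny body of work; chunking is the standard cure that the Advisor-style analysis typically recommends.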
Caio B. G. Carvalho, V. C. Ferreira, F. França, C. Bentes, Tiago A. O. Alves, A. Sena, L. A. J. Marzulo
{"title":"Towards a Dataflow Runtime Environment for Edge, Fog and In-Situ Computing","authors":"Caio B. G. Carvalho, V. C. Ferreira, F. França, C. Bentes, Tiago A. O. Alves, A. Sena, L. A. J. Marzulo","doi":"10.1109/SBAC-PADW.2017.28","DOIUrl":"https://doi.org/10.1109/SBAC-PADW.2017.28","url":null,"abstract":"In the dataflow computation model, instructions or tasks are fired according to their data dependencies, instead of following program order, thus allowing natural parallelism exploitation. Dataflow has been used, in different flavors and abstraction levels (from processors to runtime libraries), as an interesting alternative for harnessing the potential of modern computing systems. Sucuri is a dataflow library for Python that allows users to specify their application as a dependency graph and execute it transparently at clusters of multicores, while taking care of scheduling issues. Recent trends in Fog and In-situ computing assumes that storage and network devices will be equipped with processing elements that usually have lower power consumption and performance. An important decision on such system is whether to move data to traditional processors (paying the communication costs), or performing computation where data is sitting, using a potentially slower processor. Hence, runtime environments that deal with that trade-off are extremely necessary. This work takes a first step towards a solution that considers Edge/Fog/In-situ in a dataflow runtime. We use Sucuri to manage the execution in a small system with a regular PC and a Parallella board. 
Experiments with text processing applications running with different input sizes, network latencies, and packet loss rates allow a discussion of scenarios where this approach would be fruitful.","PeriodicalId":325990,"journal":{"name":"2017 International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW)","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124874393","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
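The dataflow firing rule this abstract relies on — a node runs once all of its inputs have produced values — can be sketched with a minimal scheduler. This illustrates the model only; it is not the Sucuri API, and `run_dataflow` is a hypothetical name:

```python
from collections import defaultdict, deque

def run_dataflow(funcs, deps):
    """funcs: {node: callable taking one argument per dependency}.
    deps: {node: [dependency names]} (nodes absent from deps are sources).
    Fires each node exactly once, when all of its inputs are ready."""
    indeg = {n: len(deps.get(n, [])) for n in funcs}
    consumers = defaultdict(list)
    for node, ds in deps.items():
        for d in ds:
            consumers[d].append(node)
    ready = deque(n for n, k in indeg.items() if k == 0)  # source nodes
    results = {}
    while ready:
        node = ready.popleft()
        args = [results[d] for d in deps.get(node, [])]
        results[node] = funcs[node](*args)  # fire the node
        for c in consumers[node]:           # notify downstream nodes
            indeg[c] -= 1
            if indeg[c] == 0:
                ready.append(c)
    return results
```

In an Edge/Fog setting, the interesting extension (which the paper explores) is the placement decision: whether each fired node executes on the fast remote processor or on the slow processor sitting next to the data.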