{"title":"Energy Consumption and Scalability Evaluation for Software Transactional Memory on a Real Computing Environment","authors":"T. Rico, M. Pilla, A. R. D. Bois, R. M. Duarte","doi":"10.1109/SBAC-PADW.2015.11","DOIUrl":"https://doi.org/10.1109/SBAC-PADW.2015.11","url":null,"abstract":"Transactional Memory is a concurrent programming abstraction that overcomes several of the limitations found in traditional synchronization mechanisms. As it is a more recent abstraction, little is known about energy consumption of Software Transactional Memories (STM). In this context, this work presents an analysis and characterization of energy consumption and performance of four Transactional Memory libraries: TL2, Tiny STM, Swiss TM, and Adapt STM, using the STAMP benchmarks. Although most works in the state-of-the-art chose to evaluate Transactional Memories through simulation, in this work the benchmarks are run in actual computers, avoiding the known issues with modeling power consumption in simulators. Our results show that Swiss TM is the most efficient library of the four in terms of energy consumption and performance for the default configurations, followed by Adapt STM, Tiny STM, and TL2, for most of the execution scenarios and 8 threads at most. STM's scalability is directly tied to the strategies for detection and resolution of conflicts. In this perspective, Adapt STM is the best STM for applications with short transactions, Swiss TM presents the best results for medium transactions, and long transactions with medium/high contention are best handled by TL2. 
On the other hand, Tiny STM shows the worst scalability for most scenarios, but with good results for applications with very small abort rates.","PeriodicalId":161685,"journal":{"name":"2015 International Symposium on Computer Architecture and High Performance Computing Workshop (SBAC-PADW)","volume":"689 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123826997","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Characterizing Anomalies of a Multicore ARMv7 Cluster with Parallel N-Body Simulations","authors":"J. L. Bez, L. Schnorr, P. Navaux","doi":"10.1109/SBAC-PADW.2015.18","DOIUrl":"https://doi.org/10.1109/SBAC-PADW.2015.18","url":null,"abstract":"ARM processors are beginning to gain attention from the HPC community due to its performance and energy efficiency characteristics. When developing HPC applications for such test beds developers assume that the computation resources available are homogeneous. However, we observed some anomalies when executing a relatively simple HPC application (an NBody simulation). One of the cores in all available nodes presented some variabilities in the computation time. This unexpected behavior was not observed on the second core of each node. In this paper, we aim at characterizing such anomalies, seen in a multicore ARMv7 8-node cluster. We also attempted to isolate and remove all possible interferences that could be contributing to this unexpected behavior, including compilation directives, dynamic processor frequency scaling and communication. Results show that such anomaly might be correlated with the architecture of the dual-core chip. 
We also analyze the effects of different deployments of MPI process in the total execution time and correlate them to the application and the test bed.","PeriodicalId":161685,"journal":{"name":"2015 International Symposium on Computer Architecture and High Performance Computing Workshop (SBAC-PADW)","volume":"143 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131644069","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On the Evaluation of Contention-Aware List Schedulers on Multicore Cluster","authors":"Juliana Zamith, Thiago Silva, Lúcia M. A. Drummond, Cristina Boeres, C. Bentes","doi":"10.1109/SBAC-PADW.2015.19","DOIUrl":"https://doi.org/10.1109/SBAC-PADW.2015.19","url":null,"abstract":"Parallel applications composed of a set of tasks that follow a partial precedence order represent an important class of scientific applications. In high performance computing, environments dedicated to scientific applications are composed of clusters of multicore machines, which consist typically of a set of processing cores that partially share a hierarchy of cache memory. Harnessing the available memory is crucial to achieve good performance in these clusters. This paper proposes strategies based on the list scheduling framework to schedule application tasks on individual cores of multicore clusters. Our idea is to minimize the execution time of the application, by taking into consideration cache contention. Experiments with a representative set of applications show that the scheduling algorithms with contention-aware mechanisms can improve significantly the application performance.","PeriodicalId":161685,"journal":{"name":"2015 International Symposium on Computer Architecture and High Performance Computing Workshop (SBAC-PADW)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124066017","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RadFlow: An Interest-Centric Task Based Dataflow Runtime","authors":"D. Dutra, Heberte F. Moraes, C. Amorim","doi":"10.1109/SBAC-PADW.2015.26","DOIUrl":"https://doi.org/10.1109/SBAC-PADW.2015.26","url":null,"abstract":"We present RadFlow a runtime system for task based Dataflow parallel application using an interest-centric network protocol to data communication among task. The RadNet protocol ability to decouple data destinations from its node IP addresses allows Rad Flow more flexibility, enabling mechanisms like computation migration and elastic tasks to be carried out. We also demonstrate how to create a Bag-of-Task, a fork/join, as well as an elastic fork/join Dataflow parallel application for the Rad Flow runtime. Furthermore, an elastic Dataflow application provides the application developer means to cope with the failures rates in future Exascale environments.","PeriodicalId":161685,"journal":{"name":"2015 International Symposium on Computer Architecture and High Performance Computing Workshop (SBAC-PADW)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128442021","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Intra-Clustering: Accelerating On-chip Communication for Data Parallel Architectures","authors":"Wen Yuan, R. Boyapati, Lei Wang, Hyunjun Jang, Yuho Jin, K. H. Yum, Eun Jung Kim","doi":"10.1109/SBAC-PADW.2015.15","DOIUrl":"https://doi.org/10.1109/SBAC-PADW.2015.15","url":null,"abstract":"Modern computation workloads contain abundant Data Level Parallelism (DLP), which requires specialized data parallel architectures, such as Graphics Processing Units (GPUs). With parallel programming models, such as CUDA and OpenCL, GPUs are easily to be programmed for non-graphics applications, and therefore become a cost effective approach for data parallel architectures. The large quantity of available parallelism places a heavy stress on the memory system as the limited number of pins confines the number of memory controllers on the chip. This creates a potential bottleneck for performance scalability of the GPUs. To accelerate communication with the memory system, we propose the Intra-Clustering on-chip network for data parallel architectures, which is built upon a traditional two-dimensional electrical mesh network with memory controllers connected through a nanophotonic ring and compute cores grouped into different clusters. 
Our evaluations with CUDA benchmarks show that the Intra-Clustering architecture can improve communication delay by an average of 17% (up to 32%) and IPC by an average of 5% (up to 11.5%).","PeriodicalId":161685,"journal":{"name":"2015 International Symposium on Computer Architecture and High Performance Computing Workshop (SBAC-PADW)","volume":"78 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126868365","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Graph Templates for Dataflow Programming","authors":"A. Sena, Eduardo S. Vaz, F. França, L. A. J. Marzulo, Tiago A. O. Alves","doi":"10.1109/SBAC-PADW.2015.20","DOIUrl":"https://doi.org/10.1109/SBAC-PADW.2015.20","url":null,"abstract":"Current works on parallel programming models are trending towards the dataflow paradigm, which naturally exploits parallelism in programs. The Sucuri Python Library provides basic features for creation and execution of dataflow graphs in parallel environments. However, there is still a gap between dataflow programming and traditional parallel programming. In this paper we aim at narrowing that gap by introducing a set of templates for Sucuri that represent some of the most important parallel programming patterns. Through these templates programmers can implement applications that use patterns such as fork/join, pipeline and wave front just by instantiating and connecting sub-graph objects. Evaluation showed that the use of templates makes programming easier, while allowing a significant reduction in lines of code, compared to manually creating the dataflow graph.","PeriodicalId":161685,"journal":{"name":"2015 International Symposium on Computer Architecture and High Performance Computing Workshop (SBAC-PADW)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123522341","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Evaluating Overhead and Contention in Concurrent Accesses to a Graph","authors":"Israel da Silva Barbara, Nicolas O. de Araujo, A. R. D. Bois, G. H. Cavalheiro","doi":"10.1109/SBAC-PADW.2015.27","DOIUrl":"https://doi.org/10.1109/SBAC-PADW.2015.27","url":null,"abstract":"The current spread of multicore processors reinforces the need for strategies to implement mutithreaded programs. Since using synchronization methods to coordinate the access to shared data introduces contention, finding new strategies to implement concurrent data structures can lead to performance gains. This paper introduces a case study in which a graph data structure is implemented using low contention strategies: one based on low level atomic operations, one based on mutexes and another using transactional memory. Results show that the first presents better performance, the second the worst performance and the later a higher level of abstraction for programmers with a similar performance to the first.","PeriodicalId":161685,"journal":{"name":"2015 International Symposium on Computer Architecture and High Performance Computing Workshop (SBAC-PADW)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127986574","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Parallel Implementation of Data Fusion Algorithm Using Gamma","authors":"Rui R. Mello Junior, Rubens H. P. de Almeida, F. França, G. Paillard","doi":"10.1109/SBAC-PADW.2015.25","DOIUrl":"https://doi.org/10.1109/SBAC-PADW.2015.25","url":null,"abstract":"In this paper we carried out designing and implementing of a target tracking data fusion algorithm based on a two stages graph solution using the computational model Gamma (General Abstract Model for Multiset mAnipulation). The proposed solution is the first parallel implementation of the method PPTS (Pairs of Plots in Two Stages). For this, we employed three Gamma implementations, where two of them exploited the resources of a parallel hardware environment, one using the MPI (Message Passing Interface) and the other one GPU (Graphics Processing Unit). Thus, the studied algorithm was evaluated from the parallelism exploited and finally was carried out a performance analysis of this algorithm in the three Gamma implementations used. The aim of this study is to provide an implementation on a real problem using for this the paradigm Gamma, which contributes to the implementations of the Gamma computational model, since it enables the performance analysis of these implementations and provides some suggestions for possible improvements. 
In addition, this work contributes to the PPTS method since it provides the parallelization of the first stage.","PeriodicalId":161685,"journal":{"name":"2015 International Symposium on Computer Architecture and High Performance Computing Workshop (SBAC-PADW)","volume":"116 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127428777","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploiting Parallelism in Linear Algebra Kernels through Dataflow Execution","authors":"Brunno F. Goldstein, F. França, L. A. J. Marzulo, Tiago A. O. Alves","doi":"10.1109/SBAC-PADW.2015.21","DOIUrl":"https://doi.org/10.1109/SBAC-PADW.2015.21","url":null,"abstract":"Linear Algebra Kernels have an important role in many petroleum reservoir simulators, extensively used by the industry. The growth in problem size, specially in pre-salt exploration, has caused an increase in execution time of those kernels, thus requiring parallel programming to improve performance and make the simulation viable. On the other hand, exploiting parallelism in systems with an ever increasing number of cores may be an arduous task, as the programmer has to manage threads and care about synchronization issues. Current work on parallel programming models show that Dataflow Execution exploits parallelism in a natural way, allowing the programmer to focus solely on describing dependencies between portions of code. This work consists in implementing parallel Linear Algebra Kernels using the Dataflow model. The Trebuchet Dataflow Virtual Machine and the Sucuri Dataflow Library were used to evaluate the solutions with real inputs from reservoir simulators. Results have been compared with OpenMP and Intel Math Kernel Library and show that coarser-grained tasks are needed to hide the overheads of dataflow runtime environments. 
Therefore, level 2 and 3 linear algebra operations, such as Vector-Matrix and Matrix-Matrix products, presented the most promising results.","PeriodicalId":161685,"journal":{"name":"2015 International Symposium on Computer Architecture and High Performance Computing Workshop (SBAC-PADW)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126177563","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Impact of Version Management on Transactional Memories' Performance","authors":"Felipe L. Teixeira, M. Pilla, A. R. D. Bois, D. Mossé","doi":"10.1109/SBAC-PADW.2015.14","DOIUrl":"https://doi.org/10.1109/SBAC-PADW.2015.14","url":null,"abstract":"Software Transactional Memory (STM) is a synchronization method proposed as an alternative to lock-based synchronization. It provides a higher-level of abstraction that is easier to program, and that enables software composition. Transactions are defined by programmers, but the runtime system is responsible for detecting conflicts and avoiding race conditions. One of the design axis in STMs is how version management is implemented in order to secure atomicity. There are two type of version management: Eager Versioning and Lazy Versioning. In this work, we evaluate the version management options implemented in Tiny STM through an orthogonal analysis and performance evaluation.","PeriodicalId":161685,"journal":{"name":"2015 International Symposium on Computer Architecture and High Performance Computing Workshop (SBAC-PADW)","volume":"41 5","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120914697","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}