{"title":"High Performance by Exploiting Information Locality through Reverse Computing","authors":"Mouad Bahi, C. Eisenbeis","doi":"10.1109/SBAC-PAD.2011.10","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2011.10","url":null,"abstract":"In this paper we present performance results for our register rematerialization technique based on reverse recomputing. Rematerialization adds instructions and we show on one specifically designed example that reverse computing alleviates the impact of these additional instructions on performance. We also show how thread parallelism may be optimized on GPUs by performing register allocation with reverse recomputing that increases the number of threads per Streaming Multiprocessor (SM). This is done on the main kernel of Lattice Quantum Chromo Dynamics (LQCD) simulation program where we gain a 10.84% speedup.","PeriodicalId":390734,"journal":{"name":"2011 23rd International Symposium on Computer Architecture and High Performance Computing","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128108810","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
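The register rematerialization idea above can be pictured as follows: instead of spilling a live register to memory and reloading it, the value is reconstructed later by running the (invertible) instructions that clobbered it backwards. The toy instruction set and inversion table below are illustrative assumptions for a minimal Python sketch, not the authors' compiler implementation:

```python
# Sketch: recovering an overwritten register value by reverse execution
# instead of spilling/reloading it. The tiny "ISA" below (add/sub/xor)
# and its inversion table are illustrative assumptions, not the paper's.

INVERSE = {
    "add": "sub",   # r += k  is undone by  r -= k
    "sub": "add",   # r -= k  is undone by  r += k
    "xor": "xor",   # r ^= k  is its own inverse
}

def apply_op(op, value, operand):
    if op == "add":
        return value + operand
    if op == "sub":
        return value - operand
    if op == "xor":
        return value ^ operand
    raise ValueError(op)

def rematerialize(current, trace):
    """Rebuild the register's earlier value by running the recorded
    (invertible) operations backwards, so no spill slot is needed."""
    for op, operand in reversed(trace):
        current = apply_op(INVERSE[op], current, operand)
    return current

r = 42                                   # value we would normally spill
trace = [("add", 7), ("xor", 0x5A)]      # invertible ops that clobber r
for op, k in trace:
    r = apply_op(op, r, k)
assert rematerialize(r, trace) == 42     # recovered without memory traffic
```

Freeing registers this way is what allows more threads to fit on a Streaming Multiprocessor, which is the effect the paper exploits on the LQCD kernel.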
{"title":"Accelerating Maximum Likelihood Based Phylogenetic Kernels Using Network-on-Chip","authors":"Turbo Majumder, P. Pande, A. Kalyanaraman","doi":"10.1109/SBAC-PAD.2011.17","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2011.17","url":null,"abstract":"Probability-based approaches for phylogenetic inference, like Maximum Likelihood (ML) and Bayesian Inference, provide the most accurate estimate of evolutionary relationships among species. But they come at a high algorithmic and computational cost. Network-on-chip (NoC), being an emerging paradigm, has not been explored yet to achieve fine-grained parallelism for these applications. In this paper, we present the design and performance evaluation of an NoC architecture for RAxML, which is one of the most widely used ML software suites. Specifically, we implement the top three function kernels that account for more than 85% of the total run-time. Simulations show that through novel core design, allocation and placement strategies our NoC-based implementation can achieve function-level speedups of 388x to 786x and system-level speedups in excess of 5000x over state-of-the-art multithreaded software.","PeriodicalId":390734,"journal":{"name":"2011 23rd International Symposium on Computer Architecture and High Performance Computing","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130146448","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
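For context on why a few kernels dominate the run time: the central ML computation in tools such as RAxML repeatedly combines per-site conditional likelihood vectors up the tree (Felsenstein pruning), once per site and per internal node. The sketch below shows that generic textbook kernel under a simple Jukes-Cantor model; it is an illustration of the workload, not the paper's NoC implementation:

```python
import numpy as np

def jc_transition(t, mu=1.0):
    """Jukes-Cantor transition-probability matrix for branch length t."""
    p_same = 0.25 + 0.75 * np.exp(-4.0 * mu * t / 3.0)
    p_diff = 0.25 - 0.25 * np.exp(-4.0 * mu * t / 3.0)
    P = np.full((4, 4), p_diff)
    np.fill_diagonal(P, p_same)
    return P

def conditional_likelihood(left_cl, right_cl, t_left, t_right):
    """Felsenstein-pruning update for one internal node and one site:
    combine the children's conditional likelihood vectors through the
    transition matrices of the two branches. ML tools evaluate this for
    every site and every internal node, which is where the run time goes."""
    return (jc_transition(t_left) @ left_cl) * (jc_transition(t_right) @ right_cl)

# Leaves observe A (index 0) and C (index 1): one-hot likelihood vectors.
leaf_a = np.array([1.0, 0.0, 0.0, 0.0])
leaf_c = np.array([0.0, 1.0, 0.0, 0.0])
print(conditional_likelihood(leaf_a, leaf_c, 0.1, 0.2))
```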
{"title":"Parallel Biological Sequence Comparison on Heterogeneous High Performance Computing Platforms with BSP++","authors":"Khaled Hamidouche, F. Mendonca, J. Falcou, D. Etiemble","doi":"10.1109/SBAC-PAD.2011.16","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2011.16","url":null,"abstract":"Biological Sequence Comparison is an important operation in Bioinformatics that is often used to relate organisms. Smith and Waterman proposed an exact algorithm (SW) that compares two sequences in quadratic time and space. Due to high computing and memory requirements, SW is usually executed on HPC platforms such as multicore clusters and CellBEs. Since HPC architectures exhibit very different hardware characteristics, porting an application between them is an error-prone, time-consuming task. BSP++ is an implementation of BSP that aims to reduce the effort needed to write parallel code. In this paper, we propose and evaluate a parallel BSP++ strategy to execute SW on multiple platforms and programming models: MPI, OpenMP, MPI/OpenMP, CellBE and MPI/CellBE. The results obtained with real DNA sequences show that the performance of our versions is comparable to those reported in the literature, evidencing the appropriateness and flexibility of our approach.","PeriodicalId":390734,"journal":{"name":"2011 23rd International Symposium on Computer Architecture and High Performance Computing","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132096535","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
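The quadratic time and space mentioned above come directly from the Smith-Waterman dynamic-programming recurrence. A plain sequential sketch (arbitrary scoring constants, no BSP++ parallelization) makes the cost structure visible:

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Naive O(len(a)*len(b)) time and space local alignment score,
    illustrating why large comparisons need HPC platforms."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            H[i][j] = max(0, diag, H[i-1][j] + gap, H[i][j-1] + gap)
            best = max(best, H[i][j])
    return best

print(smith_waterman("ACACACTA", "AGCACACA"))
```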
{"title":"FAIRIO: An Algorithm for Differentiated I/O Performance","authors":"Sarala Arunagiri, Yipkei Kwok, P. Teller, Ricardo Portillo, Seetharami R. Seelam","doi":"10.1109/SBAC-PAD.2011.26","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2011.26","url":null,"abstract":"Providing differentiated service in a consolidated storage environment is a challenging task. To address this problem, we introduce FAIRIO, a cycle-based I/O scheduling algorithm that provides differentiated service to workloads concurrently accessing a consolidated RAID storage system. FAIRIO enforces proportional sharing of I/O service through fair scheduling of disk time. During each cycle of the algorithm, I/O requests are scheduled according to workload weights and disk-time utilization history. Experiments, which were driven by the I/O request streams of real and synthetic I/O benchmarks and run on a modified version of DiskSim, provide evidence of FAIRIO's effectiveness and demonstrate that fair scheduling of disk time is key to achieving differentiated service. In particular, the experimental results show that, for a broad range of workload request types, sizes, and access characteristics, the algorithm provides differentiated storage throughput that is within 10% of being perfectly proportional to workload weights, and, it achieves this with little or no degradation of aggregate throughput. The core design concepts of FAIRIO, including service-time allocation and history-driven compensation, potentially can be used to design I/O scheduling algorithms that provide workloads with differentiated service in storage systems comprised of RAIDs, multiple RAIDs, SANs, and hypervisors for Clouds.","PeriodicalId":390734,"journal":{"name":"2011 23rd International Symposium on Computer Architecture and High Performance Computing","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125914282","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
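A rough sketch of the proportional-sharing idea: in each cycle, disk time is allotted in proportion to workload weights, and workloads that received less than their share in earlier cycles are compensated. The cycle length, weights, and compensation rule below are illustrative assumptions, not FAIRIO's exact policy:

```python
def allocate_cycle(weights, deficits, cycle_time=100.0):
    """Split one cycle's disk time proportionally to workload weights,
    then add back each workload's accumulated deficit (time it was owed
    from earlier cycles), renormalized to the cycle length."""
    total_w = sum(weights.values())
    share = {w: cycle_time * weights[w] / total_w for w in weights}
    adjusted = {w: share[w] + deficits.get(w, 0.0) for w in weights}
    scale = cycle_time / sum(adjusted.values())
    return {w: t * scale for w, t in adjusted.items()}

# Workload 'a' is owed 20 time units from earlier cycles, so it receives
# more than its weight alone would grant in this cycle.
print(allocate_cycle({"a": 1, "b": 3}, {"a": 20.0}))
```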
{"title":"Speeding Up Learning in Real-Time Search through Parallel Computing","authors":"Vinícius Marques, L. Chaimowicz, R. Ferreira","doi":"10.1109/SBAC-PAD.2011.30","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2011.30","url":null,"abstract":"Real-time search algorithms solve the path planning problem regardless of the size and complexity of the maps and of the massive presence of entities in the same environment. In such methods, the learning step aims to avoid local minima and to improve the results of future searches, ensuring convergence to the optimal path when the same planning task is solved repeatedly. However, performing search in a limited area due to real-time constraints makes the run to convergence a lengthy process. In this work, we present a parallelization strategy that aims to reduce the time to convergence while maintaining the real-time properties of the search. The parallelization technique consists of using auxiliary searches that are free of the real-time restrictions imposed on the main search. In addition, the same learning is shared by all searches. The empirical evaluation shows that, even with the additional cost required to coordinate the auxiliary searches, the reduction in time to convergence is significant, with gains ranging from searches in environments with few local minima to larger searches on complex maps, where the performance improvement is even greater.","PeriodicalId":390734,"journal":{"name":"2011 23rd International Symposium on Computer Architecture and High Performance Computing","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125398898","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
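The "learning step" referred to above is, in LRTA*-style real-time search, an update of a heuristic table after each bounded lookahead; sharing that table is what lets auxiliary searches speed up convergence. A single-agent sketch of the update rule on a toy graph (no auxiliary searches shown; the graph and costs are illustrative):

```python
def lrta_star_step(state, neighbors, cost, h):
    """One LRTA*-style step: raise h(state) to the best one-step lookahead
    value, then move greedily. Repeating this converges h toward the true
    cost-to-goal, which is the 'learning' shared by all searches."""
    best_next = min(neighbors(state), key=lambda s: cost(state, s) + h[s])
    h[state] = max(h[state], cost(state, best_next) + h[best_next])
    return best_next

# Tiny illustrative graph: states 0..3, goal = 3, unit edge costs.
edges = {0: [1], 1: [0, 2], 2: [1, 3], 3: [3]}
h = {0: 0, 1: 0, 2: 0, 3: 0}
state = 0
while state != 3:
    state = lrta_star_step(state, lambda s: edges[s], lambda a, b: 1, h)
print(h)   # learned heuristic values after one trial
```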
{"title":"Why Online Dynamic Mesh Refinement is Better for Parallel Climatological Models","authors":"C. Schepke, N. Maillard, Jörg Schneider, Hans-Ulrich Heiß","doi":"10.1109/SBAC-PAD.2011.14","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2011.14","url":null,"abstract":"The forecast precision of climatological models is limited by the computing power and time available for the executions. As more and faster processors are used in the computation, the resolution of the mesh adopted to represent the Earth's atmosphere can be increased, and consequently the numerical forecast becomes more accurate and captures local phenomena. However, a mesh resolution fine enough to include local phenomena in a global atmosphere integration is still not feasible. To overcome this situation, different mesh refinement levels can be used at the same time for different areas. In this context, this paper evaluates how mesh refinement at run time can improve performance for climatological models. To support this analysis, an online dynamic mesh refinement was developed. It increases the mesh resolution in parts of a parallel distributed model when particular atmospheric conditions are detected during execution. The results show that the parallel execution of this improvement provides better mesh resolution without a significant increase in execution time.","PeriodicalId":390734,"journal":{"name":"2011 23rd International Symposium on Computer Architecture and High Performance Computing","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133527071","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
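The refinement trigger described above can be sketched as a simple rule: cells whose monitored atmospheric quantity exceeds a threshold are subdivided, while the rest keep the coarse resolution. The 2D quadrisection and threshold below are illustrative assumptions, not the model's actual criterion:

```python
def refine_where_active(cells, values, threshold):
    """Quadrisect (in 2D) every coarse cell whose monitored value exceeds
    the threshold; cells elsewhere keep the coarse resolution, so extra
    cost is paid only where the local phenomenon appears."""
    refined = []
    for (x, y, size), v in zip(cells, values):
        if v > threshold:
            half = size / 2.0
            refined += [(x, y, half), (x + half, y, half),
                        (x, y + half, half), (x + half, y + half, half)]
        else:
            refined.append((x, y, size))
    return refined

coarse = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(refine_where_active(coarse, [0.2, 0.9], threshold=0.5))
```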
{"title":"Improving the Accuracy of High Performance BLAS Implementations Using Adaptive Blocked Algorithms","authors":"M. Badin, P. D'Alberto, L. Bic, M. Dillencourt, A. Nicolau","doi":"10.1109/SBAC-PAD.2011.21","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2011.21","url":null,"abstract":"Matrix multiply is ubiquitous in scientific computing. Considerable effort has been spent on improving its performance. Once methods that make efficient use of the processor have been exhausted, methods that use fewer operations than the canonical matrix multiply must be explored. Combining the two methods yields a hybrid matrix multiply algorithm. Hybrid matrix multiply algorithms tend to be less accurate than the canonical matrix multiply implementation, leaving room for improvement. There are well-known techniques for improving accuracy, but they tend to be slow, and it is not immediately obvious how best to apply them to hybrid algorithms without lowering performance. Previous attempts have focused on the bottom of the hybrid matrix multiply algorithm, modifying the high-performance matrix multiply implementation. In contrast, the top-down approach presented here does not require the modification of the high-performance matrix multiply implementation at the bottom, nor does it require modification of the fast asymptotic matrix multiply algorithm at the top. The three-level hybrid algorithm presented here not only has up to 10% better performance than the fastest high-performance matrix multiply, but is also more accurate.","PeriodicalId":390734,"journal":{"name":"2011 23rd International Symposium on Computer Architecture and High Performance Computing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133151306","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
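The "hybrid" structure discussed above pairs an asymptotically fast algorithm at the top with a high-performance kernel at the bottom. The sketch below uses a Strassen recursion over NumPy's multiply with an arbitrary cutoff; it shows that structure (and the small numerical error such hybrids introduce), not the paper's three-level accuracy-improving scheme:

```python
import numpy as np

def hybrid_matmul(A, B, cutoff=128):
    """Strassen recursion above `cutoff`, ordinary (BLAS-backed) multiply
    below it: the classic hybrid that trades a few operations for some
    numerical accuracy, which the paper's top-down approach addresses."""
    n = A.shape[0]
    if n <= cutoff or n % 2:            # fall back to the fast base kernel
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    M1 = hybrid_matmul(A11 + A22, B11 + B22, cutoff)
    M2 = hybrid_matmul(A21 + A22, B11, cutoff)
    M3 = hybrid_matmul(A11, B12 - B22, cutoff)
    M4 = hybrid_matmul(A22, B21 - B11, cutoff)
    M5 = hybrid_matmul(A11 + A12, B22, cutoff)
    M6 = hybrid_matmul(A21 - A11, B11 + B12, cutoff)
    M7 = hybrid_matmul(A12 - A22, B21 + B22, cutoff)
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C

A, B = np.random.rand(256, 256), np.random.rand(256, 256)
print(np.max(np.abs(hybrid_matmul(A, B) - A @ B)))   # small, nonzero error
```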
{"title":"A New Parallel Schema for Branch-and-Bound Algorithms Using GPGPU","authors":"T. Carneiro, A. Muritiba, Marcos Negreiros, G. Campos","doi":"10.1109/SBAC-PAD.2011.20","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2011.20","url":null,"abstract":"This work presents a new parallel procedure designed to process combinatorial B&B algorithms using GPGPU. In our schema, we dispatch a number of threads that make intelligent use of the massively parallel processors of NVIDIA GeForce graphics units. The strategy is to sequentially build a series of initial searches that map a subspace of the B&B tree, and then to start a limited number of threads once a specific level of the tree is reached. The search is then processed massively by DFS. The whole subspace is optimized according to the memory and the thread and block limits available on the GPU. We compare our results with OpenMP and serial versions of the same search schema, using explicit enumeration (all possible solutions) on instances of the Asymmetric Travelling Salesman Problem. We also show the clear superiority of our GPGPU-based method.","PeriodicalId":390734,"journal":{"name":"2011 23rd International Symposium on Computer Architecture and High Performance Computing","volume":"77 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129227408","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
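The schema described above first enumerates the search tree sequentially down to a fixed level and then hands each frontier prefix to a GPU thread for exhaustive depth-first search. The CPU-only sketch below mimics that decomposition on a tiny ATSP instance (illustrative cost matrix, no CUDA, no bounding):

```python
from itertools import permutations

# Tiny asymmetric TSP cost matrix (illustrative data).
COST = [[0, 5, 9, 4],
        [6, 0, 2, 7],
        [8, 3, 0, 1],
        [4, 7, 2, 0]]
N = len(COST)

def frontier(level):
    """Sequential phase: enumerate all partial tours of `level` cities
    starting at city 0. On the GPU, each of these prefixes would seed
    one thread's depth-first search."""
    return [(0,) + p for p in permutations(range(1, N), level - 1)]

def dfs_cost(prefix):
    """'Thread' phase: exhaustively complete one prefix by DFS and return
    the cheapest full-tour cost found under it."""
    remaining = [c for c in range(N) if c not in prefix]
    best = float("inf")
    for tail in permutations(remaining):
        tour = prefix + tail
        cost = sum(COST[tour[i]][tour[i + 1]] for i in range(N - 1))
        cost += COST[tour[-1]][tour[0]]
        best = min(best, cost)
    return best

print(min(dfs_cost(p) for p in frontier(level=2)))   # optimal tour cost
```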
{"title":"Structure-Constrained Microcode Compression","authors":"E. Borin, G. Araújo, M. Breternitz, Youfeng Wu","doi":"10.1109/SBAC-PAD.2011.32","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2011.32","url":null,"abstract":"Microcode enables programmability of (micro) architectural structures to enhance functionality and to apply patches to an existing design. As more features get added to a CPU core, the area and power costs associated with microcode increase. One solution to address the microcode size issue is to store the microcode in a compressed form and decompress it during execution. Furthermore, the reuse of a single hardware building block layout to implement different dictionaries in the two-level microcode compression reduces the cost and the design time of the decompression engine. However, the reuse of the hardware building block imposes structural constraints on the compression algorithm, and existing algorithms may yield poor compression. In this paper, we develop the SC2 algorithm that considers the structural constraint in its objective function and reduces the area expansion when reusing hardware building blocks to implement different dictionaries. Our experimental results show that the SC2 algorithm produces similarly sized dictionaries and achieves a compression ratio similar to that of the non-constrained algorithm.","PeriodicalId":390734,"journal":{"name":"2011 23rd International Symposium on Computer Architecture and High Performance Computing","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131502766","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
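Two-level microcode compression of the kind discussed above replaces each wide microword with indices into per-field dictionaries of unique bit patterns; reusing one hardware building block for every dictionary is what imposes the structural constraint that SC2 targets. A sketch of plain, unconstrained dictionary compression with made-up field widths:

```python
def dictionary_compress(microwords, fields):
    """Split each wide microword into fields, build one dictionary of
    unique patterns per field, and store only per-field indices.
    `fields` gives (start, end) bit-slice positions of each field."""
    dictionaries = [[] for _ in fields]
    compressed = []
    for word in microwords:
        indices = []
        for d, (lo, hi) in zip(dictionaries, fields):
            pattern = word[lo:hi]
            if pattern not in d:
                d.append(pattern)
            indices.append(d.index(pattern))
        compressed.append(tuple(indices))
    return compressed, dictionaries

# Four 12-bit microwords split into three 4-bit fields (illustrative).
rom = ["000011110000", "000011111111", "101011110000", "000000001111"]
code, dicts = dictionary_compress(rom, [(0, 4), (4, 8), (8, 12)])
print(code, [len(d) for d in dicts])
```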
{"title":"Classification and Elimination of Conflicts in Hardware Transactional Memory Systems","authors":"M. Waliullah, P. Stenström","doi":"10.1109/SBAC-PAD.2011.18","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2011.18","url":null,"abstract":"This paper analyzes the sources of performance losses in hardware transactional memory and investigates techniques to reduce the losses. It dissects the root causes of data conflicts in hardware transactional memory systems (HTM) into four classes of conflicts: true sharing, false sharing, silent store, and write-write conflicts. These conflicts can cause performance and energy losses due to aborts and extra communication. To quantify losses, the paper first proposes the 5C cache-miss classification model that extends the well-established 4C model with a new class of cache misses known as contamination misses. The paper also contributes two techniques for the removal of data conflicts: one for removal of false sharing conflicts and another for removal of silent store conflicts. In addition, it revisits and adapts a technique that is able to reduce losses due to both true and false conflicts. All of the proposed techniques can be accommodated in a lazy versioning and lazy conflict resolution HTM built on top of a MESI cache-coherence infrastructure with quite modest extensions. Their ability to reduce performance losses is quantitatively established, individually as well as in combination. Performance is improved substantially.","PeriodicalId":390734,"journal":{"name":"2011 23rd International Symposium on Computer Architecture and High Performance Computing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130596394","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
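The four conflict classes above can be separated, conceptually, by comparing which words of a cache line each transaction touched and whether the written values actually changed. The classifier below is a schematic, word-granularity illustration of that taxonomy, not the paper's hardware mechanism:

```python
def classify_conflict(writer_words, reader_words, old_vals, new_vals,
                      reader_writes=frozenset()):
    """Classify a conflict on one cache line between a committing writer
    and a concurrent reader transaction, in the spirit of the paper's
    four classes (simplified, word-granularity model)."""
    overlap = writer_words & reader_words
    if not overlap:
        return "false sharing"            # same line, disjoint words
    if all(old_vals[w] == new_vals[w] for w in overlap):
        return "silent store"             # writes did not change the values
    if overlap & reader_writes:
        return "write-write"              # both transactions wrote the words
    return "true sharing"

print(classify_conflict({0, 1}, {2, 3}, {}, {}))                   # false sharing
print(classify_conflict({0}, {0}, {0: 7}, {0: 7}))                 # silent store
print(classify_conflict({0}, {0}, {0: 7}, {0: 8}, frozenset({0}))) # write-write
print(classify_conflict({0}, {0}, {0: 7}, {0: 8}))                 # true sharing
```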