2011 23rd International Symposium on Computer Architecture and High Performance Computing — Latest Publications

High Performance by Exploiting Information Locality through Reverse Computing
Mouad Bahi, C. Eisenbeis
DOI: 10.1109/SBAC-PAD.2011.10
Abstract: In this paper we present performance results for our register rematerialization technique based on reverse recomputing. Rematerialization adds instructions, and we show on one specifically designed example that reverse computing alleviates the impact of these additional instructions on performance. We also show how thread parallelism can be optimized on GPUs by performing register allocation with reverse recomputing, which increases the number of threads per Streaming Multiprocessor (SM). Applied to the main kernel of a Lattice Quantum Chromodynamics (LQCD) simulation program, this yields a 10.84% speedup.
Citations: 1
Accelerating Maximum Likelihood Based Phylogenetic Kernels Using Network-on-Chip
Turbo Majumder, P. Pande, A. Kalyanaraman
DOI: 10.1109/SBAC-PAD.2011.17
Abstract: Probability-based approaches for phylogenetic inference, such as Maximum Likelihood (ML) and Bayesian inference, provide the most accurate estimates of evolutionary relationships among species, but they come at a high algorithmic and computational cost. Network-on-chip (NoC), an emerging paradigm, has not yet been explored as a way to achieve fine-grained parallelism for these applications. In this paper, we present the design and performance evaluation of an NoC architecture for RAxML, one of the most widely used ML software suites. Specifically, we implement the top three function kernels, which account for more than 85% of the total run-time. Simulations show that through novel core design, allocation, and placement strategies, our NoC-based implementation can achieve function-level speedups of 388x to 786x and system-level speedups in excess of 5000x over state-of-the-art multithreaded software.
Citations: 6
Parallel Biological Sequence Comparison on Heterogeneous High Performance Computing Platforms with BSP++
Khaled Hamidouche, F. Mendonca, J. Falcou, D. Etiemble
DOI: 10.1109/SBAC-PAD.2011.16
Abstract: Biological sequence comparison is an important operation in bioinformatics, often used to relate organisms. Smith and Waterman proposed an exact algorithm (SW) that compares two sequences in quadratic time and space. Due to its high computing and memory requirements, SW is usually executed on HPC platforms such as multicore clusters and CellBEs. Since HPC architectures exhibit very different hardware characteristics, porting an application between them is an error-prone, time-consuming task. BSP++ is an implementation of BSP that aims to reduce the effort of writing parallel code. In this paper, we propose and evaluate a parallel BSP++ strategy to execute SW on multiple platforms: MPI, OpenMP, MPI/OpenMP, CellBE, and MPI/CellBE. The results obtained with real DNA sequences show that the performance of our versions is comparable to those in the literature, evidencing the appropriateness and flexibility of our approach.
Citations: 3
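The quadratic-time Smith-Waterman recurrence underlying this work can be sketched in a few lines (a minimal, unoptimized Python version with illustrative scoring parameters; the paper's BSP++ versions parallelize this dynamic program across the listed platforms):

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local-alignment score of strings a and b,
    filling a (len(a)+1) x (len(b)+1) DP table in quadratic time/space."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,                      # local alignment may restart
                          H[i - 1][j - 1] + s,    # align a[i-1] with b[j-1]
                          H[i - 1][j] + gap,      # gap in b
                          H[i][j - 1] + gap)      # gap in a
            best = max(best, H[i][j])
    return best
```

The `max(0, ...)` clamp is what makes the alignment local rather than global.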
FAIRIO: An Algorithm for Differentiated I/O Performance
Sarala Arunagiri, Yipkei Kwok, P. Teller, Ricardo Portillo, Seetharami R. Seelam
DOI: 10.1109/SBAC-PAD.2011.26
Abstract: Providing differentiated service in a consolidated storage environment is a challenging task. To address this problem, we introduce FAIRIO, a cycle-based I/O scheduling algorithm that provides differentiated service to workloads concurrently accessing a consolidated RAID storage system. FAIRIO enforces proportional sharing of I/O service through fair scheduling of disk time. During each cycle of the algorithm, I/O requests are scheduled according to workload weights and disk-time utilization history. Experiments, driven by the I/O request streams of real and synthetic I/O benchmarks and run on a modified version of DiskSim, provide evidence of FAIRIO's effectiveness and demonstrate that fair scheduling of disk time is key to achieving differentiated service. In particular, the experimental results show that, for a broad range of workload request types, sizes, and access characteristics, the algorithm provides differentiated storage throughput that is within 10% of being perfectly proportional to workload weights, and it achieves this with little or no degradation of aggregate throughput. The core design concepts of FAIRIO, including service-time allocation and history-driven compensation, can potentially be used to design I/O scheduling algorithms that provide differentiated service in storage systems comprising RAIDs, multiple RAIDs, SANs, and hypervisors for clouds.
Citations: 7
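The cycle-based, history-compensating allocation idea can be illustrated with a toy sketch (our own hypothetical simplification in Python; `allocate_cycle` and its deficit model are illustrative, not the actual FAIRIO algorithm):

```python
def allocate_cycle(weights, deficit, cycle_time=100.0):
    """Split one scheduling cycle's disk time in proportion to workload
    weights, boosting workloads whose past service lagged their target
    (a crude stand-in for history-driven compensation)."""
    total_w = sum(weights.values())
    # weight-proportional fair share of this cycle
    base = {w: cycle_time * weights[w] / total_w for w in weights}
    # add each workload's accumulated service deficit, then renormalize
    boosted = {w: max(base[w] + deficit.get(w, 0.0), 0.0) for w in weights}
    scale = cycle_time / sum(boosted.values())
    return {w: share * scale for w, share in boosted.items()}
```

With no deficits this degenerates to plain weighted proportional sharing; a lagging workload temporarily receives more than its weight-proportional share.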
Speeding Up Learning in Real-Time Search through Parallel Computing
Vinícius Marques, L. Chaimowicz, R. Ferreira
DOI: 10.1109/SBAC-PAD.2011.30
Abstract: Real-time search algorithms solve the path-planning problem regardless of the size and complexity of the map and the massive presence of entities in the same environment. In such methods, the learning step aims to avoid local minima and improve the results of future searches, ensuring convergence to the optimal path when the same planning task is solved repeatedly. However, because real-time constraints limit the search to a small area, reaching convergence is a lengthy process. In this work, we present a parallelization strategy that aims to reduce the time to convergence while maintaining the real-time properties of the search. The technique consists of running auxiliary searches that are free of the real-time restrictions imposed on the main search, with all searches sharing the same learning. The empirical evaluation shows that even with the additional cost of coordinating the auxiliary searches, the reduction in time to convergence is significant, with gains ranging from searches in environments with few local minima to larger searches on complex maps, where the improvement is even greater.
Citations: 1
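A single step of Learning Real-Time A* (LRTA*), the classic real-time search scheme whose learning and convergence behaviour such work targets, looks like this (minimal Python sketch; the function names are ours, and this is the textbook update rather than the paper's parallel variant):

```python
def lrta_star_step(state, succ, h, goal):
    """Update the current state's heuristic from its best successor
    (the learning step), then move greedily toward that successor.
    succ(state) yields (cost, next_state) pairs; h is a mutable dict."""
    if state == goal:
        return state
    best_cost, best_next = min((c + h.get(s, 0), s) for c, s in succ(state))
    h[state] = max(h.get(state, 0), best_cost)  # learning: raise h toward truth
    return best_next
```

Repeated over episodes, these updates are what drive convergence to the optimal path; sharing the `h` table among auxiliary searches is the essence of the parallelization described above.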
Why Online Dynamic Mesh Refinement is Better for Parallel Climatological Models
C. Schepke, N. Maillard, Jörg Schneider, Hans-Ulrich Heiß
DOI: 10.1109/SBAC-PAD.2011.14
Abstract: The forecast precision of climatological models is limited by the computing power and time available for execution. As more and faster processors are used in the computation, the resolution of the mesh adopted to represent the Earth's atmosphere can be increased, making the numerical forecast more accurate and revealing local phenomena. However, a mesh resolution fine enough to include local phenomena in a global atmosphere integration is still not feasible. To overcome this limitation, different mesh refinement levels can be used simultaneously for different areas. In this context, this paper evaluates how mesh refinement at run time can improve performance for climatological models. To support this analysis, we developed an online dynamic mesh refinement that increases mesh resolution in parts of a parallel distributed model when special atmospheric conditions are registered during execution. The results show that the parallel execution of this improvement provides better mesh resolution without a significant increase in execution time.
Citations: 2
Improving the Accuracy of High Performance BLAS Implementations Using Adaptive Blocked Algorithms
M. Badin, P. D'Alberto, L. Bic, M. Dillencourt, A. Nicolau
DOI: 10.1109/SBAC-PAD.2011.21
Abstract: Matrix multiply is ubiquitous in scientific computing, and considerable effort has been spent on improving its performance. Once methods that make efficient use of the processor have been exhausted, methods that use fewer operations than the canonical matrix multiply must be explored. Combining the two approaches yields a hybrid matrix multiply algorithm. Hybrid algorithms tend to be less accurate than the canonical implementation, leaving room for improvement. There are well-known techniques for improving accuracy, but they tend to be slow, and it is not immediately obvious how best to apply them to hybrid algorithms without lowering performance. Previous attempts have focused on the bottom of the hybrid algorithm, modifying the high-performance matrix multiply implementation. In contrast, the top-down approach presented here requires modifying neither the high-performance matrix multiply implementation at the bottom nor the fast asymptotic matrix multiply algorithm at the top. The three-level hybrid algorithm presented here not only achieves up to 10% better performance than the fastest high-performance matrix multiply but is also more accurate.
Citations: 4
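The top-down hybrid structure, a fast asymptotic algorithm on top delegating block products to a high-performance kernel at the bottom, can be sketched with one level of Strassen's algorithm (illustrative NumPy code for the general technique; this is not the authors' three-level algorithm):

```python
import numpy as np

def strassen_1level(A, B, kernel=np.dot):
    """One recursion level of Strassen on even-sized square matrices:
    seven half-size products (instead of eight) are delegated to a fast
    multiply kernel, mirroring the hybrid top/bottom split."""
    n = A.shape[0] // 2
    A11, A12, A21, A22 = A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]
    B11, B12, B21, B22 = B[:n, :n], B[:n, n:], B[n:, :n], B[n:, n:]
    M1 = kernel(A11 + A22, B11 + B22)
    M2 = kernel(A21 + A22, B11)
    M3 = kernel(A11, B12 - B22)
    M4 = kernel(A22, B21 - B11)
    M5 = kernel(A11 + A12, B22)
    M6 = kernel(A21 - A11, B11 + B12)
    M7 = kernel(A12 - A22, B21 + B22)
    C = np.empty_like(A)
    C[:n, :n] = M1 + M4 - M5 + M7
    C[:n, n:] = M3 + M5
    C[n:, :n] = M2 + M4
    C[n:, n:] = M1 - M2 + M3 + M6
    return C
```

The extra additions and subtractions are the source of the accuracy loss the paper addresses: they accumulate rounding error that the canonical multiply avoids.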
A New Parallel Schema for Branch-and-Bound Algorithms Using GPGPU
T. Carneiro, A. Muritiba, Marcos Negreiros, G. Campos
DOI: 10.1109/SBAC-PAD.2011.20
Abstract: This work presents a new parallel procedure for processing combinatorial branch-and-bound (B&B) algorithms using GPGPU. In our schema, we dispatch a number of threads that intelligently exploit the massively parallel processors of NVIDIA GeForce graphics units. The strategy is to sequentially build a series of initial searches that map a subspace of the B&B tree, then start a limited number of threads once a specific level of the tree has been reached; the search is then processed massively by DFS. The whole subspace is optimized according to the memory and the limits on threads and blocks available on the GPU. We compare our results with OpenMP and serial versions of the same search schema, using explicit enumeration of all possible solutions, on instances of the Asymmetric Travelling Salesman Problem, and show the clear superiority of our GPGPU-based method.
Citations: 34
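The serial DFS branch-and-bound that such a schema parallelizes can be sketched as follows (toy Python version for the ATSP; the bound used for pruning is simply the partial tour cost, far weaker than what a real solver would use):

```python
def atsp_bb(cost):
    """Depth-first branch-and-bound for the asymmetric TSP on a cost
    matrix: extend partial tours from city 0, pruning any branch whose
    accumulated cost already meets or exceeds the best complete tour."""
    n = len(cost)
    best = [float("inf")]  # incumbent tour cost, shared across the DFS

    def dfs(city, visited, acc):
        if acc >= best[0]:
            return  # bound: this subtree cannot improve the incumbent
        if len(visited) == n:
            best[0] = min(best[0], acc + cost[city][0])  # close the tour
            return
        for nxt in range(n):
            if nxt not in visited:
                dfs(nxt, visited | {nxt}, acc + cost[city][nxt])

    dfs(0, {0}, 0)
    return best[0]
```

In the GPU schema described above, the upper levels of this recursion tree are enumerated on the host and the resulting subtrees are handed to device threads for massive DFS.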
Structure-Constrained Microcode Compression
E. Borin, G. Araújo, M. Breternitz, Youfeng Wu
DOI: 10.1109/SBAC-PAD.2011.32
Abstract: Microcode enables programmability of (micro)architectural structures to enhance functionality and to apply patches to an existing design. As more features are added to a CPU core, the area and power costs associated with microcode increase. One solution to the microcode size issue is to store the microcode in compressed form and decompress it during execution. Furthermore, reusing a single hardware building-block layout to implement the different dictionaries in two-level microcode compression reduces the cost and design time of the decompression engine. However, this reuse imposes structural constraints on the compression algorithm, and existing algorithms may yield poor compression. In this paper, we develop the SC2 algorithm, which incorporates the structural constraint into its objective function and reduces area expansion when reusing hardware building blocks to implement different dictionaries. Our experimental results show that SC2 produces similarly sized dictionaries and achieves a compression ratio similar to that of the unconstrained algorithm.
Citations: 1
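The dictionary idea behind microcode compression can be shown in miniature (our own Python sketch; it models neither the two-level split nor the structural constraint that SC2 targets):

```python
def dict_compress(words):
    """Replace each microword by an index into a dictionary of unique
    words; the compressed image is the dictionary plus a stream of
    narrow indices, which is smaller whenever words repeat."""
    dictionary, index, stream = [], {}, []
    for w in words:
        if w not in index:
            index[w] = len(dictionary)  # first occurrence: new dictionary entry
            dictionary.append(w)
        stream.append(index[w])
    return dictionary, stream
```

At decompression time the engine simply looks each index up in the dictionary ROM, which is why the dictionary's hardware layout constrains the compression algorithm.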
Classification and Elimination of Conflicts in Hardware Transactional Memory Systems
M. Waliullah, P. Stenström
DOI: 10.1109/SBAC-PAD.2011.18
Abstract: This paper analyzes the sources of performance loss in hardware transactional memory (HTM) and investigates techniques to reduce them. It dissects the root causes of data conflicts in HTM systems into four classes: true sharing, false sharing, silent store, and write-write conflicts. These conflicts can cause performance and energy losses due to aborts and extra communication. To quantify the losses, the paper first proposes the 5C cache-miss classification model, which extends the well-established 4C model with a new class of cache misses known as contamination misses. The paper also contributes two techniques for removing data conflicts: one for false sharing conflicts and another for silent store conflicts. In addition, it revisits and adapts a technique that reduces losses due to both true and false conflicts. All of the proposed techniques can be accommodated in a lazy-versioning, lazy-conflict-resolution HTM built on top of a MESI cache-coherence infrastructure with quite modest extensions. Their ability to reduce these losses is quantitatively established, individually as well as in combination, and performance improves substantially.
Citations: 11
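Three of the four conflict classes can be distinguished mechanically from the addresses and values involved; a toy classifier for a single writer/reader pair (our own simplification, omitting write-write conflicts and any real coherence protocol):

```python
def classify_conflict(write_addr, read_addr, old_val, new_val, line_size=64):
    """Classify one committed write against a concurrent read: a write
    that changes nothing is a silent store, same-word accesses truly
    conflict, and same-line/different-word accesses conflict only
    because cache-line granularity makes them look shared."""
    if new_val == old_val:
        return "silent-store"
    if write_addr == read_addr:
        return "true-sharing"
    if write_addr // line_size == read_addr // line_size:
        return "false-sharing"
    return "no-conflict"
```

Only the true-sharing case is a semantic conflict; the other two abort transactions needlessly, which is why the paper targets them for elimination.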