2011 Second Workshop on Architecture and Multi-Core Applications (WAMCA 2011): Latest Publications

Economical Two-fold Working Precision Matrix Multiplication on Consumer-Level CUDA GPUs
N. Fujimoto
{"title":"Economical Two-fold Working Precision Matrix Multiplication on Consumer-Level CUDA GPUs","authors":"N. Fujimoto","doi":"10.1109/WAMCA.2011.18","DOIUrl":"https://doi.org/10.1109/WAMCA.2011.18","url":null,"abstract":"A dot product faithfully rounded after being computed \"as if\" in $K$-fold working precision (K>=2) is known to be computable using only floating-point numbers defined in the IEEE 754 floating-point standard. This paper presents a CUDA GPU implementation of two-fold working precision matrix multiplication based on the dot product computation method. Experimental results on a GeForce GTX580 and a GTX560Ti show that the proposed implementation has 1.84 to 1.95 times higher GFLOPS performance in two-fold working precision compared to the performance of CUBLAS dgemm in double-precision on a Tesla C2070 high-end GPU. The proposed implementation can be used to obtain higher performance in pseudo double-precision with low cost consumer-level GPUs whose double-precision native performance is limited.","PeriodicalId":380586,"journal":{"name":"2011 Second Workshop on Architecture and Multi-Core Applications (wamca 2011)","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123108600","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
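The abstract above does not name the dot product method it builds on. As an illustration only, the following minimal sketch assumes the well-known Dot2 scheme of Ogita, Rump and Oishi, which uses error-free transformations to return a result as if computed in two-fold working precision. Python floats are IEEE 754 binary64, so the working precision here is double rather than the GPU precision used in the paper. The function names (`two_sum`, `split`, `two_product`, `dot2`, `naive_dot`) are hypothetical and chosen for this sketch.

```python
def two_sum(a, b):
    """Error-free addition: returns (s, e) with s = fl(a + b) and a + b = s + e."""
    s = a + b
    bb = s - a
    e = (a - (s - bb)) + (b - bb)
    return s, e


def split(a):
    """Dekker split of a binary64 value into high and low halves, a = hi + lo."""
    c = 134217729.0 * a  # factor is 2**27 + 1
    hi = c - (c - a)
    lo = a - hi
    return hi, lo


def two_product(a, b):
    """Error-free product: returns (p, e) with p = fl(a * b) and a * b = p + e.

    Assumes no overflow or underflow occurs in the intermediate products.
    """
    p = a * b
    a_hi, a_lo = split(a)
    b_hi, b_lo = split(b)
    e = a_lo * b_lo - (((p - a_hi * b_hi) - a_lo * b_hi) - a_hi * b_lo)
    return p, e


def dot2(x, y):
    """Dot product accumulated as if in two-fold working precision (Dot2 sketch)."""
    p, s = two_product(x[0], y[0])
    for xi, yi in zip(x[1:], y[1:]):
        h, r = two_product(xi, yi)
        p, q = two_sum(p, h)
        s += q + r  # accumulate the rounding errors separately
    return p + s


def naive_dot(x, y):
    """Plain left-to-right dot product in working precision, for comparison."""
    acc = 0.0
    for xi, yi in zip(x, y):
        acc += xi * yi
    return acc


if __name__ == "__main__":
    x = [1e16, 1.0, -1e16]
    y = [1.0, 1.0, 1.0]
    print(naive_dot(x, y))  # the 1.0 term is lost to rounding
    print(dot2(x, y))       # the error terms recover it
```

A matrix multiplication built on this idea would apply `dot2` to each row and column pair; how the paper maps that onto CUDA threads is not stated in the abstract and is not modeled here.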
Large Scale Kronecker Product on Supercomputers
C. Tadonki
{"title":"Large Scale Kronecker Product on Supercomputers","authors":"C. Tadonki","doi":"10.1109/WAMCA.2011.10","DOIUrl":"https://doi.org/10.1109/WAMCA.2011.10","url":null,"abstract":"The Kronecker product, also called tensor product, is a fundamental matrix algebra operation, which is widely used as a natural formalism to express a convolution of many interactions or representations. Given a set of matrices, we need to multiply their Kronecker product by a vector. This operation is a critical kernel for iterative algorithms, and thus needs to be computed efficiently. In a previous work, we have proposed a cost optimal parallel algorithm for the problem, both in terms of floating point computation time and interprocessor communication steps. However, the lower bound of data transfers can only be achieved if we really consider (local) logarithmic broadcasts. In practice, we consider a communication loop instead. Thus, it becomes important to care about the real cost of each broadcast. As this local broadcast is performed simultaneously by each processor, the situation gets worse on a large number of processors (supercomputers). We address the problem in this paper in two points. On the one hand, we propose a way to build a virtual topology which has the lowest gap to the theoretical lower bound. On the other hand, we consider a hybrid implementation, which has the advantage of reducing the number of communicating nodes. We illustrate our work with some benchmarks on a large SMP 8-Core supercomputer.","PeriodicalId":380586,"journal":{"name":"2011 Second Workshop on Architecture and Multi-Core Applications (wamca 2011)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130481337","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
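The kernel described in the abstract above, multiplying a Kronecker product by a vector, can be done without ever forming the product. The sketch below is not the paper's parallel algorithm: it is a sequential illustration, limited to two factors, of the standard identity that (A ⊗ B) x equals the row-major flattening of A X Bᵀ, where X is x reshaped to n × q. The function names (`kron`, `matvec`, `kron_matvec`) are hypothetical, and matrices are plain lists of rows.

```python
def kron(A, B):
    """Explicit Kronecker product of two matrices (reference only, costs O(m*n*p*q))."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]


def matvec(M, v):
    """Dense matrix-vector product."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def kron_matvec(A, B, x):
    """Compute (A kron B) @ x without building the Kronecker product.

    A is m x n, B is p x q, and x has length n * q.
    """
    m, n = len(A), len(A[0])
    p, q = len(B), len(B[0])
    if len(x) != n * q:
        raise ValueError("x must have length n * q")
    # Reshape x into an n x q matrix X, row-major.
    X = [x[j * q:(j + 1) * q] for j in range(n)]
    # T = X @ B^T: apply B to every row of X, giving an n x p matrix.
    T = [matvec(B, row) for row in X]
    # Y = A @ T, an m x p matrix.
    Y = [[sum(A[i][j] * T[j][k] for j in range(n)) for k in range(p)]
         for i in range(m)]
    # Flatten Y row-major to get the result vector of length m * p.
    return [value for row in Y for value in row]


if __name__ == "__main__":
    A = [[1, 2], [3, 4]]
    B = [[0, 1], [1, 0]]
    x = [1, 2, 3, 4]
    print(kron_matvec(A, B, x))
    print(matvec(kron(A, B), x))  # same vector, computed the expensive way
```

For more than two factors the same reshaping step is applied one factor at a time, which is the sequence of local operations that a distributed version would need to interleave with communication.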
Adaptive Power Optimization of On-chip SNUCA Cache on Tiled Chip Multicore Architecture Using Remap Policy
A. Mandke, B. Amrutur, Y. Srikant
{"title":"Adaptive Power Optimization of On-chip SNUCA Cache on Tiled Chip Multicore Architecture Using Remap Policy","authors":"A. Mandke, B. Amrutur, Y. Srikant","doi":"10.1109/WAMCA.2011.14","DOIUrl":"https://doi.org/10.1109/WAMCA.2011.14","url":null,"abstract":"Advances in technology have increased the number of cores and size of caches present on chip multicore platforms (CMPs). As a result, leakage power consumption of on-chip caches has already become a major power consuming component of the memory subsystem. We propose to reduce leakage power consumption in static nonuniform cache architecture (SNUCA) on a tiled CMP by dynamically varying the number of cache slices used and switching off unused cache slices. A cache slice in a tile includes all cache banks present in that tile. Switched-off cache slices are remapped considering the communication costs to reduce cache usage with minimal impact on execution time. This saves leakage power consumption in switched-off L2 cache slices. On an average, the remap policy achieves 41% and 49% higher EDP savings compared to static and dynamic NUCA (DNUCA) cache policies on a scalable tiled CMP, respectively.","PeriodicalId":380586,"journal":{"name":"2011 Second Workshop on Architecture and Multi-Core Applications (wamca 2011)","volume":"74 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122828560","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 16
Evaluating the Problem of Process Mapping on Network-on-Chip for Parallel Applications
Cíntia P. Avelar, Poliana A. C. Oliveira, H. Freitas, P. Navaux
{"title":"Evaluating the Problem of Process Mapping on Network-on-Chip for Parallel Applications","authors":"Cíntia P. Avelar, Poliana A. C. Oliveira, H. Freitas, P. Navaux","doi":"10.1109/WAMCA.2011.13","DOIUrl":"https://doi.org/10.1109/WAMCA.2011.13","url":null,"abstract":"Process mapping on Networks-on-Chip (NoC) is an important issue for the future many-core processors. Mapping strategies can increase performance and scalability by optimizing the communication cost. However, parallel applications have a large set of collective communication performing a high traffic on the Network-on-Chip. Therefore, our goal in this paper is to evaluate the problem related to the process mapping for parallel applications. The results show that for different mappings the performance is similar. The reason can be explained by collective communication due to the high number of packets exchanged by all routers. Our evaluation shows that topology and routing protocol can influence the process mapping. Consequently, for different NoC architectures different mapping strategies must be evaluated.","PeriodicalId":380586,"journal":{"name":"2011 Second Workshop on Architecture and Multi-Core Applications (wamca 2011)","volume":"305 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133716981","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
Trace-Based Visualization as a Tool to Understand Applications' I/O Performance in Multi-core Machines
Rodrigo Kassick, F. Boito, M. Diener, P. Navaux, Y. Denneulin, C. Schepke, N. Maillard, Carla Osthoff, P. Grunmann, P. Dias, J. Panetta
{"title":"Trace-Based Visualization as a Tool to Understand Applications' I/O Performance in Multi-core Machines","authors":"Rodrigo Kassick, F. Boito, M. Diener, P. Navaux, Y. Denneulin, C. Schepke, N. Maillard, Carla Osthoff, P. Grunmann, P. Dias, J. Panetta","doi":"10.1109/WAMCA.2011.12","DOIUrl":"https://doi.org/10.1109/WAMCA.2011.12","url":null,"abstract":"This paper presents the use of trace-based performance visualization of a large scale atmospheric model, the Ocean-Land-Atmosphere Model (OLAM). The trace was obtained with the libRastro library, and the visualization was done with Pajé. The use of visualization aimed to analyze OLAM's performance and to identify its bottlenecks. Especially, we are interested in the model's I/O operations, since it was proved to be the main issue for the model's performance. We show that most of the time spent in the output routine is spent in the close operation. With this information, we delayed this operation until the next output phase, obtaining improved I/O performance.","PeriodicalId":380586,"journal":{"name":"2011 Second Workshop on Architecture and Multi-Core Applications (wamca 2011)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123085245","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2