2011 Second Workshop on Architecture and Multi-Core Applications (wamca 2011): Latest Publications

Economical Two-fold Working Precision Matrix Multiplication on Consumer-Level CUDA GPUs

2011 Second Workshop on Architecture and Multi-Core Applications (wamca 2011) Pub Date : 2011-10-26 DOI: 10.1109/WAMCA.2011.18

N. Fujimoto

Citations: 2

Large Scale Kronecker Product on Supercomputers

2011 Second Workshop on Architecture and Multi-Core Applications (wamca 2011) Pub Date : 2011-10-26 DOI: 10.1109/WAMCA.2011.10

C. Tadonki

Abstract: The Kronecker product, also called the tensor product, is a fundamental matrix algebra operation that is widely used as a natural formalism to express a convolution of many interactions or representations. Given a set of matrices, we need to multiply their Kronecker product by a vector. This operation is a critical kernel for iterative algorithms and thus needs to be computed efficiently. In previous work, we proposed a cost-optimal parallel algorithm for the problem, in terms of both floating-point computation time and interprocessor communication steps. However, the lower bound on data transfers can only be achieved if (local) logarithmic broadcasts are actually used. In practice, a communication loop is used instead, so the real cost of each broadcast becomes important. Because this local broadcast is performed simultaneously by every processor, the situation worsens on a large number of processors (supercomputers). We address the problem in this paper in two ways. On the one hand, we propose a way to build a virtual topology that has the smallest gap to the theoretical lower bound. On the other hand, we consider a hybrid implementation, which has the advantage of reducing the number of communicating nodes. We illustrate our work with some benchmarks on a large SMP 8-Core supercomputer.
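The core operation named in the abstract, multiplying the Kronecker product of several matrices by a vector, can be illustrated with a small sequential sketch. This is not the paper's parallel algorithm or its virtual topology; it only shows the standard idea of applying one factor at a time so that the full product matrix is never built. The function names `kron_matvec` and `kron` are hypothetical, and matrices are assumed to be plain lists of rows.

```python
def kron_matvec(factors, x):
    """Compute (A_1 (x) ... (x) A_k) x one factor at a time.

    Illustrative sequential sketch only: each factor is a list of rows,
    and the full Kronecker product matrix is never formed.
    """
    v = list(x)
    left = 1    # product of row counts of factors already applied
    right = 1   # product of column counts of factors not yet applied
    for a in factors:
        right *= len(a[0])
    if len(v) != right:
        raise ValueError("vector length must equal the product of column counts")

    for a in factors:
        rows, cols = len(a), len(a[0])
        right //= cols
        out = [0.0] * (left * rows * right)
        # View v as a (left, cols, right) tensor and multiply a along the middle axis.
        for l in range(left):
            for r in range(rows):
                dst = (l * rows + r) * right
                for c in range(cols):
                    coeff = a[r][c]
                    src = (l * cols + c) * right
                    for q in range(right):
                        out[dst + q] += coeff * v[src + q]
        v = out
        left *= rows
    return v


def kron(a, b):
    """Explicit Kronecker product of two matrices (reference for checking only)."""
    return [[x * y for x in ra for y in rb] for ra in a for rb in b]
```

For k square factors of size n, this sketch touches each of the n^k vector entries n times per factor, instead of forming an n^k by n^k matrix. The explicit `kron` helper is included only so the result can be compared against the naive computation on small inputs.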

Citations: 3

Adaptive Power Optimization of On-chip SNUCA Cache on Tiled Chip Multicore Architecture Using Remap Policy

2011 Second Workshop on Architecture and Multi-Core Applications (wamca 2011) Pub Date : 2011-10-26 DOI: 10.1109/WAMCA.2011.14

A. Mandke, B. Amrutur, Y. Srikant

Citations: 16

Evaluating the Problem of Process Mapping on Network-on-Chip for Parallel Applications

2011 Second Workshop on Architecture and Multi-Core Applications (wamca 2011) Pub Date : 2011-10-26 DOI: 10.1109/WAMCA.2011.13

Cíntia P. Avelar, Poliana A. C. Oliveira, H. Freitas, P. Navaux

Citations: 2

Trace-Based Visualization as a Tool to Understand Applications' I/O Performance in Multi-core Machines

2011 Second Workshop on Architecture and Multi-Core Applications (wamca 2011) Pub Date : 2011-10-01 DOI: 10.1109/WAMCA.2011.12

Rodrigo Kassick, F. Boito, M. Diener, P. Navaux, Y. Denneulin, C. Schepke, N. Maillard, Carla Osthoff, P. Grunmann, P. Dias, J. Panetta

Citations: 2