Large Scale Kronecker Product on Supercomputers

2011 Second Workshop on Architecture and Multi-Core Applications (wamca 2011) Pub Date : 2011-10-26 DOI:10.1109/WAMCA.2011.10

C. Tadonki

Citations: 3

Abstract

Large Scale Kronecker Product on Supercomputers

The Kronecker product, also called the tensor product, is a fundamental matrix algebra operation that is widely used as a natural formalism to express a convolution of many interactions or representations. Given a set of matrices, we need to multiply their Kronecker product by a vector. This operation is a critical kernel for iterative algorithms and thus needs to be computed efficiently. In previous work, we proposed a parallel algorithm for the problem that is cost-optimal in terms of both floating-point computation time and interprocessor communication steps. However, the lower bound on data transfers can only be achieved if the (local) broadcasts are actually performed in a logarithmic number of steps. In practice, each broadcast is implemented as a communication loop instead, so the real cost of each broadcast matters. Because this local broadcast is performed simultaneously by every processor, the situation gets worse as the number of processors grows (supercomputers). This paper addresses the problem in two ways. On the one hand, we propose a way to build a virtual topology that has the smallest gap to the theoretical lower bound. On the other hand, we consider a hybrid implementation, which has the advantage of reducing the number of communicating nodes. We illustrate our work with some benchmarks on a large SMP 8-core supercomputer.
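To make the kernel concrete, the following is a minimal sequential sketch of multiplying a Kronecker product of several matrices by a vector without ever forming the full product matrix. It is not the parallel algorithm of the paper: it only illustrates the standard factor-by-factor evaluation that such algorithms build on. The function names (`kron`, `matvec`, `kron_matvec`) are hypothetical and chosen for this illustration; the dense `kron` is included only as a reference to check the result.

```python
from functools import reduce


def kron(a, b):
    """Dense Kronecker product of two matrices given as lists of rows (reference only)."""
    return [[x * y for x in row_a for y in row_b]
            for row_a in a for row_b in b]


def matvec(m, v):
    """Plain matrix-vector product."""
    return [sum(c * x for c, x in zip(row, v)) for row in m]


def kron_matvec(factors, x):
    """Compute (A1 (x) A2 (x) ... (x) Ak) x one factor at a time.

    The vector is viewed as a k-dimensional array in row-major order,
    and each factor is applied along its own dimension.
    """
    v = list(x)
    left = 1   # product of the row counts of the factors already applied
    right = 1  # product of the column counts of the factors not yet applied
    for a in factors:
        right *= len(a[0])
    if len(v) != right:
        raise ValueError("vector length must equal the product of column counts")

    for a in factors:
        rows, cols = len(a), len(a[0])
        right //= cols
        w = [0] * (left * rows * right)
        for l in range(left):
            for r in range(right):
                # Gather the strided slice of v that this factor acts on.
                col = [v[(l * cols + c) * right + r] for c in range(cols)]
                for i, row in enumerate(a):
                    w[(l * rows + i) * right + r] = sum(
                        p * q for p, q in zip(row, col))
        v = w
        left *= rows
    return v


if __name__ == "__main__":
    A = [[1, 2], [3, 4]]
    B = [[0, 1], [1, 0]]
    C = [[2, 0, 1], [1, 1, 0], [0, 3, 1]]
    x = list(range(1, 13))
    fast = kron_matvec([A, B, C], x)
    dense = matvec(reduce(kron, [A, B, C]), x)
    print(fast == dense)
```

For k square factors of order n, the dense product has n^(2k) entries, while the factor-by-factor evaluation performs on the order of k * n^(k+1) multiplications, which is why the full matrix is never built in practice.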
