使用长SIMD指令的高效直接卷积

Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming Pub Date : 2023-02-25 DOI:10.1145/3572848.3577435

Alexandre de Limas Santana, Adrià Armejach, Marc Casas

{"title":"使用长SIMD指令的高效直接卷积","authors":"Alexandre de Limas Santana, Adrià Armejach, Marc Casas","doi":"10.1145/3572848.3577435","DOIUrl":null,"url":null,"abstract":"This paper demonstrates that state-of-the-art proposals to compute convolutions on architectures with CPUs supporting SIMD instructions deliver poor performance for long SIMD lengths due to frequent cache conflict misses. We first discuss how to adapt the state-of-the-art SIMD direct convolution to architectures using long SIMD instructions and analyze the implications of increasing the SIMD length on the algorithm formulation. Next, we propose two new algorithmic approaches: the Bounded Direct Convolution (BDC), which adapts the amount of computation exposed to mitigate cache misses, and the Multi-Block Direct Convolution (MBDC), which redefines the activation memory layout to improve the memory access pattern. We evaluate BDC, MBDC, the state-of-the-art technique, and a proprietary library on an architecture featuring CPUs with 16,384-bit SIMD registers using ResNet convolutions. Our results show that BDC and MBDC achieve respective speed-ups of 1.44× and 1.28× compared to the state-of-the-art technique for ResNet-101, and 1.83× and 1.63× compared to the proprietary library.","PeriodicalId":233744,"journal":{"name":"Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming","volume":"145 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":"{\"title\":\"Efficient Direct Convolution Using Long SIMD Instructions\",\"authors\":\"Alexandre de Limas Santana, Adrià Armejach, Marc Casas\",\"doi\":\"10.1145/3572848.3577435\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This paper demonstrates that state-of-the-art proposals to compute convolutions on architectures with CPUs supporting SIMD instructions deliver poor performance for long SIMD lengths due to frequent cache conflict misses. We first discuss how to adapt the state-of-the-art SIMD direct convolution to architectures using long SIMD instructions and analyze the implications of increasing the SIMD length on the algorithm formulation. Next, we propose two new algorithmic approaches: the Bounded Direct Convolution (BDC), which adapts the amount of computation exposed to mitigate cache misses, and the Multi-Block Direct Convolution (MBDC), which redefines the activation memory layout to improve the memory access pattern. We evaluate BDC, MBDC, the state-of-the-art technique, and a proprietary library on an architecture featuring CPUs with 16,384-bit SIMD registers using ResNet convolutions. Our results show that BDC and MBDC achieve respective speed-ups of 1.44× and 1.28× compared to the state-of-the-art technique for ResNet-101, and 1.83× and 1.63× compared to the proprietary library.\",\"PeriodicalId\":233744,\"journal\":{\"name\":\"Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming\",\"volume\":\"145 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-02-25\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"3\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3572848.3577435\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3572848.3577435","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 3

摘要

本文证明了在支持SIMD指令的cpu架构上计算卷积的最新建议，由于频繁的缓存冲突失败，对于长SIMD长度的cpu来说，性能很差。我们首先讨论了如何使最先进的SIMD直接卷积适应使用长SIMD指令的体系结构，并分析了增加SIMD长度对算法公式的影响。接下来，我们提出了两种新的算法方法:有界直接卷积(BDC)和多块直接卷积(MBDC)，前者调整了暴露的计算量以减轻缓存丢失，后者重新定义了激活存储器布局以改进存储器访问模式。我们在使用ResNet卷积的具有16,384位SIMD寄存器的cpu架构上评估了BDC, MBDC，最先进的技术和专有库。我们的结果表明，与最先进的ResNet-101技术相比，BDC和MBDC分别实现了1.44倍和1.28倍的加速，与专有库相比，分别实现了1.83倍和1.63倍的加速。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Efficient Direct Convolution Using Long SIMD Instructions

This paper demonstrates that state-of-the-art proposals to compute convolutions on architectures with CPUs supporting SIMD instructions deliver poor performance for long SIMD lengths due to frequent cache conflict misses. We first discuss how to adapt the state-of-the-art SIMD direct convolution to architectures using long SIMD instructions and analyze the implications of increasing the SIMD length on the algorithm formulation. Next, we propose two new algorithmic approaches: the Bounded Direct Convolution (BDC), which adapts the amount of computation exposed to mitigate cache misses, and the Multi-Block Direct Convolution (MBDC), which redefines the activation memory layout to improve the memory access pattern. We evaluate BDC, MBDC, the state-of-the-art technique, and a proprietary library on an architecture featuring CPUs with 16,384-bit SIMD registers using ResNet convolutions. Our results show that BDC and MBDC achieve respective speed-ups of 1.44× and 1.28× compared to the state-of-the-art technique for ResNet-101, and 1.83× and 1.63× compared to the proprietary library.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming

自引率

0.00%

发文量