High Performance Im2win and Direct Convolutions using Three Tensor Layouts on SIMD Architectures

Xiang Fu, Xinpeng Zhang, Jixiang Ma, Peng Zhao, Shuai Lu, Xu T. Liu

arXiv:2408.00278 · arXiv - CS - Neural and Evolutionary Computing · 2024-08-01
Convolution is the core component of deep neural networks, and it is computationally intensive and time-consuming. Tensor data layouts significantly impact convolution operations in terms of memory access and computational efficiency. Yet there is still no comprehensive performance characterization of data layouts on SIMD architectures with respect to convolution methods.

This paper proposes three novel data layouts for im2win convolution: NHWC, CHWN, and CHWN8, and introduces a set of general optimization techniques for both direct and im2win convolutions.
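To make the layout names concrete, the sketch below (ours, not the paper's code; the offset_* helpers are hypothetical) shows the flat memory offset of element (n, c, h, w) in an N x C x H x W tensor under each layout, assuming row-major storage with the rightmost letter varying fastest. The CHWN8 variant is our assumed reading of CHWN with the batch dimension blocked in groups of 8 to match an 8-lane SIMD unit; the paper's exact definition may differ.

    def offset_nchw(n, c, h, w, N, C, H, W):
        # NCHW: w is innermost, so pixels of one channel row are contiguous.
        return ((n * C + c) * H + h) * W + w

    def offset_nhwc(n, c, h, w, N, C, H, W):
        # NHWC: c is innermost, so all channels of one pixel are contiguous.
        return ((n * H + h) * W + w) * C + c

    def offset_chwn(n, c, h, w, N, C, H, W):
        # CHWN: n is innermost, so the same pixel across images is contiguous.
        return ((c * H + h) * W + w) * N + n

    def offset_chwn8(n, c, h, w, N, C, H, W):
        # Assumed CHWN8: batch blocked by 8, so 8 consecutive images sit in
        # one SIMD-width chunk (N is taken to be a multiple of 8 here).
        nb, nl = divmod(n, 8)
        return (((nb * C + c) * H + h) * W + w) * 8 + nl

The SIMD appeal follows from the innermost dimension: under NHWC one vector load at a pixel fetches 8 consecutive channels, while under CHWN(8) it fetches the same pixel across 8 consecutive images.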
We compare the optimized im2win convolution with the direct convolution and with PyTorch's im2col-based convolution across the aforementioned layouts on SIMD machines. The experiments
demonstrated that the im2win convolution with the new NHWC layout achieved up to a 355% speedup over the NCHW layout. Our optimizations also significantly improve the performance of both im2win and direct convolutions.
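For context on the peak-efficiency figures that follow, the theoretical peak of a SIMD machine is conventionally computed as cores x clock x SIMD lanes x FLOPs per FMA x FMA units per core. The numbers below describe a hypothetical AVX2 machine, not the paper's test platform:

    # Theoretical FP32 peak of a hypothetical SIMD machine -- every
    # parameter here is an assumption, not the paper's hardware.
    cores = 8            # physical cores
    ghz = 3.0            # sustained clock, GHz
    lanes = 8            # FP32 lanes in a 256-bit AVX2 vector
    fma_flops = 2        # one FMA = multiply + add
    fma_units = 2        # FMA ports per core

    peak_gflops = cores * ghz * lanes * fma_flops * fma_units
    print(peak_gflops)   # 768.0 GFLOPS; 95% of peak would be ~730 GFLOPS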
Our optimized im2win and direct convolutions achieved up to 95% and 94% of the machine's theoretical peak performance, respectively.
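As a point of reference for the im2col baseline mentioned above, PyTorch's im2col-based convolution can be reproduced in spirit with torch.nn.functional.unfold followed by one large matrix multiply. A minimal sketch, with toy shapes of our choosing:

    import torch
    import torch.nn.functional as F

    N, C, H, W = 1, 3, 32, 32      # input in NCHW layout
    M, K = 8, 3                    # M output channels, K x K kernel
    x = torch.randn(N, C, H, W)
    w = torch.randn(M, C, K, K)

    # im2col: unfold copies every K x K patch into a column, turning
    # convolution into a single matrix multiply.
    cols = F.unfold(x, kernel_size=K)     # (N, C*K*K, L), L = out_H*out_W
    out = w.reshape(M, -1) @ cols         # (N, M, L) via broadcasted matmul
    out_h = H - K + 1
    out = out.reshape(N, M, out_h, out_h)

    # Matches PyTorch's own convolution on the same data.
    ref = F.conv2d(x, w)
    assert torch.allclose(out, ref, atol=1e-4)

The cost of this approach is visible in cols, which duplicates each input element up to K*K times; im2win is reported to reduce that duplication by arranging windows into a more compact, locality-friendly buffer, which is what the layout and optimization study above targets.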