Low-latency vision transformers via large-scale multi-head attention

IF 3.1, Region 3 (Physics and Astrophysics), JCR Q2 (PHYSICS, MULTIDISCIPLINARY)

Physica A: Statistical Mechanics and its Applications. Pub Date: 2025-07-22. DOI: 10.1016/j.physa.2025.130835

Ronit D. Gross, Tal Halevi, Ella Koresh, Yarden Tzach, Ido Kanter

Volume 675, Article 130835. Source: https://www.sciencedirect.com/science/article/pii/S037843712500487X

Citations: 0

Abstract



Low-latency vision transformers via large-scale multi-head attention

The emergence of spontaneous symmetry breaking among a few heads of multi-head attention (MHA) across transformer blocks in classification tasks was recently demonstrated through the quantification of single-nodal performance (SNP). This finding indicates that each head focuses its attention on a subset of labels through cooperation among its SNPs. This underlying learning mechanism is generalized to large-scale MHA (LS-MHA) using a single matrix value representing single-head performance (SHP), analogous to single-filter performance in convolutional neural networks (CNNs). The results indicate that each SHP matrix comprises multiple unit clusters such that each label is explicitly recognized by a few heads with negligible noise. This leads to an increased signal-to-noise ratio (SNR) along the transformer blocks, thereby improving classification accuracy. These features give rise to several distinct vision transformer (ViT) architectures that achieve the same accuracy but differ in their LS-MHA structures. As a result, their soft committee yields superior accuracy, an outcome not typically observed in CNNs, which rely on hundreds of filters. In addition, a significant reduction in latency is achieved without affecting the accuracy by replacing the initial transformer blocks with convolutional layers. This substitution accelerates early-stage learning, which is then improved by subsequent transformer layers. The extension of this learning mechanism to natural language processing tasks, based on quantitative differences between CNNs and ViT architectures, has the potential to yield new insights in deep learning. The findings are demonstrated using compact convolutional transformer architectures trained on the CIFAR-100 dataset.
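
The abstract does not spell out how the soft committee is formed. One common reading, assumed for this sketch, is that the softmax outputs of several independently trained models are averaged and the label with the highest mean probability is chosen. The function names `softmax` and `soft_committee` and the toy logits below are hypothetical illustrations and are not taken from the paper.

```python
import math


def softmax(logits):
    """Convert a list of raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_committee(logits_per_model):
    """Average per-model softmax outputs and return (label, mean_probs).

    Assumption: every model scores the same labels in the same order.
    """
    probs = [softmax(logits) for logits in logits_per_model]
    n_models = len(probs)
    n_labels = len(probs[0])
    mean_probs = [
        sum(p[k] for p in probs) / n_models for k in range(n_labels)
    ]
    label = max(range(n_labels), key=lambda k: mean_probs[k])
    return label, mean_probs


# Toy logits (made up) for three hypothetical models over four labels.
toy_logits = [
    [2.0, 1.9, 0.1, 0.0],  # model A narrowly prefers label 0
    [0.2, 2.5, 0.1, 0.0],  # model B strongly prefers label 1
    [1.5, 1.4, 0.3, 0.0],  # model C narrowly prefers label 0
]

label, mean_probs = soft_committee(toy_logits)
print(label, [round(p, 3) for p in mean_probs])
```

In this toy case the committee selects label 1 even though two of the three models individually prefer label 0, because averaging probabilities weighs each model's confidence rather than only counting its top choice. That is the property that separates a soft committee from a hard majority vote.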


Source journal

Physica A: Statistical Mechanics and its Applications (Physics: Multidisciplinary)

CiteScore: 7.20

Self-citation rate: 9.10%

Articles published: 852

Review time: 6.6 months

About the journal: Physica A: Statistical Mechanics and its Applications is recognized by the European Physical Society. Physica A publishes research in the field of statistical mechanics and its applications. Statistical mechanics sets out to explain the behaviour of macroscopic systems by studying the statistical properties of their microscopic constituents. Applications of the techniques of statistical mechanics are widespread and include: applications to physical systems such as solids, liquids and gases; applications to chemical and biological systems (colloids, interfaces, complex fluids, polymers and biopolymers, cell physics); and other interdisciplinary applications to, for instance, biological, economic and sociological systems.