Learning Cross-Attention Point Transformer With Global Porous Sampling

IEEE Transactions on Image Processing: a publication of the IEEE Signal Processing Society, vol. 33, pp. 6283-6297. Pub Date: 2024-11-01. DOI: 10.1109/TIP.2024.3486612

Yueqi Duan; Haowen Sun; Juncheng Yan; Jiwen Lu; Jie Zhou

Citations: 0

Abstract

In this paper, we propose CrossPoints, a point-based cross-attention transformer with a parametric Global Porous Sampling (GPS) strategy. In transformers, the attention module is crucial for capturing correlations between tokens. Most existing point-based transformers design multi-scale self-attention operations over point clouds down-sampled by the widely used Farthest Point Sampling (FPS) strategy. However, FPS only generates sub-clouds with holistic structures, and so fails to fully exploit the flexibility of points to generate diversified tokens for the attention module. To address this, we design a cross-attention module with parametric GPS and Complementary GPS (C-GPS) strategies that generate a series of diversified tokens through controllable parameters. We show that FPS is a degenerate case of GPS, and that the network learns richer relational information about structure and geometry when we perform consecutive cross-attention over the tokens generated from GPS and C-GPS sampled points. More specifically, we set evenly sampled points as queries and design our cross-attention layers with GPS and C-GPS sampled points as keys and values. To further improve token diversity, we design a deformable operation that adaptively adjusts the points according to the input. Extensive experiments on shape classification and indoor scene segmentation show promising improvements over recent point cloud transformers. We also conduct ablation studies to show the effectiveness of the proposed cross-attention module with the GPS strategy.
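To make the query/key/value arrangement in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It shows the standard greedy FPS procedure and a single-head cross-attention step in which one sampled subset provides the queries and a different subset provides the keys and values. The abstract does not define GPS or C-GPS, so the key/value subset here is a plain random subset used only as a stand-in; the function names, feature width, and weight matrices are all assumptions made for illustration.

```python
import numpy as np


def farthest_point_sampling(points, m, start=0):
    """Greedy FPS: return indices of m points chosen from an (n, 3) array."""
    selected = np.empty(m, dtype=np.int64)
    selected[0] = start
    # Squared distance from every point to its nearest already-selected point.
    min_dist = np.sum((points - points[start]) ** 2, axis=1)
    for i in range(1, m):
        nxt = int(np.argmax(min_dist))  # the point farthest from the current set
        selected[i] = nxt
        dist_to_new = np.sum((points - points[nxt]) ** 2, axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)
    return selected


def cross_attention(q_feat, kv_feat, w_q, w_k, w_v):
    """Single-head scaled dot-product cross-attention (no masking, no bias)."""
    q = q_feat @ w_q
    k = kv_feat @ w_k
    v = kv_feat @ w_v
    scores = q @ k.T / np.sqrt(q.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)  # row-wise softmax
    return weights @ v, weights


rng = np.random.default_rng(0)
cloud = rng.normal(size=(256, 3))  # toy point cloud

# Queries: evenly spread points, obtained here with FPS.
query_idx = farthest_point_sampling(cloud, 32)

# Keys/values: HYPOTHETICAL stand-in for GPS/C-GPS sampled points.
# A random subset is used because the abstract does not specify GPS.
kv_idx = rng.choice(cloud.shape[0], size=64, replace=False)

dim = 16  # assumed feature width
w_q, w_k, w_v = (rng.normal(size=(3, dim)) * 0.1 for _ in range(3))

out, attn = cross_attention(cloud[query_idx], cloud[kv_idx], w_q, w_k, w_v)
print(out.shape, attn.shape)
```

Each query token receives a weighted mixture of the key/value tokens, so changing how the key/value subset is sampled changes which relations the layer can attend to, which is the degree of freedom the abstract attributes to GPS and C-GPS.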
