TTST: A Top-k Token Selective Transformer for Remote Sensing Image Super-Resolution

Yi Xiao; Qiangqiang Yuan; Kui Jiang; Jiang He; Chia-Wen Lin; Liangpei Zhang

Journal: IEEE Transactions on Image Processing, vol. 33, pp. 738-752
DOI: 10.1109/TIP.2023.3349004
Published: 2024-01-09
Impact factor: 13.7
Full text: https://ieeexplore.ieee.org/document/10387229/

Abstract

Transformer-based methods have demonstrated promising performance in image super-resolution tasks, owing to their long-range and global aggregation capability. However, existing Transformers face two critical challenges when applied to large-area earth observation scenes: (1) redundant token representations caused by the large number of irrelevant tokens; and (2) single-scale representations that ignore scale-correlation modeling of similar ground observation targets. To this end, this paper proposes to adaptively eliminate the interference of irrelevant tokens for a more compact self-attention calculation. Specifically, we devise a Residual Token Selective Group (RTSG) that grasps the most crucial tokens by dynamically selecting the top-$k$ keys, ranked by attention score, for each query. For better feature aggregation, a Multi-scale Feed-forward Layer (MFL) is developed to generate an enriched representation of multi-scale feature mixtures during the feed-forward process. Moreover, we also propose a Global Context Attention (GCA) to fully explore the most informative components, thus introducing more inductive bias into the RTSG for accurate reconstruction. Finally, multiple cascaded RTSGs form our Top-$k$ Token Selective Transformer (TTST), which achieves progressive representation. Extensive experiments on simulated and real-world remote sensing datasets demonstrate that TTST performs favorably against state-of-the-art CNN-based and Transformer-based methods, both qualitatively and quantitatively. In brief, TTST outperforms the state-of-the-art approach (HAT-L) in PSNR by 0.14 dB on average while requiring only 47.26% of its computational cost and 46.97% of its parameters. The code and pre-trained TTST will be available at https://github.com/XY-boy/TTST for validation.
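
To make the top-$k$ key selection concrete, the sketch below shows one common way token-selective self-attention can be written in PyTorch. It is a minimal illustration of the general idea only, not the paper's RTSG implementation: the function name, tensor shapes, single-head formulation, and thresholding strategy are all assumptions.

    # Minimal sketch of top-k token-selective self-attention (illustrative only,
    # not the TTST/RTSG implementation; shapes and names are assumptions).
    import torch
    import torch.nn.functional as F

    def topk_selective_attention(q, k, v, top_k):
        # q, k, v: (batch, num_tokens, dim). Keep only the top_k keys per query.
        d = q.size(-1)
        # Full attention scores: (batch, num_tokens, num_tokens)
        scores = q @ k.transpose(-2, -1) / d ** 0.5
        # For each query, the score of its k-th most relevant key.
        topk_vals, _ = scores.topk(top_k, dim=-1)
        threshold = topk_vals[..., -1:]                      # (batch, num_tokens, 1)
        # Mask out keys below the threshold so irrelevant tokens get zero weight.
        scores = scores.masked_fill(scores < threshold, float("-inf"))
        attn = F.softmax(scores, dim=-1)
        return attn @ v

    # Toy usage: 4 tokens of dimension 8, keep the 2 most relevant keys per query.
    q = torch.randn(1, 4, 8)
    k = torch.randn(1, 4, 8)
    v = torch.randn(1, 4, 8)
    out = topk_selective_attention(q, k, v, top_k=2)
    print(out.shape)  # torch.Size([1, 4, 8])

In this toy formulation, every key whose score falls below a query's k-th largest score is masked before the softmax, so each query attends to at most top_k tokens; the paper's RTSG additionally embeds this selection in a residual group structure described in the full text.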