AT-ViT: Area-Targeted Multi-View Vision Transformer With Cross-Attention and Multi-Scale Patching for Plant Trait Recognition in Herbarium Images

IF 1.3; Region 4 (Computer Science); JCR Q4 (Computer Science, Artificial Intelligence)

IET Computer Vision Pub Date : 2026-03-14 DOI:10.1049/cvi2.70059

Amani Sedrat, Takieddine Chehhat, Youcef Sklab, Hanane Ariouat, Abderrazak Sebaa, Eric Chenin, Jean-Daniel Zucker, Edi Prifti



Abstract

Automated plant trait recognition from herbarium images is essential for plant sciences, yet it remains challenging because background elements (e.g., textual labels, mounting artefacts and colour charts) can induce shortcut learning, leading models to rely on spurious non-plant cues rather than plant morphology. This bias degrades both generalisation and interpretability. In this paper, we introduce AT-ViT, a dual-branch vision transformer that jointly encodes raw herbarium scans and their segmentation-derived counterparts via a multi-scale, multi-view cross-attention fusion scheme. AT-ViT further incorporates a mask-guided patch weighting mechanism that amplifies plant-relevant regions and attenuates background-driven features. Because the model learns from the original scans while the segmentation masks guide this reweighting, it is encouraged to focus on plant organs and to learn plant-centric representations more effectively. Across multiple trait classification tasks (e.g., leaf base shape, thorns), AT-ViT delivers consistent accuracy gains, improves attention localisation on plant regions and exhibits increased robustness under synthetic background perturbations. Specifically, relative to CrossViT, AT-ViT substantially improves spatial attention grounding, boosting plant-region alignment (Avg IoU_p: +15.66 to +18.03 pp) while reducing background overlap (Avg IoU_b: −27.92 to −31.02 pp). It also remains markedly more robust to background perturbations, outperforming ResNet101 by up to +32.32 accuracy points and CrossViT by up to +5.07 points under background-noise conditions.
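The abstract does not give the weighting formula or the exact definition of IoU_p and IoU_b, so the following is only a minimal NumPy sketch of the general idea, not the authors' implementation. It assumes non-overlapping square patches, a binary plant mask, and a hypothetical linear rule `1 + alpha * (2f - 1)` that scales a patch token up when its plant fraction `f` is high and down when it is low. The function names, `alpha`, and the attention threshold are all illustrative assumptions.

```python
import numpy as np

def patch_plant_fraction(mask, patch):
    """Fraction of plant pixels inside each non-overlapping patch x patch tile."""
    h, w = mask.shape
    gh, gw = h // patch, w // patch
    tiles = mask[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch)
    return tiles.mean(axis=(1, 3))  # shape (gh, gw), values in [0, 1]

def reweight_tokens(tokens, fraction, alpha=0.5):
    """Hypothetical rule: plant patches scaled by up to 1 + alpha,
    background patches scaled down to 1 - alpha."""
    weights = 1.0 + alpha * (2.0 * fraction.reshape(-1, 1) - 1.0)
    return tokens * weights

def attention_iou(attn, region, threshold=0.5):
    """IoU between a thresholded attention map and a binary region.
    Passing the plant mask gives a plant-overlap score; passing its
    complement gives a background-overlap score (assumed definitions)."""
    hot = attn >= threshold
    region = region.astype(bool)
    union = np.logical_or(hot, region).sum()
    return float(np.logical_and(hot, region).sum() / union) if union else 0.0

# Toy example: 8x8 mask with the plant in the left half, 4x4 patches.
mask = np.zeros((8, 8), dtype=np.float32)
mask[:, :4] = 1.0
fraction = patch_plant_fraction(mask, patch=4)  # 2x2 grid of plant fractions
tokens = np.ones((4, 3), dtype=np.float32)      # 4 patch tokens, embedding dim 3
weighted = reweight_tokens(tokens, fraction, alpha=0.5)
```

In this toy case the two left-hand patches are fully plant and the two right-hand patches are fully background, so their tokens are scaled by 1.5 and 0.5 respectively. A learned or softmax-normalised weighting would be an equally plausible reading of the abstract.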



Source journal: IET Computer Vision (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 3.30

Self-citation rate: 11.80%

Review time: 3.4 months

Journal description: IET Computer Vision seeks original research papers in a wide range of areas of computer vision. The vision of the journal is to publish the highest quality research work that is relevant and topical to the field, without forgetting those works that aim to introduce new horizons and set the agenda for future avenues of research in computer vision.

IET Computer Vision welcomes submissions on the following topics: Biologically and perceptually motivated approaches to low level vision (feature detection, etc.); Perceptual grouping and organisation; Representation, analysis and matching of 2D and 3D shape; Shape-from-X; Object recognition; Image understanding; Learning with visual inputs; Motion analysis and object tracking; Multiview scene analysis; Cognitive approaches in low, mid and high level vision; Control in visual systems; Colour, reflectance and light; Statistical and probabilistic models; Face and gesture; Surveillance; Biometrics and security; Robotics; Vehicle guidance; Automatic model acquisition; Medical image analysis and understanding; Aerial scene analysis and remote sensing; Deep learning models in computer vision.

Both methodological and applications orientated papers are welcome. Manuscripts submitted are expected to include a detailed and analytical review of the literature, a state-of-the-art exposition of the original proposed research and its methodology, a thorough experimental evaluation and, last but not least, a comparative evaluation against relevant state-of-the-art methods. Submissions not abiding by these minimum requirements may be returned to authors without being sent to review.

Special Issues, Current Call for Papers: Computer Vision for Smart Cameras and Camera Networks - https://digital-library.theiet.org/files/IET_CVI_SC.pdf; Computer Vision for the Creative Industries - https://digital-library.theiet.org/files/IET_CVI_CVCI.pdf