A Comparative Survey of Vision Transformers for Feature Extraction in Texture Analysis.

Impact factor: 2.7. JCR quartile: Q3 (Imaging Science & Photographic Technology)

Journal of Imaging, 11(9). Publication date: 2025-09-05. DOI: 10.3390/jimaging11090304

Leonardo Scabini, Andre Sacilotti, Kallil M Zielinski, Lucas C Ribas, Bernard De Baets, Odemir M Bruno



Abstract

Texture, a significant visual attribute in images, plays an important role in many pattern recognition tasks. While Convolutional Neural Networks (CNNs) have been among the most effective methods for texture analysis, alternative architectures such as Vision Transformers (ViTs) have recently demonstrated superior performance on a range of visual recognition problems. However, the suitability of ViTs for texture recognition remains underexplored. In this work, we investigate the capabilities and limitations of ViTs for texture recognition by analyzing 25 different ViT variants as feature extractors and comparing them to CNN-based and hand-engineered approaches. Our evaluation encompasses both accuracy and efficiency, aiming to assess the trade-offs involved in applying ViTs to texture analysis. Our results indicate that ViTs generally outperform CNN-based and hand-engineered models, particularly when using strong pre-training and in-the-wild texture datasets. Notably, BeiTv2-B/16 achieves the highest average accuracy (85.7%), followed by ViT-B/16-DINO (84.1%) and Swin-B (80.8%), outperforming the ResNet50 baseline (75.5%) and the hand-engineered baseline (73.4%). As a lightweight alternative, EfficientFormer-L3 attains a competitive average accuracy of 78.9%. In terms of efficiency, although ViT-B and BeiT(v2) have a higher number of GFLOPs and parameters, they achieve significantly faster feature extraction on GPUs compared to ResNet50. These findings highlight the potential of ViTs as a powerful tool for texture analysis while also pointing to areas for future exploration, such as efficiency improvements and domain-specific adaptations.
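To make the reported comparison easier to read, the following minimal sketch computes each model's gain over the ResNet50 baseline in percentage points. It uses only the average accuracies quoted in the abstract; the names `reported_accuracy` and `gains_over` are illustrative assumptions and do not come from the paper.

```python
# Average accuracies (%) exactly as quoted in the abstract.
# The dictionary name and labels are illustrative, not from the paper's code.
reported_accuracy = {
    "BeiTv2-B/16": 85.7,
    "ViT-B/16-DINO": 84.1,
    "Swin-B": 80.8,
    "EfficientFormer-L3": 78.9,
    "ResNet50 (CNN baseline)": 75.5,
    "Hand-engineered baseline": 73.4,
}


def gains_over(baseline, accuracies):
    """Return each model's accuracy minus the baseline's, in percentage points."""
    ref = accuracies[baseline]
    return {
        name: round(acc - ref, 1)  # round away floating-point noise
        for name, acc in accuracies.items()
        if name != baseline
    }


gains = gains_over("ResNet50 (CNN baseline)", reported_accuracy)

# Print models from largest to smallest gain over the CNN baseline.
for name, delta in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:26s} {delta:+.1f} pp")
```

These differences are simple subtractions of the quoted averages; they say nothing about per-dataset variation, which the abstract does not report.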
