The Role of ViT Design and Training in Robustness to Common Corruptions
Authors: Rui Tian; Zuxuan Wu; Qi Dai; Micah Goldblum; Han Hu; Yu-Gang Jiang
Journal: IEEE Transactions on Multimedia, vol. 27, pp. 1374-1385
DOI: 10.1109/TMM.2024.3521721
Publication date: 2024-12-23
URL: https://ieeexplore.ieee.org/document/10812859/
Citations: 0
Abstract
Vision transformer (ViT) variants have made rapid advances on a variety of computer vision tasks. However, their performance on corrupted inputs, which are inevitable in realistic use cases due to variations in lighting and weather, has not been explored comprehensively. In this paper, we probe the robustness gap among ViT variants and ask how these modern architectural developments affect performance under common types of corruption. Through extensive and rigorous benchmarking, we demonstrate that simple architectural designs such as overlapping patch embedding and convolutional feed-forward networks can promote the robustness of ViTs. Moreover, since the de facto training recipe for ViTs relies heavily on data augmentation, it is worth investigating exactly which augmentation strategies make ViTs more robust. We survey the efficacy of previous methods and verify that adversarial noise training is powerful. In addition, we introduce a novel conditional method that generates dynamic augmentation parameters conditioned on input images, which offers state-of-the-art robustness to common corruptions.
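The abstract's central idea, generating augmentation parameters conditioned on each input image rather than sampling them globally, can be illustrated with a minimal NumPy sketch. This is not the paper's actual module: the names (`policy_weights`, `predict_noise_std`) and the choice of per-image statistics as conditioning features are illustrative assumptions; the paper's method would learn the conditioning network jointly with the model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "policy head": a linear map from simple per-image statistics
# to an augmentation magnitude, so each image gets its own noise strength.
policy_weights = rng.normal(scale=0.1, size=(2,))  # maps [mean, std] -> logit

def predict_noise_std(image, max_std=0.2):
    """Predict a per-image Gaussian-noise magnitude from image statistics."""
    feats = np.array([image.mean(), image.std()])
    logit = feats @ policy_weights
    # Sigmoid keeps the predicted std inside (0, max_std).
    return max_std / (1.0 + np.exp(-logit))

def augment(image):
    """Add Gaussian noise whose strength is conditioned on the input image."""
    std = predict_noise_std(image)
    noisy = image + rng.normal(scale=std, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)

img = rng.random((32, 32, 3))  # toy image in [0, 1]
aug = augment(img)
```

In a real training loop, the policy head's parameters would be optimized (e.g. adversarially, in the spirit of the adversarial noise training the paper highlights) so that harder images receive stronger augmentation.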
Journal description:
The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.