Qin Liu, Zhenlin Xu, Gedas Bertasius, Marc Niethammer
DOI: 10.1109/iccv51070.2023.02037
Published in: Proceedings. IEEE International Conference on Computer Vision, vol. 2023, pp. 22233-22243, October 2023
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11378330/pdf/
Cited by: 0
Abstract
SimpleClick: Interactive Image Segmentation with Simple Vision Transformers.
Click-based interactive image segmentation aims to extract objects with a limited number of user clicks. A hierarchical backbone is the de facto architecture for current methods. Recently, the plain, non-hierarchical Vision Transformer (ViT) has emerged as a competitive backbone for dense prediction tasks. This design allows the original ViT to serve as a foundation model that can be fine-tuned for downstream tasks without redesigning a hierarchical backbone for pretraining. Although this design is simple and has been proven effective, it has not yet been explored for interactive image segmentation. To fill this gap, we propose SimpleClick, the first interactive segmentation method that leverages a plain backbone. On top of the plain backbone, we introduce a symmetric patch embedding layer that encodes clicks into the backbone with only minor modifications to the backbone itself. With the plain backbone pretrained as a masked autoencoder (MAE), SimpleClick achieves state-of-the-art performance. Remarkably, our method achieves 4.15 NoC@90 on SBD, improving 21.8% over the previous best result. Extensive evaluation on medical images demonstrates the generalizability of our method. We provide a detailed computational analysis, highlighting the suitability of our method as a practical annotation tool.
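To make the "symmetric patch embedding" idea concrete: a minimal NumPy sketch, not the authors' implementation, of the usual pattern in click-based methods. Positive/negative clicks are rasterized into a two-channel disk map, and that map is patchified and linearly projected with the same patch size as the image, so click tokens can simply be added to image tokens before the ViT blocks. All function names, the disk `radius`, and the toy shapes below are illustrative assumptions.

```python
import numpy as np

def click_maps(clicks, hw, radius=5):
    """Rasterize clicks into a 2-channel map: channel 0 holds positive
    (object) clicks, channel 1 negative (background) clicks, each drawn
    as a binary disk of the given radius."""
    h, w = hw
    maps = np.zeros((2, h, w), dtype=np.float32)
    yy, xx = np.mgrid[0:h, 0:w]
    for y, x, is_positive in clicks:
        disk = (yy - y) ** 2 + (xx - x) ** 2 <= radius ** 2
        maps[0 if is_positive else 1][disk] = 1.0
    return maps

def patch_embed(x, weight, patch=16):
    """Split (C, H, W) into non-overlapping patch x patch tiles and
    project each flattened tile with `weight` -- equivalent to a
    stride-`patch` convolution, as in ViT's patch embedding."""
    c, h, w = x.shape
    tiles = x.reshape(c, h // patch, patch, w // patch, patch)
    tiles = tiles.transpose(1, 3, 0, 2, 4).reshape(-1, c * patch * patch)
    return tiles @ weight  # (num_patches, embed_dim)

# Toy demo: 32x32 RGB image with one positive click at (10, 10).
rng = np.random.default_rng(0)
img = rng.random((3, 32, 32), dtype=np.float32)
clk = click_maps([(10, 10, True)], (32, 32))
w_img = rng.random((3 * 16 * 16, 64), dtype=np.float32)
w_clk = rng.random((2 * 16 * 16, 64), dtype=np.float32)
# "Symmetric": image and click maps pass through the same kind of
# patch embedding; their token sequences are summed elementwise.
tokens = patch_embed(img, w_img) + patch_embed(clk, w_clk)
print(tokens.shape)  # (4, 64)
```

Because the clicks enter only through this added embedding branch, the pretrained ViT weights are reused unchanged, which is what lets a plain MAE-pretrained backbone be fine-tuned for the interactive task.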
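For readers unfamiliar with the NoC@90 metric quoted above (Number of Clicks to reach 90% IoU): each object is segmented with simulated clicks until the predicted mask's IoU with the ground truth reaches the threshold, and the click counts are averaged over the dataset. A minimal sketch of the bookkeeping, assuming per-click IoU trajectories have already been produced by some interactive model; the trajectories below are made up for illustration.

```python
def noc_at(iou_per_click, threshold=0.90, max_clicks=20):
    """First click index (1-based) at which IoU reaches the threshold;
    if never reached within max_clicks, the sample counts as max_clicks."""
    for i, iou in enumerate(iou_per_click[:max_clicks], start=1):
        if iou >= threshold:
            return i
    return max_clicks

def mean_noc(trajectories, threshold=0.90, max_clicks=20):
    """Mean NoC over a dataset -- e.g. NoC@90 averaged over SBD objects."""
    return sum(noc_at(t, threshold, max_clicks) for t in trajectories) / len(trajectories)

# Hypothetical IoU-after-each-click trajectories for three objects:
trajs = [[0.60, 0.85, 0.92], [0.95], [0.50, 0.70, 0.80, 0.88, 0.91]]
print(mean_noc(trajs))  # (3 + 1 + 5) / 3 = 3.0
```

Lower is better: the reported 4.15 NoC@90 means that, on average, just over four clicks suffice to reach 90% IoU on SBD.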