AISFormer: Amodal Instance Segmentation with Transformer

Minh-Triet Tran, Khoa T. Vo, Kashu Yamazaki, Arthur Fernandes, Michael Kidd, Ngan T. H. Le
{"title":"AISFormer:带变压器的模态实例分割","authors":"Minh-Triet Tran, Khoa T. Vo, Kashu Yamazaki, Arthur Fernandes, Michael Kidd, Ngan T. H. Le","doi":"10.48550/arXiv.2210.06323","DOIUrl":null,"url":null,"abstract":"Amodal Instance Segmentation (AIS) aims to segment the region of both visible and possible occluded parts of an object instance. While Mask R-CNN-based AIS approaches have shown promising results, they are unable to model high-level features coherence due to the limited receptive field. The most recent transformer-based models show impressive performance on vision tasks, even better than Convolution Neural Networks (CNN). In this work, we present AISFormer, an AIS framework, with a Transformer-based mask head. AISFormer explicitly models the complex coherence between occluder, visible, amodal, and invisible masks within an object's regions of interest by treating them as learnable queries. Specifically, AISFormer contains four modules: (i) feature encoding: extract ROI and learn both short-range and long-range visual features. (ii) mask transformer decoding: generate the occluder, visible, and amodal mask query embeddings by a transformer decoder (iii) invisible mask embedding: model the coherence between the amodal and visible masks, and (iv) mask predicting: estimate output masks including occluder, visible, amodal and invisible. We conduct extensive experiments and ablation studies on three challenging benchmarks i.e. KINS, D2SA, and COCOA-cls to evaluate the effectiveness of AISFormer. The code is available at: https://github.com/UARK-AICV/AISFormer","PeriodicalId":72437,"journal":{"name":"BMVC : proceedings of the British Machine Vision Conference. British Machine Vision Conference","volume":"27 1","pages":"712"},"PeriodicalIF":0.0000,"publicationDate":"2022-10-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"9","resultStr":"{\"title\":\"AISFormer: Amodal Instance Segmentation with Transformer\",\"authors\":\"Minh-Triet Tran, Khoa T. Vo, Kashu Yamazaki, Arthur Fernandes, Michael Kidd, Ngan T. H. Le\",\"doi\":\"10.48550/arXiv.2210.06323\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Amodal Instance Segmentation (AIS) aims to segment the region of both visible and possible occluded parts of an object instance. While Mask R-CNN-based AIS approaches have shown promising results, they are unable to model high-level features coherence due to the limited receptive field. The most recent transformer-based models show impressive performance on vision tasks, even better than Convolution Neural Networks (CNN). In this work, we present AISFormer, an AIS framework, with a Transformer-based mask head. AISFormer explicitly models the complex coherence between occluder, visible, amodal, and invisible masks within an object's regions of interest by treating them as learnable queries. Specifically, AISFormer contains four modules: (i) feature encoding: extract ROI and learn both short-range and long-range visual features. (ii) mask transformer decoding: generate the occluder, visible, and amodal mask query embeddings by a transformer decoder (iii) invisible mask embedding: model the coherence between the amodal and visible masks, and (iv) mask predicting: estimate output masks including occluder, visible, amodal and invisible. We conduct extensive experiments and ablation studies on three challenging benchmarks i.e. KINS, D2SA, and COCOA-cls to evaluate the effectiveness of AISFormer. 
The code is available at: https://github.com/UARK-AICV/AISFormer\",\"PeriodicalId\":72437,\"journal\":{\"name\":\"BMVC : proceedings of the British Machine Vision Conference. British Machine Vision Conference\",\"volume\":\"27 1\",\"pages\":\"712\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-10-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"9\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"BMVC : proceedings of the British Machine Vision Conference. British Machine Vision Conference\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.48550/arXiv.2210.06323\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"BMVC : proceedings of the British Machine Vision Conference. British Machine Vision Conference","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.48550/arXiv.2210.06323","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 9

Abstract

Amodal Instance Segmentation (AIS) aims to segment the region of both the visible and possibly occluded parts of an object instance. While Mask R-CNN-based AIS approaches have shown promising results, they are unable to model high-level feature coherence due to their limited receptive field. Recent transformer-based models show impressive performance on vision tasks, even better than Convolutional Neural Networks (CNNs). In this work, we present AISFormer, an AIS framework with a Transformer-based mask head. AISFormer explicitly models the complex coherence between occluder, visible, amodal, and invisible masks within an object's regions of interest by treating them as learnable queries. Specifically, AISFormer contains four modules: (i) feature encoding: extract the ROI and learn both short-range and long-range visual features; (ii) mask transformer decoding: generate the occluder, visible, and amodal mask query embeddings with a transformer decoder; (iii) invisible mask embedding: model the coherence between the amodal and visible masks; and (iv) mask predicting: estimate the output masks, including occluder, visible, amodal, and invisible. We conduct extensive experiments and ablation studies on three challenging benchmarks, i.e., KINS, D2SA, and COCOA-cls, to evaluate the effectiveness of AISFormer. The code is available at: https://github.com/UARK-AICV/AISFormer
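
As a rough illustration of how the four modules described in the abstract might fit together, the following PyTorch-style sketch builds a mask head with three learnable queries (occluder, visible, amodal) decoded against encoded ROI features, and derives an invisible-mask embedding from the amodal and visible queries. The module names, dimensions, the small MLP used for the invisible embedding, and the dot-product mask prediction are illustrative assumptions, not the authors' implementation; refer to the linked repository for the actual code.

```python
# Minimal sketch of a transformer-based amodal mask head, assuming PyTorch.
# All design details below are illustrative; see https://github.com/UARK-AICV/AISFormer
# for the authors' implementation.
import torch
import torch.nn as nn

class AmodalMaskHeadSketch(nn.Module):
    def __init__(self, embed_dim=256, num_heads=8, num_layers=2):
        super().__init__()
        # (i) feature encoding: mix short- and long-range cues within the ROI.
        enc_layer = nn.TransformerEncoderLayer(embed_dim, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers)
        # (ii) mask transformer decoding: learnable queries for the
        # occluder, visible, and amodal masks.
        self.mask_queries = nn.Embedding(3, embed_dim)
        dec_layer = nn.TransformerDecoderLayer(embed_dim, num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers)
        # (iii) invisible mask embedding: derived from the amodal and visible
        # query embeddings (a small MLP on their concatenation, one plausible choice).
        self.invisible_mlp = nn.Sequential(
            nn.Linear(2 * embed_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim)
        )

    def forward(self, roi_feats):
        # roi_feats: (B, C, H, W) features pooled for one ROI per batch element.
        B, C, H, W = roi_feats.shape
        tokens = roi_feats.flatten(2).transpose(1, 2)            # (B, H*W, C)
        tokens = self.encoder(tokens)                            # long-range context
        queries = self.mask_queries.weight.unsqueeze(0).expand(B, -1, -1)
        q = self.decoder(queries, tokens)                        # (B, 3, C)
        occluder_q, visible_q, amodal_q = q[:, 0], q[:, 1], q[:, 2]
        invisible_q = self.invisible_mlp(torch.cat([amodal_q, visible_q], dim=-1))
        # (iv) mask predicting: dot-product each query embedding with the
        # per-pixel ROI features to obtain mask logits.
        pixel_feats = tokens.transpose(1, 2).reshape(B, C, H, W)
        all_q = torch.stack([occluder_q, visible_q, amodal_q, invisible_q], dim=1)
        masks = torch.einsum("bqc,bchw->bqhw", all_q, pixel_feats)
        return masks  # (B, 4, H, W): occluder, visible, amodal, invisible logits
```

The point of the sketch is the coupling: all four masks are predicted from queries attending to the same encoded ROI, so one decoder can model their mutual coherence instead of each mask being produced by an independent convolutional head.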