AISFormer: Amodal Instance Segmentation with Transformer

Minh-Triet Tran, Khoa T. Vo, Kashu Yamazaki, Arthur Fernandes, Michael Kidd, Ngan T. H. Le
{"title":"AISFormer:带变压器的模态实例分割","authors":"Minh-Triet Tran, Khoa T. Vo, Kashu Yamazaki, Arthur Fernandes, Michael Kidd, Ngan T. H. Le","doi":"10.48550/arXiv.2210.06323","DOIUrl":null,"url":null,"abstract":"Amodal Instance Segmentation (AIS) aims to segment the region of both visible and possible occluded parts of an object instance. While Mask R-CNN-based AIS approaches have shown promising results, they are unable to model high-level features coherence due to the limited receptive field. The most recent transformer-based models show impressive performance on vision tasks, even better than Convolution Neural Networks (CNN). In this work, we present AISFormer, an AIS framework, with a Transformer-based mask head. AISFormer explicitly models the complex coherence between occluder, visible, amodal, and invisible masks within an object's regions of interest by treating them as learnable queries. Specifically, AISFormer contains four modules: (i) feature encoding: extract ROI and learn both short-range and long-range visual features. (ii) mask transformer decoding: generate the occluder, visible, and amodal mask query embeddings by a transformer decoder (iii) invisible mask embedding: model the coherence between the amodal and visible masks, and (iv) mask predicting: estimate output masks including occluder, visible, amodal and invisible. We conduct extensive experiments and ablation studies on three challenging benchmarks i.e. KINS, D2SA, and COCOA-cls to evaluate the effectiveness of AISFormer. The code is available at: https://github.com/UARK-AICV/AISFormer","PeriodicalId":72437,"journal":{"name":"BMVC : proceedings of the British Machine Vision Conference. British Machine Vision Conference","volume":"27 1","pages":"712"},"PeriodicalIF":0.0000,"publicationDate":"2022-10-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"9","resultStr":"{\"title\":\"AISFormer: Amodal Instance Segmentation with Transformer\",\"authors\":\"Minh-Triet Tran, Khoa T. Vo, Kashu Yamazaki, Arthur Fernandes, Michael Kidd, Ngan T. H. Le\",\"doi\":\"10.48550/arXiv.2210.06323\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Amodal Instance Segmentation (AIS) aims to segment the region of both visible and possible occluded parts of an object instance. While Mask R-CNN-based AIS approaches have shown promising results, they are unable to model high-level features coherence due to the limited receptive field. The most recent transformer-based models show impressive performance on vision tasks, even better than Convolution Neural Networks (CNN). In this work, we present AISFormer, an AIS framework, with a Transformer-based mask head. AISFormer explicitly models the complex coherence between occluder, visible, amodal, and invisible masks within an object's regions of interest by treating them as learnable queries. Specifically, AISFormer contains four modules: (i) feature encoding: extract ROI and learn both short-range and long-range visual features. (ii) mask transformer decoding: generate the occluder, visible, and amodal mask query embeddings by a transformer decoder (iii) invisible mask embedding: model the coherence between the amodal and visible masks, and (iv) mask predicting: estimate output masks including occluder, visible, amodal and invisible. We conduct extensive experiments and ablation studies on three challenging benchmarks i.e. KINS, D2SA, and COCOA-cls to evaluate the effectiveness of AISFormer. 
The code is available at: https://github.com/UARK-AICV/AISFormer\",\"PeriodicalId\":72437,\"journal\":{\"name\":\"BMVC : proceedings of the British Machine Vision Conference. British Machine Vision Conference\",\"volume\":\"27 1\",\"pages\":\"712\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-10-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"9\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"BMVC : proceedings of the British Machine Vision Conference. British Machine Vision Conference\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.48550/arXiv.2210.06323\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"BMVC : proceedings of the British Machine Vision Conference. British Machine Vision Conference","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.48550/arXiv.2210.06323","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 9

Abstract

Amodal Instance Segmentation (AIS) aims to segment the region of both the visible and possibly occluded parts of an object instance. While Mask R-CNN-based AIS approaches have shown promising results, they are unable to model high-level feature coherence due to their limited receptive field. Recent transformer-based models show impressive performance on vision tasks, even better than Convolutional Neural Networks (CNNs). In this work, we present AISFormer, an AIS framework with a Transformer-based mask head. AISFormer explicitly models the complex coherence between occluder, visible, amodal, and invisible masks within an object's regions of interest by treating them as learnable queries. Specifically, AISFormer contains four modules: (i) feature encoding: extract the ROI and learn both short-range and long-range visual features; (ii) mask transformer decoding: generate the occluder, visible, and amodal mask query embeddings with a transformer decoder; (iii) invisible mask embedding: model the coherence between the amodal and visible masks; and (iv) mask predicting: estimate the output masks, including occluder, visible, amodal, and invisible. We conduct extensive experiments and ablation studies on three challenging benchmarks, i.e., KINS, D2SA, and COCOA-cls, to evaluate the effectiveness of AISFormer. The code is available at: https://github.com/UARK-AICV/AISFormer
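
As a rough illustration of how the four modules described in the abstract might fit together, the following PyTorch-style sketch builds a mask head with three learnable queries (occluder, visible, amodal) decoded against encoded ROI features, and derives an invisible-mask embedding from the amodal and visible queries. The module names, dimensions, the small MLP used for the invisible embedding, and the dot-product mask prediction are illustrative assumptions, not the authors' implementation; refer to the linked repository for the actual code.

```python
# Minimal sketch of a transformer-based amodal mask head, assuming PyTorch.
# All design details below are illustrative; see https://github.com/UARK-AICV/AISFormer
# for the authors' implementation.
import torch
import torch.nn as nn

class AmodalMaskHeadSketch(nn.Module):
    def __init__(self, embed_dim=256, num_heads=8, num_layers=2):
        super().__init__()
        # (i) feature encoding: mix short- and long-range cues within the ROI.
        enc_layer = nn.TransformerEncoderLayer(embed_dim, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers)
        # (ii) mask transformer decoding: learnable queries for the
        # occluder, visible, and amodal masks.
        self.mask_queries = nn.Embedding(3, embed_dim)
        dec_layer = nn.TransformerDecoderLayer(embed_dim, num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers)
        # (iii) invisible mask embedding: derived from the amodal and visible
        # query embeddings (a small MLP on their concatenation, one plausible choice).
        self.invisible_mlp = nn.Sequential(
            nn.Linear(2 * embed_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim)
        )

    def forward(self, roi_feats):
        # roi_feats: (B, C, H, W) features pooled for one ROI per batch element.
        B, C, H, W = roi_feats.shape
        tokens = roi_feats.flatten(2).transpose(1, 2)            # (B, H*W, C)
        tokens = self.encoder(tokens)                            # long-range context
        queries = self.mask_queries.weight.unsqueeze(0).expand(B, -1, -1)
        q = self.decoder(queries, tokens)                        # (B, 3, C)
        occluder_q, visible_q, amodal_q = q[:, 0], q[:, 1], q[:, 2]
        invisible_q = self.invisible_mlp(torch.cat([amodal_q, visible_q], dim=-1))
        # (iv) mask predicting: dot-product each query embedding with the
        # per-pixel ROI features to obtain mask logits.
        pixel_feats = tokens.transpose(1, 2).reshape(B, C, H, W)
        all_q = torch.stack([occluder_q, visible_q, amodal_q, invisible_q], dim=1)
        masks = torch.einsum("bqc,bchw->bqhw", all_q, pixel_feats)
        return masks  # (B, 4, H, W): occluder, visible, amodal, invisible logits
```

The point of the sketch is the coupling: all four masks are predicted from queries attending to the same encoded ROI, so one decoder can model their mutual coherence instead of each mask being produced by an independent convolutional head.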