Cathaoir Agnew, Eoin M. Grua, Pepijn Van de Ven, Patrick Denny, Ciarán Eising, Anthony Scanlan
Pretraining instance segmentation models with bounding box annotations
Intelligent Systems with Applications, vol. 24, Article 200454. Published 2024-10-28. DOI: 10.1016/j.iswa.2024.200454
https://www.sciencedirect.com/science/article/pii/S2667305324001285
Abstract
Annotating datasets for fully supervised instance segmentation can be arduous and time-consuming, requiring significant effort and cost. Producing bounding box annotations instead reduces this investment considerably, but bounding box annotations alone are not suitable for instance segmentation. This work uses ground truth bounding boxes to define coarsely annotated polygon masks, which we refer to as weak annotations, on which the models are pretrained. We investigate the effect of pretraining on data with weak annotations and then fine-tuning on data with strong annotations, that is, finely annotated polygon masks for instance segmentation. Experiments were conducted on the COCO 2017 detection dataset with three model architectures: SOLOv2, Mask-RCNN, and Mask2former. The Cityscapes and Pascal VOC 2012 datasets were used to validate the approach. The empirical results suggest two key outcomes. First, a sequential approach to annotating large-scale instance segmentation datasets would be beneficial, enabling higher-performing models in shorter timeframes: bounding boxes are labeled first, followed by polygon masks. Second, object detection datasets can be leveraged for pretraining instance segmentation models while maintaining competitive results on the downstream task. This is reflected in the best-performing model, Mask2former with a Swin-L backbone, which achieves 97.5%, 100.4%, and 101.3% of the fully supervised performance using just 1%, 5%, and 10%, respectively, of the instance segmentation annotations of the COCO training dataset.
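The idea of deriving a weak annotation from a ground truth bounding box can be illustrated with a minimal sketch. The abstract does not specify how the coarse polygon is constructed, so the sketch assumes the simplest case: the weak mask is the rectangle covering the whole box. It relies only on the COCO annotation conventions, where `bbox` is `[x, y, width, height]` and a polygon `segmentation` is a list of flat `[x1, y1, x2, y2, ...]` coordinate lists. The function names `bbox_to_weak_polygon` and `weaken_annotation` are hypothetical and are not taken from the paper.

```python
def bbox_to_weak_polygon(bbox):
    """Turn a COCO-style [x, y, width, height] box into a 4-corner polygon.

    Assumption: the weak mask is the full box rectangle; the paper's exact
    construction is not given in the abstract.
    """
    x, y, w, h = bbox
    # Corners in order: top-left, top-right, bottom-right, bottom-left,
    # flattened as [x1, y1, x2, y2, ...] as in COCO polygon segmentations.
    return [x, y, x + w, y, x + w, y + h, x, y + h]


def weaken_annotation(ann):
    """Return a copy of a COCO annotation whose mask is the box rectangle."""
    weak = dict(ann)  # shallow copy; the original annotation is not modified
    weak["segmentation"] = [bbox_to_weak_polygon(ann["bbox"])]
    # The mask area becomes the box area, since the mask now fills the box.
    weak["area"] = ann["bbox"][2] * ann["bbox"][3]
    return weak


if __name__ == "__main__":
    # Illustrative annotation with made-up coordinates.
    strong = {
        "id": 1,
        "category_id": 3,
        "bbox": [10.0, 20.0, 30.0, 40.0],
        "segmentation": [[12.0, 22.0, 38.0, 25.0, 30.0, 58.0]],
        "area": 500.0,
    }
    print(weaken_annotation(strong)["segmentation"])
```

Applied to every annotation of a detection dataset, a conversion of this kind yields a file in instance segmentation format that a model could be pretrained on before fine-tuning on a smaller set of finely annotated masks.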