Deep Learning And Interactivity For Video Rotoscoping
Shivam Saboo, F. Lefèbvre, Vincent Demoulin
2020 IEEE International Conference on Image Processing (ICIP), 2020-10-01
DOI: https://doi.org/10.1109/ICIP40778.2020.9191057
Citation count: 2
Abstract
In this work we extend the idea of object co-segmentation [10] to perform interactive video segmentation. Our framework predicts the coordinates of vertices along the boundary of an object for two frames of a video simultaneously. The predicted vertices are interactive: a user correction on one frame helps the network correct the predictions for both frames. We employ an attention mechanism at the encoder stage and a simple combination network at the decoder stage, which together allow the network to perform this simultaneous correction efficiently. The framework is also robust to the distance between the two input frames, handling gaps of up to 50 frames between them. We train our model on a professional dataset consisting of pixel-accurate annotations produced by professional roto artists. We test our model on DAVIS [15] and achieve state-of-the-art results in both automatic and interactive modes, surpassing Curve-GCN [11] and PolyRNN++ [1].
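
To make the two-frame input concrete, the following is a minimal sketch of how a pair of frames separated by at most 50 frames might be drawn from a video. The abstract does not describe the actual sampling procedure, so the helper `sample_frame_pair`, its parameters, and the uniform sampling are all assumptions made for illustration; only the 50-frame limit comes from the text.

```python
import random

MAX_GAP = 50  # largest frame distance the abstract reports as handled


def sample_frame_pair(num_frames, max_gap=MAX_GAP, rng=random):
    """Return indices (i, j) with 1 <= j - i <= max_gap.

    Hypothetical helper: uniform sampling is an assumption, not the
    authors' documented procedure.
    """
    if num_frames < 2:
        raise ValueError("need at least two frames to form a pair")
    # Pick the first frame, leaving room for at least one later frame.
    i = rng.randrange(num_frames - 1)
    # Pick a gap that stays inside the video and within max_gap (inclusive).
    gap = rng.randint(1, min(max_gap, num_frames - 1 - i))
    return i, i + gap


if __name__ == "__main__":
    first, second = sample_frame_pair(num_frames=80)
    print(f"frames {first} and {second}, gap {second - first}")
```

Passing a seeded `random.Random` instance as `rng` makes the pairs reproducible, which may be convenient when comparing runs.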