3D convolutional long short-term encoder-decoder network for moving object segmentation
Authors: Anil Turker, Ender Eksioglu
DOI: 10.2298/csis230129044t
Journal: Computer Science and Information Systems (JCR Q4, Computer Science, Information Systems; Impact Factor 1.2)
Publication date: 2023-01-01
Publication type: Journal Article
Citations: 0
Abstract
Moving object segmentation (MOS) is an important and well-studied computer vision task used in a variety of applications, such as video surveillance systems, human tracking, self-driving cars, and video compression. While traditional approaches to MOS rely on hand-crafted features or background modeling, deep learning methods using Convolutional Neural Networks (CNNs) have been shown to be more effective at extracting features and achieving better accuracy. However, most deep learning-based methods for MOS offer scene-dependent solutions, leading to reduced performance when tested on previously unseen video content. Because spatial features alone are insufficient to represent motion information, spatial and temporal features should be used together to succeed on unseen videos. To address this issue, we propose the MOS-Net deep framework, an encoder-decoder network that combines spatial and temporal features using the flux tensor algorithm, 3D CNNs, and ConvLSTM in its different variants. MOS-Net 2.0 is an enhanced version of the base MOS-Net structure, where additional ConvLSTM modules are added to the 3D CNNs for extracting long-term spatiotemporal features. In the final stage of the framework, the output of the encoder-decoder network, the foreground probability map, is thresholded to produce a binary mask in which moving objects form the foreground and the rest forms the background. In addition, an ablation study has been conducted to evaluate different combinations of inputs to the proposed network, using the ChangeDetection2014 (CDnet2014) dataset, which includes challenging videos such as those with dynamic backgrounds, bad weather, and illumination changes. In most approaches, the training and test strategy is not disclosed, making it difficult to compare algorithm results. In addition, a proposed method can be evaluated in different ways, as video-optimized or video-agnostic. In video-optimized approaches, the training and test sets are drawn randomly from the overall dataset and kept separate. The results of the proposed method are compared with competitive methods from the literature using the same evaluation strategy. It has been observed that the introduced MOS networks give highly competitive results on the CDnet2014 dataset. The source code for the simulations provided in this work is available online.
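The final stage described above, turning the network's foreground probability map into a binary mask, can be sketched in a few lines of NumPy. This is an illustrative sketch only: the function name and the 0.5 threshold are assumptions for demonstration, not the paper's tuned values.

```python
import numpy as np

def probability_map_to_mask(prob_map: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Binarize a per-pixel foreground probability map into a moving-object mask.

    Pixels whose foreground probability exceeds the threshold become
    foreground (1); all other pixels become background (0). The 0.5
    default is an illustrative choice, not the value used in the paper.
    """
    return (prob_map > threshold).astype(np.uint8)

# Toy 2x3 probability map standing in for the encoder-decoder output.
prob = np.array([[0.9, 0.2, 0.7],
                 [0.1, 0.6, 0.4]])
mask = probability_map_to_mask(prob)
# mask is now a binary array: 1 where prob > 0.5, 0 elsewhere.
```

In practice the threshold trades off precision against recall: a higher threshold suppresses spurious foreground pixels at the cost of missing faint motion.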
Aims and scope
Computer Science and Information Systems (ComSIS) is an international refereed journal, published in Serbia. The objective of ComSIS is to communicate important research and development results in the areas of computer science, software engineering, and information systems.