3D convolutional long short-term encoder-decoder network for moving object segmentation
Authors: Anil Turker, Ender Eksioglu
DOI: 10.2298/csis230129044t
Journal: Computer Science and Information Systems (JCR Q4, Computer Science, Information Systems; Impact Factor 1.2)
Publication date: 2023-01-01
Publication type: Journal Article
Citations: 0
Abstract
Moving object segmentation (MOS) is an important and well-studied computer vision task used in a variety of applications, such as video surveillance systems, human tracking, self-driving cars, and video compression. While traditional approaches to MOS rely on hand-crafted features or background modeling, deep learning methods using Convolutional Neural Networks (CNNs) have been shown to be more effective at extracting features and achieving better accuracy. However, most deep learning-based methods for MOS offer scene-dependent solutions, leading to reduced performance when tested on previously unseen video content. Because spatial features alone are insufficient to represent motion information, spatial and temporal features should be used together to succeed on unseen videos. To address this issue, we propose the MOS-Net deep framework, an encoder-decoder network that combines spatial and temporal features using the flux tensor algorithm, 3D CNNs, and ConvLSTM in its different variants. MOS-Net 2.0 is an enhanced version of the base MOS-Net structure, where additional ConvLSTM modules are added to the 3D CNNs for extracting long-term spatiotemporal features. In the final stage of the framework, the output of the encoder-decoder network, the foreground probability map, is thresholded to produce a binary mask in which moving objects form the foreground and the rest forms the background. In addition, an ablation study has been conducted to evaluate different combinations of inputs to the proposed network, using the ChangeDetection2014 (CDnet2014) dataset, which includes challenging videos such as those with dynamic backgrounds, bad weather, and illumination changes. In most approaches, the training and test strategy is not disclosed, making it difficult to compare algorithm results. In addition, a proposed method can be evaluated in different ways, as video-optimized or video-agnostic. In video-optimized approaches, the training and test sets are drawn randomly from the overall dataset and kept separate. The results of the proposed method are compared with competitive methods from the literature using the same evaluation strategy. It has been observed that the introduced MOS networks give highly competitive results on the CDnet2014 dataset. The source code for the simulations provided in this work is available online.
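The final stage described above, turning the network's foreground probability map into a binary mask, can be sketched in a few lines of NumPy. This is an illustrative sketch only: the function name and the 0.5 threshold are assumptions for demonstration, not the paper's tuned values.

```python
import numpy as np

def probability_map_to_mask(prob_map: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Binarize a per-pixel foreground probability map into a moving-object mask.

    Pixels whose foreground probability exceeds the threshold become
    foreground (1); all other pixels become background (0). The 0.5
    default is an illustrative choice, not the value used in the paper.
    """
    return (prob_map > threshold).astype(np.uint8)

# Toy 2x3 probability map standing in for the encoder-decoder output.
prob = np.array([[0.9, 0.2, 0.7],
                 [0.1, 0.6, 0.4]])
mask = probability_map_to_mask(prob)
# mask is now a binary array: 1 where prob > 0.5, 0 elsewhere.
```

In practice the threshold trades off precision against recall: a higher threshold suppresses spurious foreground pixels at the cost of missing faint motion.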
Aims and scope
Computer Science and Information Systems (ComSIS) is an international refereed journal, published in Serbia. The objective of ComSIS is to communicate important research and development results in the areas of computer science, software engineering, and information systems.