Optimal Input Selection for Single Object Tracking using RGB-Thermal Camera
Siti Raihanah Abdani, Mohd Asyraf Zulkifley
2022 IEEE 12th Symposium on Computer Applications & Industrial Electronics (ISCAIE), published 2022-05-21
DOI: https://doi.org/10.1109/iscaie54458.2022.9794503
Abstract
Object tracking is used in a wide range of intelligent applications, including surveillance, autonomous cars, smart harvesting, and action recognition systems. In a video-based setting, an object tracking algorithm aims to associate the object of interest across frames by building its movement trajectory. The most popular sensing input for tracking algorithms is the RGB channels, yet RGB performs relatively poorly in low-light surroundings, especially when the object’s appearance is similar to the background. Therefore, multi-modal input that combines RGB and thermal images has been explored to overcome the weakness of single-modality input. For a tracker based on the scoring output of convolutional neural networks, pre-trained weights are usually used for the feature extraction module. It is the norm to freeze the weights in the convolutional layers and to fit parameters only in the fully connected layers. Since the pre-trained weights are built for three-channel input, the optimal number of input channels is only three, which poses a problem for a tracker with RGB-Thermal input. Two schemes are devised in this work: slicing the pre-trained weights to accommodate an additional thermal channel, or duplicating the thermal channel into a three-channel format. The performance of the resulting 4D (four-channel) and 6D (six-channel) inputs is tested on three state-of-the-art trackers: MDNet, TCNN, and MMCNN. The best result was produced by TCNN-4D, with an expected average overlap of 0.2534, accuracy of 0.5963, and reliability of 0.9329. The results indicate that an optimized slicing method that selects the best pre-trained weights produces a significant tracking improvement even when fewer input channels are used.

Index Terms: Single Object Tracking, RGB-Thermal Camera, Convolutional Neural Networks, Optimal Input Selection
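The two input schemes named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not say how the pre-trained weights are sliced or which slice is selected, so the helper names, the `thermal_src` parameter, and the choice to reuse one pre-trained input-channel slice for the thermal channel are all assumptions made for illustration. First-layer kernels are assumed to have shape `(out_channels, 3, k, k)`.

```python
import numpy as np


def make_4ch_input(rgb, thermal):
    """Stack an HxWx3 RGB frame and an HxW thermal frame into HxWx4."""
    return np.concatenate([rgb, thermal[..., None]], axis=-1)


def make_6ch_input(rgb, thermal):
    """Copy the thermal frame into three channels and stack with RGB (HxWx6)."""
    thermal3 = np.repeat(thermal[..., None], 3, axis=-1)
    return np.concatenate([rgb, thermal3], axis=-1)


def slice_weights_4ch(w_rgb, thermal_src=0):
    """Extend (out, 3, k, k) kernels to (out, 4, k, k).

    Assumption: the thermal channel reuses the pre-trained slice for input
    channel `thermal_src` (a hypothetical parameter, not from the paper).
    """
    extra = w_rgb[:, thermal_src:thermal_src + 1]
    return np.concatenate([w_rgb, extra], axis=1)


def duplicate_weights_6ch(w_rgb):
    """Extend (out, 3, k, k) kernels to (out, 6, k, k) by reusing all three
    pre-trained slices for the replicated thermal channels (an assumption)."""
    return np.concatenate([w_rgb, w_rgb], axis=1)


# Example with random data standing in for a frame pair and pre-trained kernels.
rng = np.random.default_rng(0)
rgb = rng.random((8, 8, 3))
thermal = rng.random((8, 8))
w_rgb = rng.random((96, 3, 7, 7))

x4 = make_4ch_input(rgb, thermal)
x6 = make_6ch_input(rgb, thermal)
w4 = slice_weights_4ch(w_rgb, thermal_src=1)
w6 = duplicate_weights_6ch(w_rgb)
```

In this sketch, changing `thermal_src` stands in for the selection step that the abstract calls an optimized slicing method; how the best slice is actually chosen is not described there.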