Visual information fidelity based frame level rate control for H.265/HEVC
Luheng Jia, Haoqiang Ren, Zuhai Zhang, Li Song, Kebin Jia
Signal Processing: Image Communication, vol. 131, Article 117245. DOI: 10.1016/j.image.2024.117245. Published 2024-11-28.

Abstract: Rate control in video coding seeks a trade-off between bitrate and reconstruction quality, and is therefore closely tied to image quality assessment. The widely used mean squared error (MSE) poorly describes human visual characteristics, so rate control algorithms based on MSE often fail to deliver optimal visual quality. To address this issue, we propose a frame-level rate control algorithm that uses a simplified version of visual information fidelity (VIF) as its quality criterion to improve coding efficiency. First, we simplify VIF and establish its relationship with MSE, which reduces the computational complexity enough for VIF to be used within a video coding framework. We then establish the relationship between the VIF-based λ and the MSE-based λ for λ-domain rate control, covering both bit allocation and parameter adjustment. Using the VIF-based λ directly integrates VIF-based distortion into the MSE-based rate-distortion optimized coding framework. Experimental results demonstrate that the proposed method outperforms the default frame-level rate control algorithm by 3.4%, 4.0%, and 3.3% on average under the PSNR, SSIM, and VMAF distortion metrics, respectively. Furthermore, the proposed method reduces quality fluctuation of the reconstructed video in the high-bitrate range and improves bitrate accuracy under the hierarchical configuration.
{"title":"Transformer-based multiview spatiotemporal feature interactive fusion for human action recognition in depth videos","authors":"Hanbo Wu, Xin Ma, Yibin Li","doi":"10.1016/j.image.2024.117244","DOIUrl":"10.1016/j.image.2024.117244","url":null,"abstract":"<div><div>Spatiotemporal feature modeling is the key to human action recognition task. Multiview data is helpful in acquiring numerous clues to improve the robustness and accuracy of feature description. However, multiview action recognition has not been well explored yet. Most existing methods perform action recognition only from a single view, which leads to the limited performance. Depth data is insensitive to illumination and color variations and offers significant advantages by providing reliable 3D geometric information of the human body. In this study, we concentrate on action recognition from depth videos and introduce a transformer-based framework for the interactive fusion of multiview spatiotemporal features, facilitating effective action recognition through deep integration of multiview information. Specifically, the proposed framework consists of intra-view spatiotemporal feature modeling (ISTFM) and cross-view feature interactive fusion (CFIF). Firstly, we project a depth video into three orthogonal views to construct multiview depth dynamic volumes that describe the 3D spatiotemporal evolution of human actions. ISTFM takes multiview depth dynamic volumes as input to extract spatiotemporal features of three views with 3D CNN, then applies self-attention mechanism in transformer to model global context dependency within each view. CFIF subsequently extends self-attention into cross-attention to conduct deep interaction between different views, and further integrates cross-view features together to generate a multiview joint feature representation. Our proposed method is tested on two large-scale RGBD datasets by extensive experiments to demonstrate the remarkable improvement for enhancing the recognition performance.</div></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"131 ","pages":"Article 117244"},"PeriodicalIF":3.4,"publicationDate":"2024-11-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142745549","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Vocal cord anomaly detection based on Local Fine-Grained Contour Features","authors":"Yuqi Fan , Han Ye , Xiaohui Yuan","doi":"10.1016/j.image.2024.117225","DOIUrl":"10.1016/j.image.2024.117225","url":null,"abstract":"<div><div>Laryngoscopy is a popular examination for vocal cord disease diagnosis. The conventional screening of laryngoscopic images is labor-intensive and depends heavily on the experience of the medical specialists. Automatic detection of vocal cord diseases from laryngoscopic images is highly sought to assist regular image reading. In laryngoscopic images, the symptoms of vocal cord diseases are concentrated in the inner vocal cord contour, which is often characterized as vegetation and small protuberances. The existing classification methods pay little, if any, attention to the role of vocal cord contour in the diagnosis of vocal cord diseases and fail to effectively capture the fine-grained features. In this paper, we propose a novel Local Fine-grained Contour Feature extraction method for vocal cord anomaly detection. Our proposed method consists of four stages: image segmentation to obtain the overall vocal cord contour, inner vocal cord contour isolation to obtain the inner contour curve by comparing the changes of adjacent pixel values, extraction of the latent feature in the inner vocal cord contour by taking the tangent inclination angle of each point on the contour as the latent feature, and the classification module. Our experimental results demonstrate that the proposed method improves the detection performance of vocal cord anomaly and achieves an accuracy of 97.21%.</div></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"131 ","pages":"Article 117225"},"PeriodicalIF":3.4,"publicationDate":"2024-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142700767","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SES-ReNet: Lightweight deep learning model for human detection in hazy weather conditions","authors":"Yassine Bouafia , Mohand Saïd Allili , Loucif Hebbache , Larbi Guezouli","doi":"10.1016/j.image.2024.117223","DOIUrl":"10.1016/j.image.2024.117223","url":null,"abstract":"<div><div>Accurate detection of people in outdoor scenes plays an essential role in improving personal safety and security. However, existing human detection algorithms face significant challenges when visibility is reduced and human appearance is degraded, particularly in hazy weather conditions. To address this problem, we present a novel lightweight model based on the RetinaNet detection architecture. The model incorporates a lightweight backbone feature extractor, a dehazing functionality based on knowledge distillation (KD), and a multi-scale attention mechanism based on the Squeeze and Excitation (SE) principle. KD is achieved from a larger network trained on unhazed clear images, whereas attention is incorporated at low-level and high-level features of the network. Experimental results have shown remarkable performance, outperforming state-of-the-art methods while running at 22 FPS. The combination of high accuracy and real-time capabilities makes our approach a promising solution for effective human detection in challenging weather conditions and suitable for real-time applications.</div></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"130 ","pages":"Article 117223"},"PeriodicalIF":3.4,"publicationDate":"2024-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142652492","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
HOI-V: One-stage human-object interaction detection based on multi-feature fusion in videos
Dongzhou Gu, Kaihua Huang, Shiwei Ma, Jiang Liu
Signal Processing: Image Communication, vol. 130, Article 117224. DOI: 10.1016/j.image.2024.117224. Published 2024-10-29.

Abstract: Effective detection of human-object interaction (HOI) is important for machine understanding of real-world scenes. Image-based HOI detection has been investigated extensively, and recent one-stage methods strike a balance between accuracy and efficiency. However, it is difficult to predict temporally aware interaction actions from static images, which carry limited temporal context. Meanwhile, owing to the lack of early large-scale video HOI datasets and the high computational cost of training spatial-temporal HOI models, recent exploratory studies mostly follow a two-stage paradigm, in which independent object detection and interaction recognition still suffer from computational redundancy and separate optimization. Inspired by the one-stage interaction-point detection framework, this paper proposes a one-stage spatial-temporal HOI detection baseline in which short-term local motion features and long-term temporal context features are obtained by the proposed temporal differential excitation module (TDEM) and a DLA-TSM backbone. Complementary visual features across multiple clips are then extracted by multi-feature fusion and fed into parallel detection branches. Finally, a video dataset containing only actions, with reduced data size (HOI-V), is constructed to motivate further research on end-to-end video HOI detection. Extensive experiments verify the validity of the proposed baseline.
{"title":"High efficiency deep image compression via channel-wise scale adaptive latent representation learning","authors":"Chenhao Wu, Qingbo Wu, King Ngi Ngan, Hongliang Li, Fanman Meng, Linfeng Xu","doi":"10.1016/j.image.2024.117227","DOIUrl":"10.1016/j.image.2024.117227","url":null,"abstract":"<div><div>Recent learning based neural image compression methods have achieved impressive rate–distortion (RD) performance via the sophisticated context entropy model, which performs well in capturing the spatial correlations of latent features. However, due to the dependency on the adjacent or distant decoded features, existing methods require an inefficient serial processing structure, which significantly limits its practicability. Instead of pursuing computationally expensive entropy estimation, we propose to reduce the spatial redundancy via the channel-wise scale adaptive latent representation learning, whose entropy coding is spatially context-free and parallelizable. Specifically, the proposed encoder adaptively determines the scale of the latent features via a learnable binary mask, which is optimized with the RD cost. In this way, lower-scale latent representation will be allocated to the channels with higher spatial redundancy, which consumes fewer bits and vice versa. The downscaled latent features could be well recovered with a lightweight inter-channel upconversion module in the decoder. To compensate for the entropy estimation performance degradation, we further develop an inter-scale hyperprior entropy model, which supports the high efficiency parallel encoding/decoding within each scale of the latent features. Extensive experiments are conducted to illustrate the efficacy of the proposed method. Our method achieves bitrate savings of 18.23%, 19.36%, and 27.04% over HEVC Intra, along with decoding speeds that are 46 times, 48 times, and 51 times faster than the baseline method on the Kodak, Tecnick, and CLIC datasets, respectively.</div></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"130 ","pages":"Article 117227"},"PeriodicalIF":3.4,"publicationDate":"2024-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142579042","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Text in the dark: Extremely low-light text image enhancement
Che-Tsung Lin, Chun Chet Ng, Zhi Qin Tan, Wan Jun Nah, Xinyu Wang, Jie Long Kew, Pohao Hsu, Shang Hong Lai, Chee Seng Chan, Christopher Zach
Signal Processing: Image Communication, vol. 130, Article 117222. DOI: 10.1016/j.image.2024.117222. Published 2024-10-28.

Abstract: Extremely low-light text images pose significant challenges for scene text detection. Existing methods enhance these images with general low-light enhancement techniques before text detection, but they neglect the low-level features that are essential for downstream scene text tasks. Research is further limited by the scarcity of extremely low-light text datasets. To address these limitations, we propose a novel text-aware extremely low-light image enhancement framework. Our approach first applies a Text-Aware Copy-Paste (Text-CP) augmentation as a preprocessing step, followed by a dual-encoder-decoder architecture enhanced with edge-aware attention modules. We also introduce text detection and edge reconstruction losses that train the model to generate images with higher text visibility. Additionally, we propose a Supervised Deep Curve Estimation (Supervised-DCE) model for synthesizing extremely low-light images, which allows training on publicly available scene text datasets such as IC15. To further advance this domain, we annotated the text in the extremely low-light See In the Dark (SID) and ordinary LOw-Light (LOL) datasets. The proposed framework is rigorously tested against traditional and deep learning-based methods on the newly labeled SID-Sony-Text, SID-Fuji-Text, LOL-Text, and synthetic extremely low-light IC15 datasets. Extensive experiments demonstrate notable improvements in both image enhancement and scene text tasks, showing the model's efficacy for text detection under extremely low-light conditions. Code and datasets will be released publicly at https://github.com/chunchet-ng/Text-in-the-Dark.
{"title":"Double supervision for scene text detection and recognition based on BMINet","authors":"Hanyang Wan, Ruoyun Liu, Li Yu","doi":"10.1016/j.image.2024.117226","DOIUrl":"10.1016/j.image.2024.117226","url":null,"abstract":"<div><div>Scene text detection and recognition currently stand as prominent research areas in computer vision, boasting a broad spectrum of potential applications in fields such as intelligent driving and automated production. Existing mainstream methodologies, however, suffer from notable deficiencies including incomplete text region detection, excessive background noise, and a neglect of simultaneous global information and contextual dependencies. In this study, we introduce BMINet, an innovative scene text detection approach based on boundary fitting, paired with a double-supervised scene text recognition method that incorporates text region correction. The BMINet framework is primarily structured around a boundary fitting module and a multi-scale fusion module. The boundary fitting module samples a specific number of control points equidistantly along the predicted boundary and adjusts their positions to better align the detection box with the text shape. The multi-scale fusion module integrates information from multi-scale feature maps to expand the network’s receptive field. The double-supervised scene text recognition method, incorporating text region correction, integrates the image processing modules for rotating rectangle boxes and binary image segmentation. Additionally, it introduces a correction network to refine text region boundaries. This method integrates recognition techniques based on CTC loss and attention mechanisms, emphasizing texture details and contextual dependencies in text images to enhance network performance through dual supervision. Extensive ablation and comparison experiments confirm the efficacy of the two-stage model in achieving robust detection and recognition outcomes, achieving a recognition accuracy of 80.6% on the Total-Text dataset.</div></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"130 ","pages":"Article 117226"},"PeriodicalIF":3.4,"publicationDate":"2024-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142572257","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A new two-stage low-light enhancement network with progressive attention fusion strategy
Hegui Zhu, Luyang Wang, Zhan Gao, Yuelin Liu, Qian Zhao
Signal Processing: Image Communication, vol. 130, Article 117229. DOI: 10.1016/j.image.2024.117229. Published 2024-10-26.

Abstract: Low-light image enhancement is a challenging problem in computer vision, with applications in visual surveillance, driving behavior analysis, and medical imaging. Low-light images suffer from numerous degradations, such as accumulated noise, artifacts, and color distortion, so recovering clear images with high visual quality is an important problem that can in turn improve the performance of high-level vision tasks. In this study, we propose a new two-stage low-light enhancement network with a progressive attention fusion strategy. Its two hallmarks are global feature fusion (GFF) and local detail restoration (LDR), which enrich the global content of the image and restore local details. Experimental results on the LOL dataset show that the proposed model achieves good enhancement. On benchmark datasets without reference images, the proposed model also obtains better NIQE scores, outperforming most existing state-of-the-art methods in both quantitative and qualitative evaluations, which verifies the effectiveness and superiority of the proposed method.
Infrared and visible image fusion based on hybrid multi-scale decomposition and adaptive contrast enhancement
Yueying Luo, Kangjian He, Dan Xu, Hongzhen Shi, Wenxia Yin
Signal Processing: Image Communication, vol. 130, Article 117228. DOI: 10.1016/j.image.2024.117228. Published 2024-10-22.

Abstract: Effectively fusing infrared and visible images enhances the visibility of infrared target information while preserving visual detail. Adequately balancing the brightness and contrast of the fused image has posed a significant challenge, and preserving detailed information in fused images has also been problematic. To address these issues, this paper proposes a fusion algorithm based on multi-scale decomposition and adaptive contrast enhancement. First, we present a hybrid multi-scale decomposition method that comprehensively extracts useful information from the source images. We then develop an adaptive base-layer optimization approach to regulate the brightness and contrast of the fused image. Finally, we design a saliency-based weight mapping rule to integrate the small-scale layers, preserving the edge structure in the fusion result. Both qualitative and quantitative experiments confirm the superiority of the proposed method over 11 state-of-the-art image fusion methods: it preserves more texture and achieves higher contrast, which is advantageous for monitoring tasks.