Multimedia Systems最新文献_第4页

PS-YOLO: a small object detector based on efficient convolution and multi-scale feature fusion PS-YOLO：基于高效卷积和多尺度特征融合的小物体检测器

IF 3.9 3区计算机科学

Multimedia Systems Pub Date : 2024-08-13 DOI: 10.1007/s00530-024-01447-0

Shifeng Peng, Xin Fan, Shengwei Tian, Long Yu

引用次数: 0

Multimodal recommender system based on multi-channel counterfactual learning networks 基于多通道反事实学习网络的多模式推荐系统

IF 3.9 3区计算机科学

Multimedia Systems Pub Date : 2024-08-13 DOI: 10.1007/s00530-024-01448-z

Hong Fang, Leiyuxin Sha, Jindong Liang

{"title":"Multimodal recommender system based on multi-channel counterfactual learning networks","authors":"Hong Fang, Leiyuxin Sha, Jindong Liang","doi":"10.1007/s00530-024-01448-z","DOIUrl":"https://doi.org/10.1007/s00530-024-01448-z","url":null,"abstract":"Most multimodal recommender systems utilize multimodal content of user-interacted items as supplemental information to capture user preferences based on historical interactions without considering user-uninteracted items. In contrast, multimodal recommender systems based on causal inference counterfactual learning utilize the causal difference between the multimodal content of user-interacted and user-uninteracted items to purify the content related to user preferences. However, existing methods adopt a unified multimodal channel, which treats each modality equally, resulting in the inability to distinguish users’ tastes for different modalities. Therefore, the differences in users’ attention and perception of different modalities' content cannot be reflected. To cope with the above issue, this paper proposes a novel recommender system based on multi-channel counterfactual learning (MCCL) networks to capture user fine-grained preferences on different modalities. First, two independent channels are established based on the corresponding features for the content of image and text modalities for modality-specific feature extraction. Then, leveraging the counterfactual theory of causal inference, features in each channel unrelated to user preferences are eliminated using the features of the user-uninteracted items. Features related to user preferences are enhanced and multimodal user preferences are modeled at the content level, which portrays the users' taste for the different modalities of items. Finally, semantic entities are extracted to model semantic-level multimodal user preferences, which are fused with historical user interaction information and content-level user preferences for recommendation. Extensive experiments on three different datasets show that our results improve up to 4.17% on NDCG compared to the optimal model.","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"16 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142210686","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Exploring multi-level transformers with feature frame padding network for 3D human pose estimation 利用特征帧填充网络探索用于三维人体姿态估计的多级变换器

IF 3.9 3区计算机科学

Multimedia Systems Pub Date : 2024-08-13 DOI: 10.1007/s00530-024-01451-4

Sathiyamoorthi Arthanari, Jae Hoon Jeong, Young Hoon Joo

{"title":"Exploring multi-level transformers with feature frame padding network for 3D human pose estimation","authors":"Sathiyamoorthi Arthanari, Jae Hoon Jeong, Young Hoon Joo","doi":"10.1007/s00530-024-01451-4","DOIUrl":"https://doi.org/10.1007/s00530-024-01451-4","url":null,"abstract":"Recently, transformer-based architecture achieved remarkable performance in 2D to 3D lifting pose estimation. Despite advancements in transformer-based architecture they still struggle to handle depth ambiguity, limited temporal information, lacking edge frame details, and short-term temporal features. Consequently, transformer architecture encounters challenges in preciously estimating the 3D human position. To address these problems, we proposed Multi-Level Transformers with a Feature Frame Padding Network (MLTFFPN). To do this, we first propose the frame-padding network, which allows the network to capture longer temporal dependencies and effectively address the lacking edge frame information, enabling a better understanding of the sequential nature of human motion and improving the accuracy of pose estimation. Furthermore, we employ a multi-level transformer to extract temporal information from 3D human poses, which aims to improve the short-range temporal dependencies among keypoints of the human pose skeleton. Specifically, we introduce the Refined Temporal Constriction and Proliferation Transformer (RTCPT), which incorporates spatio-temporal encoders and a Temporal Constriction and Proliferation (TCP) structure to reveal multi-scale attention information and effectively addresses the depth ambiguity problem. Moreover, we incorporate the Feature Aggregation Refinement (FAR) module into the TCP block in a cross-layer manner, which facilitates semantic representation through the persistent interaction of queries, keys, and values. We extensively evaluate the efficiency of our method through experiments on two well-known benchmark datasets: Human3.6M and MPI-INF-3DHP.","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"11 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142210688","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Propagating prior information with transformer for robust visual object tracking 利用变换器传播先验信息，实现稳健的视觉物体跟踪

IF 3.9 3区计算机科学

Multimedia Systems Pub Date : 2024-08-13 DOI: 10.1007/s00530-024-01423-8

Yue Wu, Chengtao Cai, Chai Kiat Yeo

{"title":"Propagating prior information with transformer for robust visual object tracking","authors":"Yue Wu, Chengtao Cai, Chai Kiat Yeo","doi":"10.1007/s00530-024-01423-8","DOIUrl":"https://doi.org/10.1007/s00530-024-01423-8","url":null,"abstract":"In recent years, the domain of visual object tracking has witnessed considerable advancements with the advent of deep learning methodologies. Siamese-based trackers have been pivotal, establishing a new architecture with a weight-shared backbone. With the inclusion of the transformer, attention mechanism has been exploited to enhance the feature discriminability across successive frames. However, the limited adaptability of many existing trackers to the different tracking scenarios has led to inaccurate target localization. To effectively solve this issue, in this paper, we have integrated a siamese network with the transformer, where the former utilizes ResNet50 as the backbone network to extract the target features, while the latter consists of an encoder and a decoder, where the encoder can effectively utilize global contextual information to obtain the discriminative features. Simultaneously, we employ the decoder to propagate prior information related to the target, which enables the tracker to successfully locate the target in a variety of environments, enhancing the stability and robustness of the tracker. Extensive experiments on four major public datasets, OTB100, UAV123, GOT10k and LaSOText demonstrate the effectiveness of the proposed method. Its performance surpasses many state-of-the-art trackers. Additionally, the proposed tracker can achieve a tracking speed of 60 fps, meeting the requirements for real-time tracking.","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"8 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142226769","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Multi-level pyramid fusion for efficient stereo matching 多级金字塔融合实现高效立体匹配

IF 3.9 3区计算机科学

Multimedia Systems Pub Date : 2024-08-12 DOI: 10.1007/s00530-024-01419-4

Jiaqi Zhu, Bin Li, Xinhua Zhao

引用次数: 0

Underwater image enhancement based on weighted guided filter image fusion 基于加权导向滤波图像融合的水下图像增强技术

IF 3.9 3区计算机科学

Multimedia Systems Pub Date : 2024-08-12 DOI: 10.1007/s00530-024-01432-7

Dan Xiang, Huihua Wang, Zebin Zhou, Hao Zhao, Pan Gao, Jinwen Zhang, Chun Shan

引用次数: 0

Discrete codebook collaborating with transformer for thangka image inpainting 用于唐卡图像绘制的离散编码本与变换器协作

IF 3.9 3区计算机科学

Multimedia Systems Pub Date : 2024-08-07 DOI: 10.1007/s00530-024-01439-0

Jinxian Bai, Yao Fan, Zhiwei Zhao

{"title":"Discrete codebook collaborating with transformer for thangka image inpainting","authors":"Jinxian Bai, Yao Fan, Zhiwei Zhao","doi":"10.1007/s00530-024-01439-0","DOIUrl":"https://doi.org/10.1007/s00530-024-01439-0","url":null,"abstract":"Thangka, as a precious heritage of painting art, holds irreplaceable research value due to its richness in Tibetan history, religious beliefs, and folk culture. However, it is susceptible to partial damage and form distortion due to natural erosion or inadequate conservation measures. Given the complexity of textures and rich semantics in thangka images, existing image inpainting methods struggle to recover their original artistic style and intricate details. In this paper, we propose a novel approach combining discrete codebook learning with a transformer for image inpainting, tailored specifically for thangka images. In the codebook learning stage, we design an improved network framework based on vector quantization (VQ) codebooks to discretely encode intermediate features of input images, yielding a context-rich discrete codebook. The second phase introduces a parallel transformer module based on a cross-shaped window, which efficiently predicts the index combinations for missing regions under limited computational cost. Furthermore, we devise a multi-scale feature guidance module that progressively fuses features from intact areas with textural features from the codebook, thereby enhancing the preservation of local details in non-damaged regions. We validate the efficacy of our method through qualitative and quantitative experiments on datasets including Celeba-HQ, Places2, and a custom thangka dataset. Experimental results demonstrate that compared to previous methods, our approach successfully reconstructs images with more complete structural information and clearer textural details.","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"167 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141943392","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

A deep low-rank semantic factorization method for micro-video multi-label classification 用于微视频多标签分类的深度低阶语义因式分解方法

IF 3.9 3区计算机科学

Multimedia Systems Pub Date : 2024-08-05 DOI: 10.1007/s00530-024-01428-3

Fugui Fan, Yuting Su, Yun Liu, Peiguang Jing, Kaihua Qu

{"title":"A deep low-rank semantic factorization method for micro-video multi-label classification","authors":"Fugui Fan, Yuting Su, Yun Liu, Peiguang Jing, Kaihua Qu","doi":"10.1007/s00530-024-01428-3","DOIUrl":"https://doi.org/10.1007/s00530-024-01428-3","url":null,"abstract":"As a prominent manifestation of user-generated content (UGC), micro-video has emerged as a pivotal medium for individuals to document and disseminate their daily experiences. In particular, micro-videos generally encompass abundant content elements that are abstractly described by a group of annotated labels. However, previous methods primarily focus on the discriminability of explicit labels while neglecting corresponding implicit semantics, which are particularly relevant for diverse micro-video characteristics. To address this problem, we develop a deep low-rank semantic factorization (DLRSF) method to perform multi-label classification of micro-videos. Specifically, we introduce a semantic embedding matrix to bridge explicit labels and implicit semantics, and further present a low-rank-regularized semantic learning module to explore the intrinsic lowest-rank semantic attributes. A correlation-driven deep semantic interaction module is designed within a deep factorization framework to enhance interactions among instance features, explicit labels and semantic embeddings. Additionally, inverse covariance analysis is employed to unveil underlying correlation structures between labels and features, thereby making the semantic embeddings more discriminative and improving model generalization ability simultaneously. Extensive experimental results on three available datasets have showcased the superiority of our DLRSF compared with the state-of-the-art methods.","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"72 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141943394","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Self-supervised learning for fine-grained monocular 3D face reconstruction in the wild 野外细粒度单目三维人脸重建的自我监督学习

IF 3.9 3区计算机科学

Multimedia Systems Pub Date : 2024-08-05 DOI: 10.1007/s00530-024-01436-3

Dongjin Huang, Yongsheng Shi, Jinhua Liu, Wen Tang

{"title":"Self-supervised learning for fine-grained monocular 3D face reconstruction in the wild","authors":"Dongjin Huang, Yongsheng Shi, Jinhua Liu, Wen Tang","doi":"10.1007/s00530-024-01436-3","DOIUrl":"https://doi.org/10.1007/s00530-024-01436-3","url":null,"abstract":"Reconstructing 3D face from monocular images is a challenging computer vision task, due to the limitations of traditional 3DMM (3D Morphable Model) and the lack of high-fidelity 3D facial scanning data. To solve this issue, we propose a novel coarse-to-fine self-supervised learning framework for reconstructing fine-grained 3D faces from monocular images in the wild. In the coarse stage, face parameters extracted from a single image are used to reconstruct a coarse 3D face through a 3DMM. In the refinement stage, we design a wavelet transform perception model to extract facial details in different frequency domains from an input image. Furthermore, we propose a depth displacement module based on the wavelet transform perception model to generate a refined displacement map from the unwrapped UV textures of the input image and rendered coarse face, which can be used to synthesize detailed 3D face geometry. Moreover, we propose a novel albedo map module based on the wavelet transform perception model to capture high-frequency texture information and generate a detailed albedo map consistent with face illumination. The detailed face geometry and albedo map are used to reconstruct a fine-grained 3D face without any labeled data. We have conducted extensive experiments that demonstrate the superiority of our method over existing state-of-the-art approaches for 3D face reconstruction on four public datasets including CelebA, LS3D, LFW, and NoW benchmark. The experimental results indicate that our method achieved higher accuracy and robustness, particularly of under the challenging conditions such as occlusion, large poses, and varying illuminations.","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"23 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141943393","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Modeling the non-uniform retinal perception for viewport-dependent streaming of immersive video 建立非均匀视网膜感知模型，用于视口相关的沉浸式视频流

IF 3.9 3区计算机科学

Multimedia Systems Pub Date : 2024-08-05 DOI: 10.1007/s00530-024-01434-5

Peiyao Guo, Wenjing Su, Xu Zhang, Hao Chen, Zhan Ma

{"title":"Modeling the non-uniform retinal perception for viewport-dependent streaming of immersive video","authors":"Peiyao Guo, Wenjing Su, Xu Zhang, Hao Chen, Zhan Ma","doi":"10.1007/s00530-024-01434-5","DOIUrl":"https://doi.org/10.1007/s00530-024-01434-5","url":null,"abstract":"Viewport-dependent streaming (VDS) of immersive video typically devises the attentive viewport (or FoV - Field of View) with high-quality compression but low-quality compressed content outside of it to reduce bandwidth. It, however, assumes uniform compression within the viewport, completely neglecting visual redundancy caused by non-uniform perception in central and peripheral vision areas when consuming the content using a head-mounted display (HMD). Our work models the unequal retinal perception within the instantaneous viewport and explores using it in the VDS system for non-uniform viewport compression to further save the data volume. To this end, we assess the just-noticeable-distortion moment of the rendered viewport frame by carefully adapting image quality-related compression factors like quantization stepsize q and/or spatial resolution s zone-by-zone to explicitly derive the imperceptible quality perception threshold with respect to the eccentric angle. Independent validations show that the visual perception of the immersive images with non-uniform FoV quality guided by our model is indistinguishable from that of images with default uniform FoV quality. Our model can be flexibly integrated with the tiling strategy in popular video codecs to facilitate non-uniform viewport compression in practical VDS systems for significant bandwidth reduction (e.g., about 40% reported in our experiments) at similar visual quality.\u0000","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"61 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141943395","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0