{"title":"IBFusion: An Infrared and Visible Image Fusion Method Based on Infrared Target Mask and Bimodal Feature Extraction Strategy","authors":"Yang Bai;Meijing Gao;Shiyu Li;Ping Wang;Ning Guan;Haozheng Yin;Yonghao Yan","doi":"10.1109/TMM.2024.3410113","DOIUrl":"10.1109/TMM.2024.3410113","url":null,"abstract":"The fusion of infrared (IR) and visible (VIS) images aims to capture complementary information from diverse sensors, resulting in a fused image that enhances the overall human perception of the scene. However, existing fusion methods face challenges preserving diverse feature information, leading to cross-modal interference, feature degradation, and detail loss in the fused image. To solve the above problems, this paper proposes an image fusion method based on the infrared target mask and bimodal feature extraction strategy, termed IBFusion. Firstly, we define an infrared target mask, employing it to retain crucial information from the source images in the fused result. Additionally, we devise a mixed loss function, encompassing content loss, gradient loss, and structure loss, to ensure the coherence of the fused image with the IR and VIS images. Then, the mask is introduced into the mixed loss function to guide feature extraction and unsupervised network optimization. Secondly, we create a bimodal feature extraction strategy and construct a Dual-channel Multi-scale Feature Extraction Module (DMFEM) to extract thermal target information from the IR image and background texture information from the VIS image. This module retains the complementary information of the two source images. Finally, we use the Feature Fusion Module (FFM) to fuse the features effectively, generating the fusion result. Experiments on three public datasets demonstrate that the fusion results of our method have prominent infrared targets and clear texture details. Both subjective and objective assessments are better than the other twelve advanced algorithms, proving our method's effectiveness.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"10610-10622"},"PeriodicalIF":8.4,"publicationDate":"2024-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141968861","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learning Semantic Polymorphic Mapping for Text-Based Person Retrieval","authors":"Jiayi Li;Min Jiang;Jun Kong;Xuefeng Tao;Xi Luo","doi":"10.1109/TMM.2024.3410129","DOIUrl":"10.1109/TMM.2024.3410129","url":null,"abstract":"Text-Based Person Retrieval (TBPR) aims to identify a particular individual within an extensive image gallery using text as the query. The principal challenge inherent in the TBPR task revolves around how to map cross-modal information to a potential common space and learn a generic representation. Previous methods have primarily focused on aligning singular text-image pairs, disregarding the inherent polymorphism within both images and natural language expressions for the same individual. Moreover, these methods have also ignored the impact of semantic polymorphism-based intra-modal data distribution on cross-modal matching. Recent methods employ cross-modal implicit information reconstruction to enhance inter-modal connections. However, the process of information reconstruction remains ambiguous. To address these issues, we propose the Learning Semantic Polymorphic Mapping (LSPM) framework, facilitated by the prowess of pre-trained cross-modal models. Firstly, to learn cross-modal information representations with better robustness, we design the Inter-modal Information Aggregation (Inter-IA) module to achieve cross-modal polymorphic mapping, fortifying the foundation of our information representations. Secondly, to attain a more concentrated intra-modal information representation based on semantic polymorphism, we design Intra-modal Information Aggregation (Intra-IA) module to further constrain the embeddings. Thirdly, to further explore the potential of cross-modal interactions within the model, we design the implicit reasoning module, Masked Information Guided Reconstruction (MIGR), with constraint guidance to elevate overall performance. Extensive experiments on both CUHK-PEDES and ICFG-PEDES datasets show that we achieve state-of-the-art results on Rank-1, mAP and mINP compared to existing methods.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"10678-10691"},"PeriodicalIF":8.4,"publicationDate":"2024-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141968875","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving Pre-Trained Model-Based Speech Emotion Recognition From a Low-Level Speech Feature Perspective","authors":"Ke Liu;Jiwei Wei;Jie Zou;Peng Wang;Yang Yang;Heng Tao Shen","doi":"10.1109/TMM.2024.3410133","DOIUrl":"10.1109/TMM.2024.3410133","url":null,"abstract":"Multi-view speech emotion recognition (SER) based on the pre-trained model has gained attention in the last two years, which shows great potential in improving the model performance in speaker-independent scenarios. However, the existing work either relies on various fine-tuning methods or uses excessive feature views with complex fusion strategies, causing the increase of complexity with limited performance benefit. In this paper, we improve multi-view SER based on the pre-trained model from the perspective of a low-level speech feature. Specifically, we forgo fine-tuning the pre-trained model and instead focus on learning effective features hidden in the low-level speech feature mel-scale frequency cepstral coefficient (MFCC). We propose a \u0000<bold>t</b>\u0000wo-\u0000<bold>s</b>\u0000tream \u0000<bold>p</b>\u0000ooling \u0000<bold>c</b>\u0000hannel \u0000<bold>a</b>\u0000ttention (\u0000<bold>TsPCA</b>\u0000) module to discriminatively weight the channel dimensions of the features derived from MFCC. This module enables inter-channel interaction and learning of emotion sequence information across channels. Furthermore, we design a simple but effective feature view fusion strategy to learn robust representations. In the comparison experiments, our method achieves the WA and UA of 73.97%/74.69% and 74.61%/75.66% on the IEMOCAP dataset, 97.21% and 97.11% on the Emo-DB dataset, 77.08% and 77.34% on the RAVDESS dataset, and 74.38% and 71.43% on the SAVEE dataset. Extensive experiments on the four datasets demonstrate that our method consistently surpasses existing methods and achieves a new State-of-the-Art result.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"10623-10636"},"PeriodicalIF":8.4,"publicationDate":"2024-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141968856","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Triple Consistency for Transparent Cheating Problem in Light Field Depth Estimation","authors":"Zhenglong Cui;Da Yang;Hao Sheng;Sizhe Wang;Rongshan Chen;Ruixuan Cong;Wei Ke","doi":"10.1109/TMM.2024.3410139","DOIUrl":"10.1109/TMM.2024.3410139","url":null,"abstract":"Depth estimation extracting scenes' structural information is a key step in various light field(LF) applications. However, most existing depth estimation methods are based on the Lambertian assumption, which limits the application in non-Lambertian scenes. In this paper, we discover a unique transparent cheating problem for non-Lambertian scenes which can effectively spoof depth estimation algorithms based on photo consistency. It arises because the spatial consistency and the linear structure superimposed on the epipolar plane image form new spurious lines. Therefore, we propose centrifugal consistency and centripetal consistency for separating the depth information of multi-layer scenes and correcting the error due to the transparent cheating problem, respectively. By comparing the distributional characteristics and the number of minimal values of photo consistency and centrifugal consistency, non-Lambertian regions can be efficiently identified and initial depth estimates obtained. Then centripetal consistency is exploited to reject the projection from different layers and to address transparent cheating. By assigning decreasing weights radiating outward from the central view, pixels with a concentration of colors close to the central viewpoint are considered more significant. The problem of underestimating the depth of background caused by transparent cheating is effectively solved and corrected. Experiments on synthetic and real-world data show that our method can produce high-quality depth estimation under the transparency and the reflectivity of 90% to 20%. The proposed triple-consistency-based algorithm outperforms state-of-the-art LF depth estimation methods in terms of accuracy and robustness.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"10651-10664"},"PeriodicalIF":8.4,"publicationDate":"2024-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141968852","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Width-Adaptive CNN: Fast CU Partition Prediction for VVC Screen Content Coding","authors":"Chao Jiao;Huanqiang Zeng;Jing Chen;Chih-Hsien Hsia;Tianlei Wang;Kai-Kuang Ma","doi":"10.1109/TMM.2024.3410116","DOIUrl":"10.1109/TMM.2024.3410116","url":null,"abstract":"Screen content coding (SCC) in Versatile Video Coding (VVC) improves the coding efficiency of screen content videos (SCVs) significantly but results in high computational complexity due to the quad-tree plus multi-type tree (QTMT) structure of the coding unit (CU) partitioning. Therefore, we make the first attempt to reduce the encoding complexity from the perspective of CU partitioning for SCC in VVC. To this end, a fast CU partition prediction method is technically developed for VVC-SCC. First, to solve the problem of lacking sufficient SCC training data, SCVs are collected to establish a database containing CUs of various sizes and corresponding partition labels. Second, to determine the partition decision in advance, a novel WA-CNN model is proposed, which is capable of predicting two large CUs for VVC-SCC by adjusting the feature channels based on the size of input CU blocks. Finally, considering the imbalanced proportion of diverse partition decisions, a loss function with the weight that equalizes the contribution of imbalanced data is formulated to train the proposed WA-CNN model. Experimental results show that the proposed model reduces the SCC intra-encoding time by 35.65%\u0000<inline-formula><tex-math>${sim }$</tex-math></inline-formula>\u000038.31% with an average of 1.84%\u0000<inline-formula><tex-math>${sim }$</tex-math></inline-formula>\u00002.42% BDBR increase.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"9372-9382"},"PeriodicalIF":8.4,"publicationDate":"2024-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141968853","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CDKM: Common and Distinct Knowledge Mining Network With Content Interaction for Dense Captioning","authors":"Hongyu Deng;Yushan Xie;Qi Wang;Jianjun Wang;Weijian Ruan;Wu Liu;Yong-Jin Liu","doi":"10.1109/TMM.2024.3407695","DOIUrl":"10.1109/TMM.2024.3407695","url":null,"abstract":"The dense captioning task aims at detecting multiple salient regions of an image and describing them separately in natural language. Although significant advancements in the field of dense captioning have been made, there are still some limitations to existing methods in recent years. On the one hand, most dense captioning methods lack strong target detection capabilities and struggle to cover all relevant content when dealing with target-intensive images. On the other hand, current transformer-based methods are powerful but neglect the acquisition and utilization of contextual information, hindering the visual understanding of local areas. To address these issues, we propose a common and distinct knowledge-mining network with content interaction for the task of dense captioning. Our network has a knowledge mining mechanism that improves the detection of salient targets by capturing common and distinct knowledge from multi-scale features. We further propose a content interaction module that combines region features into a unique context based on their correlation. Our experiments on various benchmarks have shown that the proposed method outperforms the current state-of-the-art methods.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"10462-10473"},"PeriodicalIF":8.4,"publicationDate":"2024-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141968858","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Estimating the Semantics via Sector Embedding for Image-Text Retrieval","authors":"Zheng Wang;Zhenwei Gao;Mengqun Han;Yang Yang;Heng Tao Shen","doi":"10.1109/TMM.2024.3407664","DOIUrl":"10.1109/TMM.2024.3407664","url":null,"abstract":"Based on deterministic single-point embedding, most extant image-text retrieval methods only focus on the match of ground truth while suffering from one-to-many correspondence, where besides annotated positives, many similar instances of another modality should be retrieved by a given query. Recent solutions of probabilistic embedding and rectangle mapping still encounter some drawbacks, albeit their promising effectiveness at multiple matches. Meanwhile, the exploration of one-to-many correspondence is still insufficient. Therefore, this paper proposes a novel geometric representation to \u0000<underline>E</u>\u0000stimate the \u0000<underline>S</u>\u0000emantics of heterogeneous data via \u0000<underline>S</u>\u0000ector \u0000<underline>E</u>\u0000mbedding (dubbed \u0000<bold>ESSE</b>\u0000). Specifically, a given image/text can be projected as a sector, where its symmetric axis represents mean semantics and the aperture estimates uncertainty. Further, a sector matching loss is introduced to better handle the multiplicity by considering the sine of included angles as distance calculation, which encourages candidates to be contained by the apertures of a query sector. The experimental results on three widely used benchmarks CUB, Flickr30 K and MS-COCO reveal that sector embedding can achieve competitive performance on multiple matches and also improve the traditional ground-truth matching of the baselines. Additionally, we also verify the generalization to video-text retrieval on two extensively used datasets of MSRVTT and MSVD, and to text-based person retrieval on CUHK-PEDES. This superiority and effectiveness can also demonstrate that the bounded property of the aperture can better estimate semantic uncertainty when compared to prior remedies.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"10342-10353"},"PeriodicalIF":8.4,"publicationDate":"2024-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141968859","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Logit Variated Product Quantization Based on Parts Interaction and Metric Learning With Knowledge Distillation for Fine-Grained Image Retrieval","authors":"Lei Ma;Xin Luo;Hanyu Hong;Fanman Meng;Qingbo Wu","doi":"10.1109/TMM.2024.3407661","DOIUrl":"10.1109/TMM.2024.3407661","url":null,"abstract":"Image retrieval with fine-grained categories is an extremely challenging task due to the high intraclass variance and low interclass variance. Most previous works have focused on localizing discriminative image regions in isolation, but have rarely exploited correlations across the different discriminative regions to alleviate intraclass differences. In addition, the intraclass compactness of embedding features is ensured by extra regularization terms that only exist during the training phase, which appear to generalize less well in the inference phase. Finally, the information granularity of the distance measure should distinguish subtle visual differences and the correlation between the embedding features and the quantized features should be maximized sufficiently. To address the above issues, we propose a logit variated product quantization method based on part interaction and metric learning with knowledge distillation for fine-grained image retrieval. Specifically, we introduce a causal context module into the deep navigator to generate discriminative regions and utilize a channelwise cross-part fusion transformer to model the part correlations while alleviating intraclass differences. Subsequently, we design a logit variation module based on a weighted sum scheme to further reduce the intraclass variance of the embedding features directly and enhance the learning power of the quantization model. Finally, we propose a novel product quantization loss based on metric learning and knowledge distillation to enhance the correlation between the embedding features and the quantized features and allow the quantization features to learn more knowledge from the embedding features. The experimental results on several fine-grained datasets demonstrate that the proposed method is superior to state-of-the-art fine-grained image retrieval methods.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"10406-10419"},"PeriodicalIF":8.4,"publicationDate":"2024-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141968857","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PhotoStyle60: A Photographic Style Dataset for Photo Authorship Attribution and Photographic Style Transfer","authors":"Marco Cotogni;Marco Arazzi;Claudio Cusano","doi":"10.1109/TMM.2024.3408683","DOIUrl":"10.1109/TMM.2024.3408683","url":null,"abstract":"Photography, like painting, allows artists to express themselves through their unique style. In digital photography, this is achieved not only with the choice of the subject and the composition but also by means of post-processing operations. The automatic identification of a photographer from the style of a photo is a challenging task, for many reasons, including the lack of suitable datasets including photos taken by a diverse panel of photographers with a clear photographic style. In this paper we present PhotoStyle60, a new dataset including 5708 photographs from 60 professional and semi-professional photographers. Additionally, we selected a reduced version of the dataset, called PhotoStyle10 containing images from 10 clearly distinguishable experts. We designed the dataset to address two tasks in particular: photo authorship attribution and photographic style transfer. In the former, we conducted an extensive analysis of the dataset through several classification experiments. In the latter, we explored the potential of our dataset to transfer a photographer's style to images from the Five-K dataset. Additionally, we propose also a simple but effective multi-image style transfer method that uses multiple samples of the target style. A user study demonstrated that such a method was able to reach accurate results, preserving the semantic content of the source photograph with very few artifacts.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"10573-10584"},"PeriodicalIF":8.4,"publicationDate":"2024-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141968865","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}