IEEE Transactions on Multimedia: Latest Articles

VRTNet: Vector Rectifier Transformer for Two-View Correspondence Learning
IF 8.4, Tier 1, Computer Science
IEEE Transactions on Multimedia, Pub Date: 2024-12-23, DOI: 10.1109/TMM.2024.3521696
Meng Yang;Jun Chen;Xin Tian;Longsheng Wei;Jiayi Ma
{"title":"VRTNet: Vector Rectifier Transformer for Two-View Correspondence Learning","authors":"Meng Yang;Jun Chen;Xin Tian;Longsheng Wei;Jiayi Ma","doi":"10.1109/TMM.2024.3521696","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521696","url":null,"abstract":"Finding reliable correspondences in two-view image and recovering the camera poses are key problems in photogrammetry and image signal processing. Multilayer perceptron (MLP) has a wide application in two-view correspondence learning for which is good at learning disordered sparse correspondences, but it is susceptible to the dominant outliers and requires additional functional blocks to capture context information. CNN can naturally extract local context information, but it cannot handle disordered data and extract global context and channel information. In order to overcome the shortcomings of MLP and CNN, we design a correspondence learning network based on Transformer, named Vector Rectifier Transformer (VRTNet). Transformer is an encoder-decoder structure which can handle disordered sparse correspondences and output sequences of arbitrary length. Therefore, we design two sub-Transformers in VRTNet to achieve the mutual conversion between disordered and ordered correspondences. The self-attention and cross-attention mechanisms in them allow VRTNet to focus on the global context relations of all correspondences. To capture local context and channel information, we propose rectifier network (including CNN and channel attention block) as the backbone of VRTNet, which avoids the complex design of additional blocks. Rectifier network can correct the errors of ordered correspondences to obtain rectified correspondences. Finally, outliers are removed by comparing original and rectified correspondences. VRTNet performs better than the state-of-the-art methods in the tasks of relative pose estimation, outlier removal and image registration.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"515-530"},"PeriodicalIF":8.4,"publicationDate":"2024-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143465737","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
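The rectifier network above couples a CNN with a channel attention block to capture local context and channel information. As a rough, generic illustration of the channel-attention idea only (not the authors' implementation; the module name, channel width, and reduction ratio below are assumptions), a squeeze-and-excitation-style block over per-correspondence features could look like this:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation-style channel attention over per-correspondence features.

    Input:  (B, C, N) -- B batches, C channels, N correspondences (each correspondence
            is originally a 4-vector (x1, y1, x2, y2) embedded into C channels).
    Output: same shape, with channels re-weighted by learned global statistics.
    """
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.fc(x.mean(dim=-1))           # squeeze over correspondences: (B, C)
        return x * w.unsqueeze(-1)            # excite: re-weight each channel

if __name__ == "__main__":
    feats = torch.randn(2, 64, 512)           # 512 putative correspondences, 64 channels
    print(ChannelAttention(64)(feats).shape)  # torch.Size([2, 64, 512])
```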
Neuromorphic Vision-Based Motion Segmentation With Graph Transformer Neural Network
IF 8.4, Tier 1, Computer Science
IEEE Transactions on Multimedia, Pub Date: 2024-12-23, DOI: 10.1109/TMM.2024.3521662
Yusra Alkendi;Rana Azzam;Sajid Javed;Lakmal Seneviratne;Yahya Zweiri
{"title":"Neuromorphic Vision-Based Motion Segmentation With Graph Transformer Neural Network","authors":"Yusra Alkendi;Rana Azzam;Sajid Javed;Lakmal Seneviratne;Yahya Zweiri","doi":"10.1109/TMM.2024.3521662","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521662","url":null,"abstract":"Moving object segmentation is critical to interpret scene dynamics for robotic navigation systems in challenging environments. Neuromorphic vision sensors are tailored for motion perception due to their asynchronous nature, high temporal resolution, and reduced power consumption. However, their unconventional output requires novel perception paradigms to leverage their spatially sparse and temporally dense nature. In this work, we propose a novel event-based motion segmentation algorithm using a Graph Transformer Neural Network, dubbed GTNN. Our proposed algorithm processes event streams as 3D graphs by a series of nonlinear transformations to unveil local and global spatiotemporal correlations between events. Based on these correlations, events belonging to moving objects are segmented from the background without prior knowledge of the dynamic scene geometry. The algorithm is trained on publicly available datasets including MOD, EV-IMO, and EV-IMO2 using the proposed training scheme to facilitate efficient training on extensive datasets. Moreover, we introduce the Dynamic Object Mask-aware Event Labeling (DOMEL) approach for generating approximate ground-truth labels for event-based motion segmentation datasets. We use DOMEL to label our own recorded Event dataset for Motion Segmentation (EMS-DOMEL), which we release to the public for further research and benchmarking. Rigorous experiments are conducted on several unseen publicly-available datasets where the results revealed that GTNN outperforms state-of-the-art methods in the presence of dynamic background variations, motion patterns, and multiple dynamic objects with varying sizes and velocities. GTNN achieves significant performance gains with an average increase of 9.4% and 4.5% in terms of motion segmentation accuracy (<italic>IoU</i>%) and detection rate (<italic>DR</i>%), respectively.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"385-400"},"PeriodicalIF":8.4,"publicationDate":"2024-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10812712","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993808","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
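GTNN processes an event stream as a 3D graph whose nodes are events and whose edges link spatiotemporal neighbours. The sketch below shows one common way to build such a graph with k-nearest neighbours; the neighbourhood size, time scaling, and edge format are assumptions for illustration, not the paper's construction.

```python
import torch

def build_event_graph(events: torch.Tensor, k: int = 8, time_scale: float = 1e3):
    """Connect each event to its k nearest spatiotemporal neighbours.

    events: (N, 3) tensor of (x, y, t); time_scale converts seconds into a distance
    comparable to pixels (the value is an assumption, not taken from the paper).
    Returns edge_index of shape (2, N*k), the COO format used by common GNN libraries.
    """
    coords = events.clone().float()
    coords[:, 2] *= time_scale                      # balance spatial vs. temporal distance
    dist = torch.cdist(coords, coords)              # (N, N) pairwise distances
    dist.fill_diagonal_(float("inf"))               # exclude self-loops
    nbrs = dist.topk(k, largest=False).indices      # (N, k) nearest-neighbour indices
    src = torch.arange(events.shape[0]).repeat_interleave(k)
    return torch.stack([src, nbrs.reshape(-1)])     # (2, N*k) edge list

if __name__ == "__main__":
    ev = torch.rand(1000, 3)                        # synthetic events in [0, 1]
    print(build_event_graph(ev).shape)              # torch.Size([2, 8000])
```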
Cross-Modality Semantic Consistency Learning for Visible-Infrared Person Re-Identification
IF 8.4, Tier 1, Computer Science
IEEE Transactions on Multimedia, Pub Date: 2024-12-23, DOI: 10.1109/TMM.2024.3521843
Min Liu;Zhu Zhang;Yuan Bian;Xueping Wang;Yeqing Sun;Baida Zhang;Yaonan Wang
{"title":"Cross-Modality Semantic Consistency Learning for Visible-Infrared Person Re-Identification","authors":"Min Liu;Zhu Zhang;Yuan Bian;Xueping Wang;Yeqing Sun;Baida Zhang;Yaonan Wang","doi":"10.1109/TMM.2024.3521843","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521843","url":null,"abstract":"Visible-infrared person re-identification (VI-ReID) seeks to identify and match individuals across visible and infrared ranges within intelligent monitoring environments. Most current approaches predominantly explore a two-stream network structure that extract global or rigidly split part features and introduce an extra modality for image compensation to guide networks reducing the huge differences between the two modalities. However, these methods are sensitive to misalignment caused by pose/viewpoint variations and additional noises produced by extra modality generating. Within the confines of this articles, we clearly consider addresses above issues and propose a Cross-modality Semantic Consistency Learning (CSCL) network to excavate the semantic consistent features in different modalities by utilizing human semantic information. Specifically, a Parsing-aligned Attention Module (PAM) is introduced to filter out the irrelevant noises with channel-wise attention and dynamically highlight the semantic-aware representations across modalities in different stages of the network. Then, a Semantic-guided Part Alignment Module (SPAM) is introduced, aimed at efficiently producing a collection of semantic-aligned fine-grained features. This is achieved by incorporating parsing loss and division loss constraints, ultimately enhancing the overall person representation. Finally, an Identity-aware Center Mining (ICM) loss is presented to reduce the distribution between modality centers within classes, thereby further alleviating intra-class modality discrepancies. Extensive experiments indicate that CSCL outperforms the state-of-the-art methods on the SYSU-MM01 and RegDB datasets. Notably, the Rank-1/mAP accuracy on the SYSU-MM01 dataset can achieve 75.72%/72.08%.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"568-580"},"PeriodicalIF":8.4,"publicationDate":"2024-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143465739","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
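The ICM loss described above pulls the per-identity feature centers of the two modalities together. A minimal sketch of that idea, assuming simple mean centers and an MSE distance (the paper's exact formulation may differ), could be:

```python
import torch
import torch.nn.functional as F

def identity_center_loss(feats, labels, is_infrared):
    """Pull the visible-modality and infrared-modality centers of each identity together.

    feats: (N, D) features, labels: (N,) identity IDs, is_infrared: (N,) bool modality flags.
    """
    loss, count = feats.new_zeros(()), 0
    for pid in labels.unique():
        vis = feats[(labels == pid) & ~is_infrared]
        ir = feats[(labels == pid) & is_infrared]
        if len(vis) and len(ir):                     # identity seen in both modalities
            loss = loss + F.mse_loss(vis.mean(0), ir.mean(0))
            count += 1
    return loss / max(count, 1)

if __name__ == "__main__":
    f = F.normalize(torch.randn(32, 256), dim=1)     # a toy batch of person features
    y = torch.randint(0, 4, (32,))                   # 4 identities
    m = torch.rand(32) > 0.5                         # random modality assignment
    print(identity_center_loss(f, y, m).item())
```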
DNP-AUT: Image Compression Using Double-Layer Non-Uniform Partition and Adaptive U Transform
IF 8.4, Tier 1, Computer Science
IEEE Transactions on Multimedia, Pub Date: 2024-12-23, DOI: 10.1109/TMM.2024.3521853
Yumo Zhang;Zhanchuan Cai
{"title":"DNP-AUT: Image Compression Using Double-Layer Non-Uniform Partition and Adaptive U Transform","authors":"Yumo Zhang;Zhanchuan Cai","doi":"10.1109/TMM.2024.3521853","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521853","url":null,"abstract":"To provide an image compression method with better compression performance and lower computational complexity, a new image compression algorithm is proposed in this paper. First, a double-layer non-uniform partition algorithm is proposed, which analyzes the texture complexity of image blocks and performs partitioning and merging of the image blocks at different scales to provide a priori information that helps to reduce the spatial redundancy for subsequent compression against the blocks. Next, by considering the multi-transform cores, we propose an adaptive U transform scheme, which performs more specific coding for different types of image blocks to enhance the coding performance. Finally, in order that the bit allocation can be more flexible and accurate, a fully adaptive quantization technique is proposed. It not only formulates the quantization coefficient relationship between image blocks of different sizes but also further refines the quantization coefficient relationship between image blocks under different topologies. Extensive experiments indicate that the compression performance of the proposed algorithm not only significantly surpasses the JPEG but also surpasses some state-of-the-art compression algorithms with similar computational complexity. In addition, compared with the JPEG2000 compression algorithm, which has greater with higher computational complexity, its compression performance also has certain advantages.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"249-262"},"PeriodicalIF":8.4,"publicationDate":"2024-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993740","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
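The non-uniform partition analyzes block texture complexity to decide where to split. As a rough single-layer analogue (not the paper's double-layer scheme; the variance criterion, threshold, and block sizes are placeholders), a recursive quadtree-style split driven by block variance might look like this:

```python
import numpy as np

def partition(img, y, x, size, thresh, min_size, blocks):
    """Recursively split a square block while its variance (a proxy for texture
    complexity) exceeds `thresh`; leaf blocks are collected in `blocks`."""
    block = img[y:y + size, x:x + size]
    if size > min_size and block.var() > thresh:
        half = size // 2
        for dy in (0, half):
            for dx in (0, half):
                partition(img, y + dy, x + dx, half, thresh, min_size, blocks)
    else:
        blocks.append((y, x, size))                  # (top, left, block size)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.integers(0, 256, (64, 64)).astype(np.float64)
    leaves = []
    partition(image, 0, 0, 64, thresh=4000.0, min_size=8, blocks=leaves)
    print(len(leaves), "blocks")
```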
Vision Transformer With Relation Exploration for Pedestrian Attribute Recognition
IF 8.4, Tier 1, Computer Science
IEEE Transactions on Multimedia, Pub Date: 2024-12-23, DOI: 10.1109/TMM.2024.3521677
Hao Tan;Zichang Tan;Dunfang Weng;Ajian Liu;Jun Wan;Zhen Lei;Stan Z. Li
{"title":"Vision Transformer With Relation Exploration for Pedestrian Attribute Recognition","authors":"Hao Tan;Zichang Tan;Dunfang Weng;Ajian Liu;Jun Wan;Zhen Lei;Stan Z. Li","doi":"10.1109/TMM.2024.3521677","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521677","url":null,"abstract":"Pedestrian attribute recognition has achieved high accuracy by exploring the relations between image regions and attributes. However, existing methods typically adopt features directly extracted from the backbone or utilize a single structure (e.g., transformer) to explore the relations, leading to inefficient and incomplete relation mining. To overcome these limitations, this paper proposes a comprehensive relationship framework called Vision Transformer with Relation Exploration (ViT-RE) for pedestrian attribute recognition, which includes two novel modules, namely Attribute and Contextual Feature Projection (ACFP) and Relation Exploration Module (REM). In ACFP, attribute-specific features and contextual-aware features are learned individually to capture discriminative information tailored for attributes and image regions, respectively. Then, REM employs Graph Convolutional Network (GCN) Blocks and Transformer Blocks to concurrently explore attribute, contextual, and attribute-contextual relations. To enable fine-grained relation mining, a Dynamic Adjacency Module (DAM) is further proposed to construct instance-wise adjacency matrix for the GCN Block. Equipped with comprehensive relation information, ViT-RE achieves promising performance on three popular benchmarks, including PETA, RAP, and PA-100 K datasets. Moreover, ViT-RE achieves the first place in the <italic>WACV 2023 UPAR Challenge</i>.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"198-208"},"PeriodicalIF":8.4,"publicationDate":"2024-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993605","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
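The Dynamic Adjacency Module builds an instance-wise adjacency matrix for the GCN block. A schematic sketch of that idea, assuming the adjacency comes from softmax-normalized feature similarity (an assumption, not the paper's exact design), is shown below:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicGCNBlock(nn.Module):
    """Graph convolution whose adjacency is built per instance from node similarity."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, nodes: torch.Tensor) -> torch.Tensor:
        # nodes: (B, K, D) -- e.g. one node per attribute or image region
        sim = torch.bmm(nodes, nodes.transpose(1, 2)) / nodes.shape[-1] ** 0.5
        adj = F.softmax(sim, dim=-1)                  # instance-wise, row-normalized adjacency
        return F.relu(self.proj(torch.bmm(adj, nodes)) + nodes)  # propagate + residual

if __name__ == "__main__":
    x = torch.randn(2, 26, 384)                       # 26 attribute nodes, 384-d features
    print(DynamicGCNBlock(384)(x).shape)              # torch.Size([2, 26, 384])
```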
MDSC-Net: Multi-Modal Discriminative Sparse Coding Driven RGB-D Classification Network
IF 8.4, Tier 1, Computer Science
IEEE Transactions on Multimedia, Pub Date: 2024-12-23, DOI: 10.1109/TMM.2024.3521720
Jingyi Xu;Xin Deng;Yibing Fu;Mai Xu;Shengxi Li
{"title":"MDSC-Net: Multi-Modal Discriminative Sparse Coding Driven RGB-D Classification Network","authors":"Jingyi Xu;Xin Deng;Yibing Fu;Mai Xu;Shengxi Li","doi":"10.1109/TMM.2024.3521720","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521720","url":null,"abstract":"In this paper, we propose a novel sparsity-driven deep neural network to solve the RGB-D image classification problem. Different from existing classification networks, our network architecture is designed by drawing inspirations from a new proposed multi-modal discriminative sparse coding (MDSC) model. The key feature of this model is that it can gradually separate the discriminative and non-discriminative features in RGB-D images in a coarse-to-fine manner. Only the discriminative features are integrated and refined for classification, while the non-discriminative features are discarded, to improve the classification accuracy and efficiency. Derived from the MDSC model, the proposed network is composed of three modules, i.e., the shared feature extraction (SFE) module, discriminative feature refinement (DFR) module, and classification module. The architecture of each module is derived from the optimization solution in the MDSC model. To the best of our knowledge, this is the first time a fully sparsity-driven network has been proposed for RGB-D image classification. Extensive results verify the effectiveness of our method on different RGB-D image datasets.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"442-454"},"PeriodicalIF":8.4,"publicationDate":"2024-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993575","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
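Networks whose architecture is derived from a sparse coding model are typically unfolded from an iterative solver. For orientation only, the sketch below shows classic ISTA iterations for single-modal sparse coding; the MDSC model itself is multi-modal and discriminative, and its actual update rules are not reproduced here.

```python
import torch

def ista(x, D, lam=0.1, n_iter=30):
    """ISTA iterations for sparse coding: min_z ||x - D z||^2 + lam * ||z||_1.

    x: (M, S) signals, D: (M, K) dictionary. Returns sparse codes z of shape (K, S).
    """
    L = torch.linalg.matrix_norm(D, ord=2) ** 2          # Lipschitz constant of D^T D
    z = torch.zeros(D.shape[1], x.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ z - x)                         # gradient of the data term
        z = torch.nn.functional.softshrink(z - grad / L, lambd=float(lam / L))
    return z

if __name__ == "__main__":
    D = torch.randn(64, 128); D = D / D.norm(dim=0)      # column-normalized dictionary
    x = torch.randn(64, 10)                              # 10 signals
    print(ista(x, D).abs().gt(0).float().mean().item())  # fraction of non-zero codes
```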
Dynamic Strategy Prompt Reasoning for Emotional Support Conversation
IF 8.4, Tier 1, Computer Science
IEEE Transactions on Multimedia, Pub Date: 2024-12-23, DOI: 10.1109/TMM.2024.3521669
Yiting Liu;Liang Li;Yunbin Tu;Beichen Zhang;Zheng-Jun Zha;Qingming Huang
{"title":"Dynamic Strategy Prompt Reasoning for Emotional Support Conversation","authors":"Yiting Liu;Liang Li;Yunbin Tu;Beichen Zhang;Zheng-Jun Zha;Qingming Huang","doi":"10.1109/TMM.2024.3521669","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521669","url":null,"abstract":"An emotional support conversation (ESC) system aims to reduce users' emotional distress by engaging in conversation using various reply strategies as guidance. To develop instructive reply strategies for an ESC system, it is essential to consider the dynamic transitions of users' emotional states through the conversational turns. However, existing methods for strategy-guided ESC systems struggle to capture these transitions as they overlook the inference of fine-grained user intentions. This oversight poses a significant obstacle, impeding the model's ability to derive pertinent strategy information and, consequently, hindering its capacity to generate emotionally supportive responses. To tackle this limitation, we propose a novel dynamic strategy prompt reasoning model (DSR), which leverages sparse context relation deduction to acquire adaptive representation of reply strategies as prompts for guiding the response generation process. Specifically, we first perform turn-level commonsense reasoning with different approaches to extract auxiliary knowledge, which enhances the comprehension of user intention. Then we design a context relation deduction module to dynamically integrate interdependent dialogue information, capturing granular user intentions and generating effective strategy prompts. Finally, we utilize the strategy prompts to guide the generation of more relevant and supportive responses. DSR model is validated through extensive experiments conducted on a benchmark dataset, demonstrating its superior performance compared to the latest competitive methods in the field.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"108-119"},"PeriodicalIF":8.4,"publicationDate":"2024-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993823","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
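Strategy prompts guide generation by conditioning the decoder on an adaptive strategy representation. The sketch below illustrates one generic way to do this: mix learned strategy embeddings by a predicted strategy distribution and prepend the result to the decoder's token embeddings. The strategy count, embedding size, and prompt length are placeholders, not DSR's configuration.

```python
import torch
import torch.nn as nn

class StrategyPrompt(nn.Module):
    """Turn a distribution over reply strategies into a soft prompt for the decoder."""
    def __init__(self, n_strategies: int = 8, dim: int = 768, prompt_len: int = 4):
        super().__init__()
        self.strategy_emb = nn.Embedding(n_strategies, dim)
        self.to_prompt = nn.Linear(dim, prompt_len * dim)
        self.prompt_len, self.dim = prompt_len, dim

    def forward(self, strategy_probs, token_embeds):
        # strategy_probs: (B, n_strategies); token_embeds: (B, T, dim)
        mixed = strategy_probs @ self.strategy_emb.weight            # (B, dim) soft strategy
        prompt = self.to_prompt(mixed).view(-1, self.prompt_len, self.dim)
        return torch.cat([prompt, token_embeds], dim=1)              # (B, prompt_len + T, dim)

if __name__ == "__main__":
    probs = torch.softmax(torch.randn(2, 8), dim=-1)
    tokens = torch.randn(2, 20, 768)
    print(StrategyPrompt()(probs, tokens).shape)                     # torch.Size([2, 24, 768])
```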
Cross-Modal Cognitive Consensus Guided Audio–Visual Segmentation
IF 8.4, Tier 1, Computer Science
IEEE Transactions on Multimedia, Pub Date: 2024-12-23, DOI: 10.1109/TMM.2024.3521746
Zhaofeng Shi;Qingbo Wu;Fanman Meng;Linfeng Xu;Hongliang Li
{"title":"Cross-Modal Cognitive Consensus Guided Audio–Visual Segmentation","authors":"Zhaofeng Shi;Qingbo Wu;Fanman Meng;Linfeng Xu;Hongliang Li","doi":"10.1109/TMM.2024.3521746","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521746","url":null,"abstract":"Audio-Visual Segmentation (AVS) aims to extract the sounding object from a video frame, which is represented by a pixel-wise segmentation mask for application scenarios such as multi-modal video editing, augmented reality, and intelligent robot systems. The pioneering work conducts this task through dense feature-level audio-visual interaction, which ignores the dimension gap between different modalities. More specifically, the audio clip could only provide a <italic>Global</i> semantic label in each sequence, but the video frame covers multiple semantic objects across different <italic>Local</i> regions, which leads to mislocalization of the representationally similar but semantically different object. In this paper, we propose a Cross-modal Cognitive Consensus guided Network (C3N) to align the audio-visual semantics from the global dimension and progressively inject them into the local regions via an attention mechanism. Firstly, a Cross-modal Cognitive Consensus Inference Module (C3IM) is developed to extract a unified-modal label by integrating audio/visual classification confidence and similarities of modality-agnostic label embeddings. Then, we feed the unified-modal label back to the visual backbone as the explicit semantic-level guidance via a Cognitive Consensus guided Attention Module (CCAM), which highlights the local features corresponding to the interested object. Extensive experiments on the Single Sound Source Segmentation (S4) setting and Multiple Sound Source Segmentation (MS3) setting of the AVSBench dataset demonstrate the effectiveness of the proposed method, which achieves state-of-the-art performance.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"209-223"},"PeriodicalIF":8.4,"publicationDate":"2024-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993604","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
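C3IM derives a unified-modal label from audio/visual classification confidences and the similarity of modality-agnostic label embeddings. A schematic reading of that fusion (the weighting alpha and the scoring rule are assumptions, not the paper's exact module) could be:

```python
import torch
import torch.nn.functional as F

def unified_label(audio_logits, visual_logits, label_emb, alpha=0.5):
    """Pick one semantic label that both modalities agree on.

    audio_logits, visual_logits: (B, K) classification scores; label_emb: (K, D)
    modality-agnostic label embeddings shared by both classifiers.
    """
    pa = F.softmax(audio_logits, dim=-1)                 # audio confidence
    pv = F.softmax(visual_logits, dim=-1)                # visual confidence
    emb = F.normalize(label_emb, dim=-1)
    sim = emb @ emb.T                                    # (K, K) label-label similarity
    # score class k by its own confidence plus how close it is to the other modality's belief
    score = alpha * (pa + pv) + (1 - alpha) * (pa @ sim + pv @ sim)
    return score.argmax(dim=-1)                          # (B,) unified-modal label

if __name__ == "__main__":
    K, D = 23, 128                                       # e.g. 23 sounding-object categories
    print(unified_label(torch.randn(4, K), torch.randn(4, K), torch.randn(K, D)))
```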
Polarization State Attention Dehazing Network With a Simulated Polar-Haze Dataset
IF 8.4, Tier 1, Computer Science
IEEE Transactions on Multimedia, Pub Date: 2024-12-23, DOI: 10.1109/TMM.2024.3521827
Sijia Wen;Yinqiang Zheng;Feng Lu
{"title":"Polarization State Attention Dehazing Network With a Simulated Polar-Haze Dataset","authors":"Sijia Wen;Yinqiang Zheng;Feng Lu","doi":"10.1109/TMM.2024.3521827","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521827","url":null,"abstract":"Image dehazing under harsh weather conditions remains a challenging and ill-posed problem. In addition, acquiring real-time haze-free counterparts of hazy images poses difficulties. Existing approaches commonly synthesize hazy data by relying on estimated depth information, which is prone to errors due to its physical unreliability. While generative networks can transfer some hazy features to clear images, the resulting hazy images still exhibit an artificial appearance. In this paper, we introduce polarization cues to propose a haze simulation strategy to synthesize hazy data, ensuring visually pleasing results that adhere to physical laws. Leveraging on the simulated Polar-Haze dataset, we present a polarization state attention dehazing network (PSADNet), which consists of a polarization extraction module and a polarization dehazing module. The proposed polarization extraction model incorporates an attention mechanism to capture high-level image features related to polarization and chromaticity. The polarization dehazing module utilizes these features derived from the polarization analysis to enhance image dehazing capabilities while preserving the accuracy of the polarization information. Promising results are observed in both qualitative and quantitative experiments, supporting the effectiveness of the proposed PSADNet and the validity of polarization-based haze simulation strategy.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"263-274"},"PeriodicalIF":8.4,"publicationDate":"2024-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993741","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
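Polarization-based haze simulation typically starts from the atmospheric scattering model and assumes that only the airlight is partially polarized while the direct transmission is not. The sketch below synthesizes a pair of orthogonally polarized hazy images under those classical assumptions; the parameter values are arbitrary examples, and the Polar-Haze pipeline itself is more elaborate.

```python
import numpy as np

def simulate_polar_haze(J, depth, beta=1.0, A_inf=0.9, dop=0.3):
    """Synthesize two orthogonally polarized hazy images from a clear image.

    Uses I = J*t + A_inf*(1 - t) with t = exp(-beta * depth), and splits the airlight
    into two polarization channels with degree of polarization `dop` while the direct
    transmission is split equally (unpolarized).
    """
    t = np.exp(-beta * depth)[..., None]           # transmission map, broadcast over RGB
    direct = J * t                                 # attenuated scene radiance
    airlight = A_inf * (1.0 - t)                   # scattered atmospheric light
    I_par = 0.5 * direct + 0.5 * airlight * (1.0 + dop)
    I_perp = 0.5 * direct + 0.5 * airlight * (1.0 - dop)
    return I_par, I_perp, I_par + I_perp           # two polarized captures + total intensity

if __name__ == "__main__":
    clear = np.random.rand(120, 160, 3)
    depth = np.linspace(0.5, 3.0, 120 * 160).reshape(120, 160)
    p1, p2, total = simulate_polar_haze(clear, depth)
    print(total.shape, float(total.max()))
```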
SDE2D: Semantic-Guided Discriminability Enhancement Feature Detector and Descriptor
IF 8.4, Tier 1, Computer Science
IEEE Transactions on Multimedia, Pub Date: 2024-12-23, DOI: 10.1109/TMM.2024.3521748
Jiapeng Li;Ruonan Zhang;Ge Li;Thomas H. Li
{"title":"SDE2D: Semantic-Guided Discriminability Enhancement Feature Detector and Descriptor","authors":"Jiapeng Li;Ruonan Zhang;Ge Li;Thomas H. Li","doi":"10.1109/TMM.2024.3521748","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521748","url":null,"abstract":"Local feature detectors and descriptors serve various computer vision tasks, such as image matching, visual localization, and 3D reconstruction. To address the extreme variations of rotation and light in the real world, most detectors and descriptors capture as much invariance as possible. However, these methods ignore feature discriminability and perform poorly in indoor scenes. Indoor scenes have too many weak-textured and even repeatedly textured regions, so it is necessary for the extracted features to possess sufficient discriminability. Therefore, we propose a semantic-guided method (called SDE2D) enhancing feature discriminability to improve the performance of descriptors for indoor scenes. We develop a kind of semantic-guided discriminability enhancement (SDE) loss function that uses semantic information from indoor scenes. To the best of our knowledge, this is the first deep research that applies semantic segmentation to enhance discriminability. In addition, we design a novel framework that allows semantic segmentation network to be well embedded as a module in the overall framework and provides guidance information for training. Besides, we explore the impact of different semantic segmentation models on our method. The experimental results on indoor scenes datasets demonstrate that the proposed SDE2D performs well compared with the state-of-the-art models.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"275-286"},"PeriodicalIF":8.4,"publicationDate":"2024-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993742","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
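The SDE loss uses semantic segmentation labels to sharpen descriptor discriminability. As a generic contrastive-style sketch only (the paper's loss is not reproduced; the margin and random pair sampling below are assumptions), one could penalize high similarity between descriptors drawn from different semantic regions:

```python
import torch
import torch.nn.functional as F

def semantic_discriminability_loss(desc, sem, n_samples=512, margin=0.2):
    """Encourage descriptors from different semantic regions to be dissimilar.

    desc: (C, H, W) dense descriptors; sem: (H, W) integer semantic labels predicted
    by a segmentation network.
    """
    C, H, W = desc.shape
    flat = F.normalize(desc.reshape(C, -1), dim=0)        # unit-norm descriptors, (C, H*W)
    idx = torch.randint(0, H * W, (2, n_samples))         # random descriptor pairs
    same = sem.reshape(-1)[idx[0]] == sem.reshape(-1)[idx[1]]
    cos = (flat[:, idx[0]] * flat[:, idx[1]]).sum(0)      # cosine similarity per pair
    # penalize cross-class pairs whose similarity exceeds the margin
    return F.relu(cos[~same] - margin).mean() if (~same).any() else cos.new_zeros(())

if __name__ == "__main__":
    d = torch.randn(128, 60, 80)                          # toy dense descriptor map
    s = torch.randint(0, 5, (60, 80))                     # toy 5-class semantic map
    print(semantic_discriminability_loss(d, s).item())
```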