IEEE Transactions on Multimedia: Latest Publications

Rethinking Video Sentence Grounding From a Tracking Perspective With Memory Network and Masked Attention
IF 8.4 | CAS Q1 | Computer Science
IEEE Transactions on Multimedia | Pub Date: 2024-09-02 | DOI: 10.1109/TMM.2024.3453062
Zeyu Xiong;Daizong Liu;Xiang Fang;Xiaoye Qu;Jianfeng Dong;Jiahao Zhu;Keke Tang;Pan Zhou
{"title":"Rethinking Video Sentence Grounding From a Tracking Perspective With Memory Network and Masked Attention","authors":"Zeyu Xiong;Daizong Liu;Xiang Fang;Xiaoye Qu;Jianfeng Dong;Jiahao Zhu;Keke Tang;Pan Zhou","doi":"10.1109/TMM.2024.3453062","DOIUrl":"10.1109/TMM.2024.3453062","url":null,"abstract":"Video sentence grounding (VSG) is the task of identifying the segment of an untrimmed video that semantically corresponds to a given natural language query. While many existing methods extract frame-grained features using pre-trained 2D or 3D convolution networks, often fail to capture subtle differences between ambiguous adjacent frames. Although some recent approaches incorporate object-grained features using Faster R-CNN to capture more fine-grained details, they are still primarily based on feature enhancement and lack spatio-temporal modeling to explore the semantics of the core persons/objects. To solve the problem of modeling the core target's behavior, in this paper, we propose a new perspective for addressing the VSG task by tracking pivotal objects and activities to learn more fine-grained spatio-temporal features. Specifically, we introduce the Video Sentence Tracker with Memory Network and Masked Attention (VSTMM), which comprises a cross-modal targets generator for producing multi-modal templates and search space, a memory-based tracker for dynamically tracking multi-modal targets using a memory network to record targets' behaviors, a masked attention localizer which learns local shared features between frames and eliminates interference from long-term dependencies, resulting in improved accuracy when localizing the moment. To evaluate the performance of our VSTMM, we conducted extensive experiments and comparisons with state-of-the-art methods on three challenging benchmarks, including Charades-STA, ActivityNet Captions, and TACoS. Without bells and whistles, our VSTMM achieves leading performance with a considerable real-time speed.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"11204-11218"},"PeriodicalIF":8.4,"publicationDate":"2024-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142178723","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
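As a rough illustration of the masked-attention idea described above (restricting each frame to a local temporal neighbourhood so that long-term dependencies do not interfere with moment localization), the following minimal sketch applies single-head self-attention over frame features with a banded mask. The window size, shapes, and the absence of learned projections are assumptions for illustration, not the authors' implementation.

```python
# Hedged sketch: local-window masked self-attention over frame features.
# Everything here (window size, single head, no learned projections) is an
# illustrative assumption, not the VSTMM localizer itself.
import torch
import torch.nn.functional as F

def local_masked_attention(frame_feats, window=5):
    """frame_feats: (T, D) per-frame features; each frame attends only to
    frames within +/- window positions."""
    T, D = frame_feats.shape
    q = k = v = frame_feats                                # projections omitted for brevity
    scores = q @ k.t() / D ** 0.5                          # (T, T) attention logits
    idx = torch.arange(T)
    mask = (idx[None, :] - idx[:, None]).abs() > window    # True = blocked (too far away)
    scores = scores.masked_fill(mask, float('-inf'))
    attn = F.softmax(scores, dim=-1)
    return attn @ v                                        # (T, D) locally contextualized features

if __name__ == "__main__":
    feats = torch.randn(32, 256)                           # 32 frames, 256-d features (toy)
    print(local_masked_attention(feats, window=4).shape)   # torch.Size([32, 256])
```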
DBSR: Quadratic Conditional Diffusion Model for Blind Cardiac MRI Super-Resolution
IF 8.4 | CAS Q1 | Computer Science
IEEE Transactions on Multimedia | Pub Date: 2024-09-02 | DOI: 10.1109/TMM.2024.3453059
Defu Qiu;Yuhu Cheng;Kelvin K.L. Wong;Wenjun Zhang;Zhang Yi;Xuesong Wang
{"title":"DBSR: Quadratic Conditional Diffusion Model for Blind Cardiac MRI Super-Resolution","authors":"Defu Qiu;Yuhu Cheng;Kelvin K.L. Wong;Wenjun Zhang;Zhang Yi;Xuesong Wang","doi":"10.1109/TMM.2024.3453059","DOIUrl":"10.1109/TMM.2024.3453059","url":null,"abstract":"Cardiac magnetic resonance imaging (CMRI) can help experts quickly diagnose cardiovascular diseases. Due to the patient's breathing and slight movement during the magnetic resonance imaging scan, the obtained CMRI may be severely blurred, affecting the accuracy of clinical diagnosis. To address this issue, we propose the quadratic conditional diffusion model for blind CMRI super-resolution (DBSR). Specifically, we propose a conditional blur kernel noise predictor, which predicts the blur kernel from low-resolution images by the diffusion model, transforming the unknown blur kernel in low-resolution CMRI into a known one. Meanwhile, we design a novel conditional CMRI noise predictor, which uses the predicted blur kernel as prior knowledge to guide the diffusion model in reconstructing high-resolution CMRI. Furthermore, we propose a cascaded residual attention network feature extractor, which extracts feature information from CMRI low-resolution images for blur kernel prediction and SR reconstruction of CMRI images. Extensive experimental results indicate that our proposed DBSR achieves better blind super-resolution reconstruction results than several state-of-the-art baselines.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"11358-11371"},"PeriodicalIF":8.4,"publicationDate":"2024-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142227575","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
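The abstract describes a two-stage ("quadratic") conditional scheme: one diffusion model predicts the blur kernel from the low-resolution image, and a second reconstructs the high-resolution image conditioned on that kernel. The sketch below shows that flow with a generic DDPM-style ancestral sampler and placeholder denoisers; every network, schedule, and shape here is an assumption, not the DBSR architecture.

```python
# Hedged sketch of the two-stage conditional inference flow. The denoisers
# are trivial placeholders standing in for the learned networks.
import numpy as np

def sample_conditional_diffusion(denoiser, condition, shape, steps=50, rng=None):
    """Generic DDPM-style ancestral sampling with a linear beta schedule."""
    rng = rng or np.random.default_rng(0)
    betas = np.linspace(1e-4, 2e-2, steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(shape)
    for t in reversed(range(steps)):
        eps = denoiser(x, condition, t)                    # predicted noise at step t
        x = (x - betas[t] / np.sqrt(1 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            x += np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x

# Placeholder "networks"; the real model would use the cascaded residual
# attention feature extractor to build the conditioning signal.
kernel_denoiser = lambda x, cond, t: x * 0.1
image_denoiser = lambda x, cond, t: x * 0.1

lr_image = np.random.rand(1, 64, 64)
blur_kernel = sample_conditional_diffusion(kernel_denoiser, lr_image, shape=(1, 15, 15))
hr_image = sample_conditional_diffusion(image_denoiser, (lr_image, blur_kernel), shape=(1, 256, 256))
print(hr_image.shape)
```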
LFS-Aware Surface Reconstruction From Unoriented 3D Point Clouds
IF 8.4 | CAS Q1 | Computer Science
IEEE Transactions on Multimedia | Pub Date: 2024-09-02 | DOI: 10.1109/TMM.2024.3453050
Rao Fu;Kai Hormann;Pierre Alliez
{"title":"LFS-Aware Surface Reconstruction From Unoriented 3D Point Clouds","authors":"Rao Fu;Kai Hormann;Pierre Alliez","doi":"10.1109/TMM.2024.3453050","DOIUrl":"10.1109/TMM.2024.3453050","url":null,"abstract":"We present a novel approach for generating isotropic surface triangle meshes directly from unoriented 3D point clouds, with the mesh density adapting to the estimated local feature size (LFS). Popular reconstruction pipelines first reconstruct a dense mesh from the input point cloud and then apply remeshing to obtain an isotropic mesh. The sequential pipeline makes it hard to find a lower-density mesh while preserving more details. Instead, our approach reconstructs both an implicit function and an LFS-aware mesh sizing function directly from the input point cloud, which is then used to produce the final LFS-aware mesh without remeshing. We combine local curvature radius and shape diameter to estimate the LFS directly from the input point clouds. Additionally, we propose a new mesh solver to solve an implicit function whose zero level set delineates the surface without requiring normal orientation. The added value of our approach is generating isotropic meshes directly from 3D point clouds with an LFS-aware density, thus achieving a trade-off between geometric detail and mesh complexity. Our experiments also demonstrate the robustness of our method to noise, outliers, and missing data and can preserve sharp features for CAD point clouds.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"11415-11427"},"PeriodicalIF":8.4,"publicationDate":"2024-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142178718","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
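To make the LFS idea concrete, the sketch below combines per-point curvature-radius and shape-diameter estimates into a smoothly graded sizing field, the kind of quantity an LFS-aware mesher could consume. The pointwise-minimum blend, clamping, and grading pass are illustrative assumptions rather than the paper's estimator.

```python
# Hedged sketch: turning curvature-radius and shape-diameter estimates into
# an LFS-aware mesh sizing field. The blend and grading rule are assumptions.
import numpy as np

def lfs_sizing(curvature_radius, shape_diameter, neighbors,
               min_size=0.01, max_size=1.0, grading=1.2, iters=10):
    """curvature_radius, shape_diameter: (N,) per-point estimates.
    neighbors: list of index arrays (k-NN graph). Returns an (N,) sizing field."""
    lfs = np.minimum(curvature_radius, 0.5 * shape_diameter)
    size = np.clip(lfs, min_size, max_size)
    # Grading pass: a point's target size may exceed its neighbours' sizes by
    # at most the 'grading' factor, so triangle sizes vary smoothly.
    for _ in range(iters):
        for i, nbrs in enumerate(neighbors):
            size[i] = min(size[i], grading * size[nbrs].min())
    return size

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 100
    curv = rng.uniform(0.05, 0.8, n)               # toy per-point curvature radii
    diam = rng.uniform(0.1, 1.5, n)                # toy per-point shape diameters
    nbrs = [rng.choice(n, 5, replace=False) for _ in range(n)]
    print(lfs_sizing(curv, diam, nbrs)[:5])
```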
Multi-Prior Driven Resolution Rescaling Blocks for Intra Frame Coding
IF 8.4 | CAS Q1 | Computer Science
IEEE Transactions on Multimedia | Pub Date: 2024-09-02 | DOI: 10.1109/TMM.2024.3453033
Peiying Wu;Shiwei Wang;Liquan Shen;Feifeng Wang;Zhaoyi Tian;Xia Hua
{"title":"Multi-Prior Driven Resolution Rescaling Blocks for Intra Frame Coding","authors":"Peiying Wu;Shiwei Wang;Liquan Shen;Feifeng Wang;Zhaoyi Tian;Xia Hua","doi":"10.1109/TMM.2024.3453033","DOIUrl":"10.1109/TMM.2024.3453033","url":null,"abstract":"Deep learning techniques are increasingly integrated into rescaling-based video compression frameworks and have shown great potential in improving compression efficiency. However, existing methods achieve limited performance because 1) they treat context priors generated by codec as independent sources of information, ignoring potential interactions between multiple priors in rescaling, which may not effectively facilitate compression; 2) they often employ a uniform sampling ratio across regions with varying content complexities, resulting in the loss of important information. To address the above two issues, this paper proposes a spatial multi-prior driven resolution rescaling framework for intra-frame coding, called MP-RRF, consisting of three sub-networks: a multi-prior driven network, a downscaling network, and an upscaling network. First, the multi-prior driven network employs complexity and similarity priors to smooth the unnecessarily complicated information while leveraging similarity and quality priors to produce high-fidelity complementary information. This interaction of complexity, similarity and quality priors ensures redundancy reduction and texture enhancement. Second, the downscaling network discriminatively processes components of different granularities to generate a compact, low-resolution image for encoding. The upscaling network aggregates a complementary set of contextual multi-scale features to reconstruct realistic details while combining variable receptive fields to suppress multi-scale compression artifacts and resampling noise. Extensive experiments show that our network achieves a significant 23.84% Bjøntegaard Delta Rate (BD-Rate) reduction under all-intra configuration compared to the codec anchor, offering the state-of-the-art coding performance.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"11274-11289"},"PeriodicalIF":8.4,"publicationDate":"2024-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142178724","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
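The rescaling-based coding framework that MP-RRF builds on (downscale, code at reduced resolution, then upscale) can be illustrated with the toy pipeline below, where fixed 2x box resampling and a uniform quantizer stand in for the learned prior-driven networks and the real intra codec.

```python
# Hedged sketch of a rescaling-based intra coding pipeline. The resamplers
# and "codec" are crude placeholders; only the framework's shape is shown.
import numpy as np

def downscale2x(img):
    h, w = img.shape
    return img[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upscale2x(img):
    return np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)

def fake_codec(img, qstep=8.0):
    """Stand-in for the intra codec: uniform quantization in the pixel domain."""
    return np.round(img / qstep) * qstep

def psnr(a, b):
    mse = np.mean((a - b) ** 2)
    return 10 * np.log10(255.0 ** 2 / mse)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame = rng.integers(0, 256, size=(128, 128)).astype(float)
    low = downscale2x(frame)                 # learned, prior-driven in MP-RRF
    decoded_low = fake_codec(low)            # fewer samples to code -> bitrate saving
    reconstructed = upscale2x(decoded_low)   # learned multi-scale upscaling in MP-RRF
    print("PSNR vs. original: %.2f dB" % psnr(frame, reconstructed))
```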
SMC-NCA: Semantic-Guided Multi-Level Contrast for Semi-Supervised Temporal Action Segmentation
IF 8.4 | CAS Q1 | Computer Science
IEEE Transactions on Multimedia | Pub Date: 2024-09-02 | DOI: 10.1109/TMM.2024.3452980
Feixiang Zhou;Zheheng Jiang;Huiyu Zhou;Xuelong Li
{"title":"SMC-NCA: Semantic-Guided Multi-Level Contrast for Semi-Supervised Temporal Action Segmentation","authors":"Feixiang Zhou;Zheheng Jiang;Huiyu Zhou;Xuelong Li","doi":"10.1109/TMM.2024.3452980","DOIUrl":"10.1109/TMM.2024.3452980","url":null,"abstract":"Semi-supervised temporal action segmentation (SS-TAS) aims to perform frame-wise classification in long untrimmed videos, where only a fraction of videos in the training set have labels. Recent studies have shown the potential of contrastive learning in unsupervised representation learning using unlabelled data. However, learning the representation of each frame by unsupervised contrastive learning for action segmentation remains an open and challenging problem. In this paper, we propose a novel Semantic-guided Multi-level Contrast scheme with a Neighbourhood-Consistency-Aware unit (SMC-NCA) to extract strong frame-wise representations for SS-TAS. Specifically, for representation learning, SMC is first used to explore intra- and inter-information variations in a unified and contrastive way, based on action-specific semantic information and temporal information highlighting relations between actions. Then, the NCA module, which is responsible for enforcing spatial consistency between neighbourhoods centered at different frames to alleviate over-segmentation issues, works alongside SMC for semi-supervised learning (SSL). Our SMC outperforms the other state-of-the-art methods on three benchmarks, offering improvements of up to 17.8\u0000<inline-formula><tex-math>$%$</tex-math></inline-formula>\u0000 and 12.6\u0000<inline-formula><tex-math>$%$</tex-math></inline-formula>\u0000 in terms of Edit distance and accuracy, respectively. Additionally, the NCA unit results in significantly better segmentation performance in the presence of only 5\u0000<inline-formula><tex-math>$%$</tex-math></inline-formula>\u0000 labelled videos. We also demonstrate the generalizability and effectiveness of the proposed method on our Parkinson's Disease Mouse Behaviour (PDMB) dataset.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"11386-11401"},"PeriodicalIF":8.4,"publicationDate":"2024-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142178716","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
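A common way to realise frame-wise contrast with semantic (label) guidance is a supervised InfoNCE-style loss in which frames sharing an action label act as positives; the sketch below shows that generic formulation. The temperature and the positive definition are assumptions and are not claimed to match the SMC objective.

```python
# Hedged sketch: supervised contrastive loss over frame embeddings, with
# same-action frames as positives. A generic stand-in, not the SMC loss.
import torch
import torch.nn.functional as F

def framewise_contrastive_loss(frame_embs, action_ids, tau=0.1):
    """frame_embs: (T, D); action_ids: (T,) integer action label per frame."""
    z = F.normalize(frame_embs, dim=-1)
    sim = z @ z.t() / tau                                   # (T, T) similarities
    T = z.shape[0]
    eye = torch.eye(T, dtype=torch.bool)
    pos = (action_ids[:, None] == action_ids[None, :]) & ~eye
    sim = sim.masked_fill(eye, float('-inf'))               # exclude self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_counts = pos.sum(1).clamp(min=1)
    loss = -(log_prob.masked_fill(~pos, 0).sum(1) / pos_counts)
    return loss[pos.sum(1) > 0].mean()                      # anchors with at least one positive

if __name__ == "__main__":
    embs = torch.randn(20, 64)                              # 20 frames, 64-d embeddings (toy)
    labels = torch.randint(0, 4, (20,))                     # 4 action classes
    print(framewise_contrastive_loss(embs, labels).item())
```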
Elaborate Teacher: Improved Semi-Supervised Object Detection With Rich Image Exploiting
IF 8.4 | CAS Q1 | Computer Science
IEEE Transactions on Multimedia | Pub Date: 2024-09-02 | DOI: 10.1109/TMM.2024.3453040
Xi Yang;Qiubai Zhou;Ziyu Wei;Hong Liu;Nannan Wang;Xinbo Gao
{"title":"Elaborate Teacher: Improved Semi-Supervised Object Detection With Rich Image Exploiting","authors":"Xi Yang;Qiubai Zhou;Ziyu Wei;Hong Liu;Nannan Wang;Xinbo Gao","doi":"10.1109/TMM.2024.3453040","DOIUrl":"10.1109/TMM.2024.3453040","url":null,"abstract":"Semi-Supervised Object Detection (SSOD) has shown remarkable results by leveraging image pairs with a teacher-student framework. An excellent strong augmentation method can generate richer images and alleviate the influence of noise in pseudo-labels. However, existing data augmentation methods for SSOD do not consider instance-level information, thus, they cannot make full use of unlabeled data. Besides, the current teacher-student framework in SSOD solely relies on pseudo-labeling techniques, which may disregard some uncertain information. In this article, we introduce a new method called Elaborate Teacher which generates and exploits image pairs in a more refined manner. To enrich strongly augmented images, a novel data augmentation method called Information-Aware Mixup Representation (IAMR) is proposed. IAMR utilizes the teacher model's predictions as prior information and considers instance-level information, which can be seamlessly integrated with existing SSOD data augmentation methods. Furthermore, to fully exploit the information in unlabeled data, we propose the Enhanced Scale Consistency Regularization (ESCR), which considers the consistency from both semantic space and feature space. Elaborate Teacher introduces a fresh data augmentation method, complemented by consistency regularization, which boosts the performance of semi-supervised object detectors. Extensive experiments on the \u0000<italic>PASCAL VOC</i>\u0000 and \u0000<italic>MS-COCO</i>\u0000 datasets demonstrate the effectiveness of our method in leveraging unlabeled image information. Our method consistently outperforms the baseline method and improves mAP by 11.6% and 9.0% relative to the supervised baseline method when using 5% and 10% of labeled data on \u0000<italic>MS-COCO</i>\u0000, respectively.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"11345-11357"},"PeriodicalIF":8.4,"publicationDate":"2024-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142178725","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
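As a loose illustration of using teacher predictions as prior information inside a mixup-style strong augmentation, the toy function below biases the mixing coefficient toward the image the teacher is more confident about. IAMR additionally uses instance-level information (boxes), which this sketch omits; all weights and names here are assumptions.

```python
# Hedged sketch: teacher-confidence-guided mixup. Not the IAMR algorithm;
# the bias rule and coefficients are illustrative assumptions.
import numpy as np

def confidence_guided_mixup(img_a, img_b, conf_a, conf_b, rng=None):
    """img_*: HxWxC arrays; conf_*: scalar mean teacher confidence in [0, 1].
    The more confident image dominates the mixed sample."""
    rng = rng or np.random.default_rng(0)
    base = rng.beta(1.0, 1.0)                       # random mixup coefficient
    bias = 0.5 + 0.5 * (conf_a - conf_b)            # shift toward the confident image
    lam = np.clip(0.5 * base + 0.5 * bias, 0.0, 1.0)
    return lam * img_a + (1.0 - lam) * img_b, lam

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    a = rng.random((64, 64, 3))
    b = rng.random((64, 64, 3))
    mixed, lam = confidence_guided_mixup(a, b, conf_a=0.9, conf_b=0.4, rng=rng)
    print(mixed.shape, round(lam, 3))
```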
Learning Discriminative Motion Models for Multiple Object Tracking
IF 8.4 | CAS Q1 | Computer Science
IEEE Transactions on Multimedia | Pub Date: 2024-09-02 | DOI: 10.1109/TMM.2024.3453057
Yi-Fan Li;Hong-Bing Ji;Wen-Bo Zhang;Yu-Kun Lai
{"title":"Learning Discriminative Motion Models for Multiple Object Tracking","authors":"Yi-Fan Li;Hong-Bing Ji;Wen-Bo Zhang;Yu-Kun Lai","doi":"10.1109/TMM.2024.3453057","DOIUrl":"10.1109/TMM.2024.3453057","url":null,"abstract":"Motion models are vital for solving multiple object tracking (MOT), which makes instance-level position predictions of targets to handle occlusions and noisy detections. Recent methods have proposed the use of Single Object Tracking (SOT) techniques to build motion models and unify the SOT tracker with the object detector into a single network for high-efficiency MOT. However, three feature incompatibility issues in the required features of this paradigm are ignored, leading to inferior performance. First, the object detector requires class-specific features to localize objects of pre-defined classes. Contrarily, target-specific features are required in SOT to track the target of interest with an unknown category. Second, MOT relies on intra-class differences to associate targets of the same identity (ID). On the other hand, the SOT trackers focus on inter-class differences to distinguish the tracking target from the background. Third, classification confidence is used to determine the existence of targets, which is obtained with category-related features and cannot accurately reveal the existence of targets in tracking scenes. To address these issues, we propose a novel Task-specific Feature Encoding Network (TFEN) to extract task-driven features for different sub-networks. Besides, we propose a novel Quadruplet State Sampling (QSS) strategy to form the training samples of the motion model and guide the SOT trackers to capture identity-discriminative features in position predictions. Finally, we propose an Existence Aware Tracking (EAT) algorithm by estimating the existence confidence of targets and re-considering low-scored predictions to recover missed targets. Experimental results indicate that the proposed Discriminative Motion Model-based tracker (DMMTracker) can effectively address these issues when employing SOT trackers as motion models, leading to highly competitive results on MOT benchmarks.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"11372-11385"},"PeriodicalIF":8.4,"publicationDate":"2024-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142178601","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
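The existence-aware idea (re-considering low-scored predictions when other evidence suggests the target still exists) can be sketched as a simple decision rule that blends the SOT response with recent detection evidence. The blend weights and thresholds below are illustrative assumptions, not the EAT algorithm.

```python
# Hedged sketch: existence-aware keep/drop decision for a track's prediction.
def keep_prediction(tracker_response, recent_det_score, cls_score,
                    w_resp=0.6, w_det=0.4, keep_thresh=0.5, cls_thresh=0.5):
    """Return True if the predicted box should stay in the track."""
    if cls_score >= cls_thresh:          # confident detection: keep as usual
        return True
    existence = w_resp * tracker_response + w_det * recent_det_score
    return existence >= keep_thresh      # recover an occluded / low-scored target

if __name__ == "__main__":
    # Occluded target: weak classifier score, but the SOT response and the
    # track's recent detection history still indicate it exists.
    print(keep_prediction(tracker_response=0.7, recent_det_score=0.6, cls_score=0.2))  # True
    print(keep_prediction(tracker_response=0.1, recent_det_score=0.2, cls_score=0.1))  # False
```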
Language-Guided Dual-Modal Local Correspondence for Single Object Tracking
IF 8.4 | CAS Q1 | Computer Science
IEEE Transactions on Multimedia | Pub Date: 2024-08-29 | DOI: 10.1109/TMM.2024.3410141
Jun Yu;Zhongpeng Cai;Yihao Li;Lei Wang;Fang Gao;Ye Yu
{"title":"Language-Guided Dual-Modal Local Correspondence for Single Object Tracking","authors":"Jun Yu;Zhongpeng Cai;Yihao Li;Lei Wang;Fang Gao;Ye Yu","doi":"10.1109/TMM.2024.3410141","DOIUrl":"10.1109/TMM.2024.3410141","url":null,"abstract":"This paper focuses on the advancement of single-object tracking technologies in computer vision, which have broad applications including robotic vision, video surveillance, and sports video analysis. Current methods relying solely on the target's initial visual information encounter performance bottlenecks and limited applications, due to the scarcity of target semantics in appearance features and the continuous change in the target's appearance. To address these issues, we propose a novel approach, combining visual-language dual-modal single-object tracking, that leverages natural language descriptions to enrich the semantic information of the moving target. We introduce a dual-modal single-object tracking algorithm based on local correspondence modeling. The algorithm decomposes visual features into multiple local visual semantic features and pairs them with local language features extracted from natural language descriptions. In addition, we also propose a new global relocalization method that utilizes visual language bimodal information to perceive target disappearance and misalignment and adaptively reposition the target in the entire image. This improves the tracker's ability to adapt to changes in target appearance over long periods of time, enabling long-term single target tracking based on bimodal semantic and motion information. Experimental results show that our model outperforms state-of-the-art methods, which demonstrates the effectiveness and efficiency of our approach.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"10637-10650"},"PeriodicalIF":8.4,"publicationDate":"2024-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142178726","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
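A minimal version of local correspondence modelling is to match each local language feature against all local visual features by cosine similarity and keep the best match per word, as sketched below; the aggregation (mean of per-word maxima) is an assumption for illustration, not the paper's scoring function.

```python
# Hedged sketch: word-to-region local correspondence scoring by cosine similarity.
import torch
import torch.nn.functional as F

def local_correspondence_score(visual_locals, language_locals):
    """visual_locals: (R, D) region features; language_locals: (W, D) word features."""
    v = F.normalize(visual_locals, dim=-1)
    l = F.normalize(language_locals, dim=-1)
    sim = l @ v.t()                      # (W, R) word-to-region cosine similarities
    best_per_word, _ = sim.max(dim=1)    # each word matched to its best region
    return best_per_word.mean()          # overall target-candidate matching score

if __name__ == "__main__":
    regions = torch.randn(16, 128)       # 16 local visual patches of a candidate (toy)
    words = torch.randn(7, 128)          # 7 word/phrase embeddings from the query (toy)
    print(local_correspondence_score(regions, words).item())
```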
MAAN: Memory-Augmented Auto-Regressive Network for Text-Driven 3D Indoor Scene Generation
IF 8.4 | CAS Q1 | Computer Science
IEEE Transactions on Multimedia | Pub Date: 2024-08-26 | DOI: 10.1109/TMM.2024.3443657
Zhaoda Ye;Yang Liu;Yuxin Peng
{"title":"MAAN: Memory-Augmented Auto-Regressive Network for Text-Driven 3D Indoor Scene Generation","authors":"Zhaoda Ye;Yang Liu;Yuxin Peng","doi":"10.1109/TMM.2024.3443657","DOIUrl":"10.1109/TMM.2024.3443657","url":null,"abstract":"The objective of text-driven 3D indoor scene generation is to automatically generate and arrange the objects to form a 3D scene that accurately captures the semantics detailed in the given text description. Existing approaches are mainly guided by specific object categories and room layout to generate and position objects like furniture within 3D indoor scenes. However, few methods harness the potential of the text description to precisely control both \u0000<italic>spatial relationships</i>\u0000 and \u0000<italic>object combinations</i>\u0000. Consequently, these methods lack a robust mechanism for determining accurate object attributes necessary to craft a plausible 3D scene that maintains consistent spatial relationships in alignment with the provided text description. To tackle these issues, we propose the Memory-Augmented Auto-regressive Network (MAAN), which is a text-driven method for synthesizing 3D indoor scenes with controllable spatial relationships and object compositions. Firstly, we propose a memory-augmented network to help the model decide the attributes of the objects, such as 3D coordinates, rotation and size, which improves the consistency of the object spatial relations with text descriptions. Our approach constructs a memory context to select relevant objects within the scene, which provides spatial information that aids in generating the new object with the correct attributes. Secondly, we develop a prior attribute prediction network to learn how to generate a complete scene with suitable and reasonable object compositions. This prior attribute prediction network adopts a pre-training strategy to extract composition priors from existing scenes, which enables the organization of multiple objects to form a reasonable scene and keeps the object relations according to the text descriptions. We conduct experiments on three different room types (bedroom, living room, and dining room) on the 3D-FRONT dataset. The results of these experiments underscore the accuracy of our method in governing spatial relationships among objects, showcasing its superior flexibility compared to existing techniques.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"11057-11069"},"PeriodicalIF":8.4,"publicationDate":"2024-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142178727","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
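The auto-regressive, memory-conditioned generation loop described above can be sketched as follows: objects are emitted one at a time, and the attributes of already-placed objects form the memory context for the next prediction. The toy attribute predictor is a placeholder, not the MAAN network, and the attribute set shown is an assumption.

```python
# Hedged sketch: memory-augmented auto-regressive object placement.
import numpy as np

def generate_scene(num_objects, predict_attributes, rng=None):
    rng = rng or np.random.default_rng(0)
    memory = []                                   # attributes of already-placed objects
    for _ in range(num_objects):
        attrs = predict_attributes(memory, rng)   # conditioned on the scene so far
        memory.append(attrs)
    return memory

def toy_predictor(memory, rng):
    """Places each new object near the previous one; a stand-in for the
    learned, text-conditioned attribute prediction."""
    prev_xyz = memory[-1]["xyz"] if memory else np.zeros(3)
    return {
        "xyz": prev_xyz + rng.uniform(-1.0, 1.0, size=3),  # position
        "rotation": rng.uniform(0, 360),                   # yaw in degrees
        "size": rng.uniform(0.5, 2.0, size=3),             # bounding-box extents
    }

if __name__ == "__main__":
    scene = generate_scene(4, toy_predictor)
    for i, obj in enumerate(scene):
        print(i, np.round(obj["xyz"], 2))
```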
Towards Weakly Supervised Text-to-Audio Grounding
IF 8.4 | CAS Q1 | Computer Science
IEEE Transactions on Multimedia | Pub Date: 2024-08-23 | DOI: 10.1109/TMM.2024.3443614
Xuenan Xu;Ziyang Ma;Mengyue Wu;Kai Yu
{"title":"Towards Weakly Supervised Text-to-Audio Grounding","authors":"Xuenan Xu;Ziyang Ma;Mengyue Wu;Kai Yu","doi":"10.1109/TMM.2024.3443614","DOIUrl":"10.1109/TMM.2024.3443614","url":null,"abstract":"Text-to-audio grounding (TAG) task aims to predict the onsets and offsets of sound events described by natural language. This task can facilitate applications such as multimodal information retrieval. This paper focuses on weakly-supervised text-to-audio grounding (WSTAG), where frame-level annotations of sound events are unavailable, and only the caption of a whole audio clip can be utilized for training. WSTAG is superior to strongly-supervised approaches in its scalability to large audio-text datasets. Two WSTAG frameworks are studied in this paper: sentence-level and phrase-level. First, we analyze the limitations of mean pooling used in the previous WSTAG approach and investigate the effects of different pooling strategies. We then propose phrase-level WSTAG to use matching labels between audio clips and phrases for training. Advanced negative sampling strategies and self-supervision are proposed to enhance the accuracy of the weak labels and provide pseudo strong labels. Experimental results show that our system significantly outperforms previous WSTAG methods. Finally, we conduct extensive experiments to analyze the effects of several factors on phrase-level WSTAG.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"11126-11138"},"PeriodicalIF":8.4,"publicationDate":"2024-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142178742","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
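The pooling question raised in the abstract (how frame-level scores are aggregated into a clip-level decision under weak supervision) can be illustrated by comparing mean, max, and linear-softmax pooling on a short event: mean pooling dilutes the event, while the other two preserve it. The event probabilities below are synthetic, and the exact pooling functions studied in the paper may differ.

```python
# Hedged sketch: three clip-level pooling strategies over frame-level
# probabilities of a sound event.
import numpy as np

def mean_pool(p):
    return p.mean()

def max_pool(p):
    return p.max()

def linear_softmax_pool(p, eps=1e-8):
    return (p ** 2).sum() / (p.sum() + eps)   # frames weighted by their own probability

if __name__ == "__main__":
    # A short event: active in only 5 of 100 frames.
    p = np.full(100, 0.05)
    p[40:45] = 0.9
    for name, fn in [("mean", mean_pool), ("max", max_pool), ("linear-softmax", linear_softmax_pool)]:
        print(f"{name:>14}: {fn(p):.3f}")
```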