Peiguang Jing;Xuan Zhao;Fugui Fan;Fan Yang;Yun Li;Yuting Su
DOI: 10.1109/TMM.2024.3405724
Journal: IEEE Transactions on Multimedia, vol. 26, pp. 10134-10144 (JCR Q1, Computer Science, Information Systems; impact factor 8.4)
Publication date: 2024-06-26 (Journal Article)
Available at: https://ieeexplore.ieee.org/document/10572319/
Multimodal Progressive Modulation Network for Micro-Video Multi-Label Classification
Micro-videos, as an increasingly popular form of user-generated content (UGC), naturally contain diverse multimodal cues. However, in pursuit of consistent representations, existing methods fail to simultaneously explore modality discrepancies and preserve modality diversity. In this paper, we propose a multimodal progressive modulation network (MPMNet) for micro-video multi-label classification, which enhances the indicative ability of each modality by gradually regulating various modality biases. In MPMNet, we first leverage a unimodal-centered parallel aggregation strategy to obtain preliminary comprehensive representations. We then integrate a feature-domain disentangled modulation process and a category-domain adaptive modulation process into a unified framework to jointly refine modality-oriented representations. In the former process, we constrain inter-modal dependencies in a latent space to obtain modality-oriented sample representations, and introduce a disentangled paradigm to further maintain modality diversity. In the latter process, we construct global-context-aware graph convolutional networks to acquire modality-oriented category representations, and develop two instance-level parameter generators to further regulate unimodal semantic biases. Extensive experiments on two micro-video multi-label datasets show that our proposed approach outperforms state-of-the-art methods.
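The abstract mentions graph convolutional networks used to acquire category representations for multi-label classification. The paper's exact architecture is not reproduced here; the sketch below is a generic, minimal NumPy illustration of the common pattern in this line of work (propagating label embeddings over a normalized label co-occurrence graph, then scoring a video feature against each refined category representation). All dimensions, the random graph, and the two-layer depth are illustrative assumptions, not the authors' configuration.

```python
import numpy as np

def normalize_adjacency(A):
    """Symmetrically normalize a label co-occurrence matrix:
    A_hat = D^{-1/2} (A + I) D^{-1/2}, with self-loops added."""
    A = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))  # degrees >= 1 because of self-loops
    return (A * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]

def gcn_layer(H, A_hat, W):
    """One graph convolution: propagate category embeddings over the
    normalized graph, then apply a linear map and a ReLU."""
    return np.maximum(A_hat @ H @ W, 0.0)

# Toy setup (hypothetical sizes): 4 categories, 8-dim embeddings, 2 layers.
rng = np.random.default_rng(0)
num_labels, dim = 4, 8
A = (rng.random((num_labels, num_labels)) > 0.5).astype(float)
A = np.maximum(A, A.T)      # co-occurrence graphs are symmetric
np.fill_diagonal(A, 0.0)
A_hat = normalize_adjacency(A)

H = rng.standard_normal((num_labels, dim))   # initial label embeddings
W1 = rng.standard_normal((dim, dim)) * 0.1
W2 = rng.standard_normal((dim, dim)) * 0.1
category_reps = gcn_layer(gcn_layer(H, A_hat, W1), A_hat, W2)

# Multi-label scores: inner product of one video feature with each
# category representation (one logit per label, not a softmax).
video_feat = rng.standard_normal(dim)
scores = category_reps @ video_feat
```

In the per-label scoring step, each label gets an independent logit, which is what distinguishes multi-label classification from single-label softmax classification.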
Journal introduction:
The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.