Fugui Fan , Yuting Su , Yun Liu , Peiguang Jing , Kaihua Qu , Yu Liu
{"title":"用于微视频多标签分类的多模态深度分层语义对齐矩阵因式分解方法","authors":"Fugui Fan , Yuting Su , Yun Liu , Peiguang Jing , Kaihua Qu , Yu Liu","doi":"10.1016/j.ipm.2024.103798","DOIUrl":null,"url":null,"abstract":"<div><p>As one of the typical formats of prevalent user-generated content in social media platforms, micro-videos inherently incorporate multimodal characteristics associated with a group of label concepts. However, existing methods generally explore the consensus features aggregated from all modalities to train a final multi-label predictor, while overlooking fine-grained semantic dependencies between modality and label domains. To address this problem, we present a novel multimodal deep hierarchical semantic-aligned matrix factorization (DHSAMF) method, which is devoted to bridging the dual-domain semantic discrepancies and the inter-modal heterogeneity gap for solving the multi-label classification task of micro-videos. Specifically, we utilize deep matrix factorization to individually explore the hierarchical modality-specific representations. A series of semantic embeddings is introduced to facilitate latent semantic interactions between modality-specific representations and label features in a layerwise manner. To further improve the representation ability of each modality, we leverage underlying correlation structures among instances to adequately mine intra-modal complementary attributes, and maximize the inter-modal alignment by aggregating consensus attributes in an optimal permutation. The experimental results conducted on the MTSVRC and VidOR datasets have demonstrated that our DHSAMF outperforms other state-of-the-art methods by nearly 3% and 4% improvements in terms of the AP metric.</p></div>","PeriodicalId":50365,"journal":{"name":"Information Processing & Management","volume":null,"pages":null},"PeriodicalIF":7.4000,"publicationDate":"2024-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Multimodal deep hierarchical semantic-aligned matrix factorization method for micro-video multi-label classification\",\"authors\":\"Fugui Fan , Yuting Su , Yun Liu , Peiguang Jing , Kaihua Qu , Yu Liu\",\"doi\":\"10.1016/j.ipm.2024.103798\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>As one of the typical formats of prevalent user-generated content in social media platforms, micro-videos inherently incorporate multimodal characteristics associated with a group of label concepts. However, existing methods generally explore the consensus features aggregated from all modalities to train a final multi-label predictor, while overlooking fine-grained semantic dependencies between modality and label domains. To address this problem, we present a novel multimodal deep hierarchical semantic-aligned matrix factorization (DHSAMF) method, which is devoted to bridging the dual-domain semantic discrepancies and the inter-modal heterogeneity gap for solving the multi-label classification task of micro-videos. Specifically, we utilize deep matrix factorization to individually explore the hierarchical modality-specific representations. A series of semantic embeddings is introduced to facilitate latent semantic interactions between modality-specific representations and label features in a layerwise manner. 
To further improve the representation ability of each modality, we leverage underlying correlation structures among instances to adequately mine intra-modal complementary attributes, and maximize the inter-modal alignment by aggregating consensus attributes in an optimal permutation. The experimental results conducted on the MTSVRC and VidOR datasets have demonstrated that our DHSAMF outperforms other state-of-the-art methods by nearly 3% and 4% improvements in terms of the AP metric.</p></div>\",\"PeriodicalId\":50365,\"journal\":{\"name\":\"Information Processing & Management\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":7.4000,\"publicationDate\":\"2024-06-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Information Processing & Management\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0306457324001572\",\"RegionNum\":1,\"RegionCategory\":\"管理学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information Processing & Management","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0306457324001572","RegionNum":1,"RegionCategory":"管理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Multimodal deep hierarchical semantic-aligned matrix factorization method for micro-video multi-label classification
As one of the typical formats of prevalent user-generated content on social media platforms, micro-videos inherently incorporate multimodal characteristics associated with a group of label concepts. However, existing methods generally exploit the consensus features aggregated from all modalities to train a final multi-label predictor, while overlooking fine-grained semantic dependencies between the modality and label domains. To address this problem, we present a novel multimodal deep hierarchical semantic-aligned matrix factorization (DHSAMF) method, which bridges the dual-domain semantic discrepancies and the inter-modal heterogeneity gap to solve the multi-label classification task for micro-videos. Specifically, we utilize deep matrix factorization to individually explore hierarchical modality-specific representations. A series of semantic embeddings is introduced to facilitate latent semantic interactions between modality-specific representations and label features in a layerwise manner. To further improve the representation ability of each modality, we leverage the underlying correlation structures among instances to adequately mine intra-modal complementary attributes, and maximize inter-modal alignment by aggregating consensus attributes in an optimal permutation. Experiments conducted on the MTSVRC and VidOR datasets demonstrate that DHSAMF outperforms other state-of-the-art methods, with improvements of nearly 3% and 4%, respectively, in terms of the AP metric.
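The abstract outlines a layerwise scheme in which each modality's features are decomposed through stacked factor matrices and the deepest latent representation is aligned with label semantics. As a rough, self-contained illustration of that general idea (and explicitly not the authors' DHSAMF formulation), the Python/NumPy sketch below factorizes a single modality through two layers and adds a simple alignment penalty toward the label matrix; the layer depth, squared-Frobenius loss, symbol names (U1, U2, H, W, lam), and plain gradient-descent updates are all assumptions made for illustration.

# Minimal sketch: two-layer matrix factorization of one modality with a
# label-alignment penalty. Illustrative only -- not the DHSAMF method.
import numpy as np

rng = np.random.default_rng(0)

def deep_mf_with_alignment(X, Y, d1=64, d2=32, lam=0.1, lr=1e-3, iters=500):
    """Factorize X (n_samples x n_features) as X ~ H @ U2 @ U1 while
    pushing the deepest representation H toward the label matrix Y
    (n_samples x n_labels) through a hypothetical linear map W."""
    n, f = X.shape
    c = Y.shape[1]
    U1 = 0.1 * rng.standard_normal((d1, f))
    U2 = 0.1 * rng.standard_normal((d2, d1))
    H  = 0.1 * rng.standard_normal((n, d2))
    W  = 0.1 * rng.standard_normal((d2, c))
    for _ in range(iters):
        R = H @ U2 @ U1 - X          # reconstruction residual
        A = H @ W - Y                # semantic-alignment residual
        # Gradients of 0.5*||R||_F^2 + 0.5*lam*||A||_F^2
        gH  = R @ (U2 @ U1).T + lam * (A @ W.T)
        gU2 = H.T @ R @ U1.T
        gU1 = (H @ U2).T @ R
        gW  = lam * (H.T @ A)
        H  -= lr * gH
        U2 -= lr * gU2
        U1 -= lr * gU1
        W  -= lr * gW
    loss = (0.5 * np.linalg.norm(H @ U2 @ U1 - X) ** 2
            + 0.5 * lam * np.linalg.norm(H @ W - Y) ** 2)
    return H, loss

# Toy usage: one "modality" with 200 samples, 100-d features, 5 labels.
X = rng.standard_normal((200, 100))
Y = (rng.random((200, 5)) > 0.7).astype(float)
H, loss = deep_mf_with_alignment(X, Y)
print(H.shape, round(loss, 2))

In the full multimodal setting described in the abstract, a factorization like this would be run per modality, with the layerwise semantic embeddings, intra-modal correlation mining, and inter-modal consensus alignment added on top; those components are omitted from this sketch.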
Journal introduction:
Information Processing and Management is dedicated to publishing cutting-edge original research at the convergence of computing and information science. Our scope encompasses theory, methods, and applications across various domains, including advertising, business, health, information science, information technology, marketing, and social computing.
We aim to cater to the interests of both primary researchers and practitioners by offering an effective platform for the timely dissemination of advanced and topical issues in this interdisciplinary field. The journal places particular emphasis on original research articles, research survey articles, research method articles, and articles addressing critical applications of research. Join us in advancing knowledge and innovation at the intersection of computing and information science.