{"title":"通过跨注意力上下文转换器与模态协作学习实现图像-文本多模态分类","authors":"Qianyao Shi, Wanru Xu, Zhenjiang Miao","doi":"10.1117/1.jei.33.4.043042","DOIUrl":null,"url":null,"abstract":"Nowadays, we are surrounded by various types of data from different modalities, such as text, images, audio, and video. The existence of this multimodal data provides us with rich information, but it also brings new challenges: how do we effectively utilize this data for accurate classification? This is the main problem faced by multimodal classification tasks. Multimodal classification is an important task that aims to classify data from different modalities. However, due to the different characteristics and structures of data from different modalities, effectively fusing and utilizing them for classification is a challenging problem. To address this issue, we propose a cross-attention contextual transformer with modality-collaborative learning for multimodal classification (CACT-MCL-MMC) to better integrate information from different modalities. On the one hand, existing multimodal fusion methods ignore the intra- and inter-modality relationships, and there is unnoticed information in the modalities, resulting in unsatisfactory classification performance. To address the problem of insufficient interaction of modality information in existing algorithms, we use a cross-attention contextual transformer to capture the contextual relationships within and among modalities to improve the representativeness of the model. On the other hand, due to differences in the quality of information among different modalities, some modalities may have misleading or ambiguous information. Treating each modality equally may result in modality perceptual noise, which reduces the performance of multimodal classification. Therefore, we use modality-collaborative to filter misleading information, alleviate the quality difference of information among modalities, align modality information with high-quality and effective modalities, enhance unimodal information, and obtain more ideal multimodal fusion information to improve the model’s discriminative ability. Our comparative experimental results on two benchmark datasets for image-text classification, CrisisMMD and UPMC Food-101, show that our proposed model outperforms other classification methods and even state-of-the-art (SOTA) multimodal classification methods. Meanwhile, the effectiveness of the cross-attention module, multimodal contextual attention network, and modality-collaborative learning was verified through ablation experiments. In addition, conducting hyper-parameter validation experiments showed that different fusion calculation methods resulted in differences in experimental results. The most effective feature tensor calculation method was found. We also conducted qualitative experiments. Compared with the original model, our proposed model can identify the expected results in the vast majority of cases. The codes are available at https://github.com/KobeBryant8-24-MVP/CACT-MCL-MMC. 
The CrisisMMD is available at https://dataverse.mpisws.org/dataverse/icwsm18, and the UPMC-Food-101 is available at https://visiir.isir.upmc.fr/.","PeriodicalId":54843,"journal":{"name":"Journal of Electronic Imaging","volume":"2 4 1","pages":""},"PeriodicalIF":1.0000,"publicationDate":"2024-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Image-text multimodal classification via cross-attention contextual transformer with modality-collaborative learning\",\"authors\":\"Qianyao Shi, Wanru Xu, Zhenjiang Miao\",\"doi\":\"10.1117/1.jei.33.4.043042\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Nowadays, we are surrounded by various types of data from different modalities, such as text, images, audio, and video. The existence of this multimodal data provides us with rich information, but it also brings new challenges: how do we effectively utilize this data for accurate classification? This is the main problem faced by multimodal classification tasks. Multimodal classification is an important task that aims to classify data from different modalities. However, due to the different characteristics and structures of data from different modalities, effectively fusing and utilizing them for classification is a challenging problem. To address this issue, we propose a cross-attention contextual transformer with modality-collaborative learning for multimodal classification (CACT-MCL-MMC) to better integrate information from different modalities. On the one hand, existing multimodal fusion methods ignore the intra- and inter-modality relationships, and there is unnoticed information in the modalities, resulting in unsatisfactory classification performance. To address the problem of insufficient interaction of modality information in existing algorithms, we use a cross-attention contextual transformer to capture the contextual relationships within and among modalities to improve the representativeness of the model. On the other hand, due to differences in the quality of information among different modalities, some modalities may have misleading or ambiguous information. Treating each modality equally may result in modality perceptual noise, which reduces the performance of multimodal classification. Therefore, we use modality-collaborative to filter misleading information, alleviate the quality difference of information among modalities, align modality information with high-quality and effective modalities, enhance unimodal information, and obtain more ideal multimodal fusion information to improve the model’s discriminative ability. Our comparative experimental results on two benchmark datasets for image-text classification, CrisisMMD and UPMC Food-101, show that our proposed model outperforms other classification methods and even state-of-the-art (SOTA) multimodal classification methods. Meanwhile, the effectiveness of the cross-attention module, multimodal contextual attention network, and modality-collaborative learning was verified through ablation experiments. In addition, conducting hyper-parameter validation experiments showed that different fusion calculation methods resulted in differences in experimental results. The most effective feature tensor calculation method was found. We also conducted qualitative experiments. Compared with the original model, our proposed model can identify the expected results in the vast majority of cases. 
The codes are available at https://github.com/KobeBryant8-24-MVP/CACT-MCL-MMC. The CrisisMMD is available at https://dataverse.mpisws.org/dataverse/icwsm18, and the UPMC-Food-101 is available at https://visiir.isir.upmc.fr/.\",\"PeriodicalId\":54843,\"journal\":{\"name\":\"Journal of Electronic Imaging\",\"volume\":\"2 4 1\",\"pages\":\"\"},\"PeriodicalIF\":1.0000,\"publicationDate\":\"2024-08-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Electronic Imaging\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1117/1.jei.33.4.043042\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q4\",\"JCRName\":\"ENGINEERING, ELECTRICAL & ELECTRONIC\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Electronic Imaging","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1117/1.jei.33.4.043042","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Image-text multimodal classification via cross-attention contextual transformer with modality-collaborative learning
Nowadays, we are surrounded by data from many different modalities, such as text, images, audio, and video. This multimodal data provides rich information, but it also raises a new challenge: how can we effectively use it for accurate classification? This is the central problem of multimodal classification. Multimodal classification is an important task that aims to classify data drawn from different modalities. However, because data from different modalities differ in characteristics and structure, fusing and exploiting them effectively for classification is challenging. To address this issue, we propose a cross-attention contextual transformer with modality-collaborative learning for multimodal classification (CACT-MCL-MMC) to better integrate information from different modalities. On the one hand, existing multimodal fusion methods ignore intra- and inter-modality relationships and leave information within the modalities unexploited, resulting in unsatisfactory classification performance. To address the insufficient interaction of modality information in existing algorithms, we use a cross-attention contextual transformer to capture the contextual relationships within and among modalities and improve the representational capacity of the model. On the other hand, because the quality of information differs across modalities, some modalities may carry misleading or ambiguous information. Treating every modality equally may introduce modality perceptual noise, which reduces multimodal classification performance. We therefore use modality-collaborative learning to filter misleading information, alleviate the quality differences among modalities, align modality information with high-quality and effective modalities, enhance unimodal information, and obtain better multimodal fusion information, improving the model's discriminative ability. Comparative experiments on two benchmark image-text classification datasets, CrisisMMD and UPMC Food-101, show that our proposed model outperforms other classification methods, including state-of-the-art (SOTA) multimodal classification methods. Ablation experiments verify the effectiveness of the cross-attention module, the multimodal contextual attention network, and modality-collaborative learning. In addition, hyper-parameter validation experiments show that different fusion calculation methods lead to different results, and the most effective feature tensor calculation method is identified. We also conducted qualitative experiments: compared with the original model, our proposed model identifies the expected results in the vast majority of cases. The code is available at https://github.com/KobeBryant8-24-MVP/CACT-MCL-MMC. The CrisisMMD dataset is available at https://dataverse.mpisws.org/dataverse/icwsm18, and the UPMC Food-101 dataset is available at https://visiir.isir.upmc.fr/.
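To make the two ideas in the abstract concrete, the sketch below illustrates (1) cross-attention between text and image token features to model inter-modality context and (2) a simple modality-collaboration step that re-weights each modality by an estimated quality score before fusion. This is a minimal illustration only, not the authors' released code: the module names, dimensions, pooling, and the gating scheme are assumptions; see the repository at https://github.com/KobeBryant8-24-MVP/CACT-MCL-MMC for the actual CACT-MCL-MMC implementation.

import torch
import torch.nn as nn


class CrossAttentionFusion(nn.Module):
    """Illustrative cross-attention fusion with a modality-quality gate (not the authors' code)."""

    def __init__(self, dim: int = 768, num_heads: int = 8, num_classes: int = 2):
        super().__init__()
        # Cross-attention in both directions: text attends to image tokens and vice versa.
        self.txt2img = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.img2txt = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Hypothetical collaboration gate: scores each modality's reliability so a noisy
        # modality contributes less to the fused representation.
        self.gate = nn.Sequential(nn.Linear(2 * dim, 2), nn.Softmax(dim=-1))
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, text_tokens: torch.Tensor, image_tokens: torch.Tensor) -> torch.Tensor:
        # text_tokens: (B, Lt, dim) and image_tokens: (B, Li, dim), e.g. from pretrained encoders.
        txt_ctx, _ = self.txt2img(text_tokens, image_tokens, image_tokens)  # text queries image
        img_ctx, _ = self.img2txt(image_tokens, text_tokens, text_tokens)  # image queries text
        txt_vec = txt_ctx.mean(dim=1)  # pool cross-attended text representation
        img_vec = img_ctx.mean(dim=1)  # pool cross-attended image representation
        # Weight each modality by its estimated quality, then fuse and classify.
        w = self.gate(torch.cat([txt_vec, img_vec], dim=-1))  # (B, 2)
        fused = w[:, 0:1] * txt_vec + w[:, 1:2] * img_vec      # (B, dim)
        return self.classifier(fused)


if __name__ == "__main__":
    model = CrossAttentionFusion()
    logits = model(torch.randn(4, 32, 768), torch.randn(4, 49, 768))
    print(logits.shape)  # torch.Size([4, 2])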
About the journal:
The Journal of Electronic Imaging publishes peer-reviewed papers in all technology areas that make up the field of electronic imaging and are normally considered in the design, engineering, and applications of electronic imaging systems.