Semantic Embedding Uncertainty Learning for Image and Text Matching
Yan Wang, Yunzhi Su, Wenhui Li, C. Yan, Bolun Zheng, Xuanya Li, Anjin Liu
2023 IEEE International Conference on Multimedia and Expo (ICME), July 2023. DOI: 10.1109/ICME55011.2023.00153
Image and text matching measures semantic similarity for cross-modal retrieval. The core of this task is semantic embedding, which mines the intrinsic characteristics of visual and textual data to build discriminative representations. However, the cross-modal ambiguity between images and text (the existence of one-to-many associations) gives rise to semantic diversity. Mainstream approaches represent semantics with fixed point embeddings, ignoring the embedding uncertainty caused by semantic diversity and thus producing incorrect results. To address this issue, we propose a novel Semantic Embedding Uncertainty Learning (SEUL) method, which represents the embedding uncertainty of images and text as Gaussian distributions and simultaneously learns the salient embedding (mean) and the uncertainty (variance) in the common space. We design semantic uncertainty embedding to improve the robustness of the representation under semantic diversity. We further propose a combined objective function that optimizes the semantic uncertainty while maintaining discriminability to enhance cross-modal associations. Extensive experiments on two datasets demonstrate the advanced performance of our method.
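The abstract does not include an implementation, but the core idea of learning a mean (salient embedding) and a variance (uncertainty) per modality can be sketched concretely. Below is a minimal, hypothetical PyTorch illustration of such a Gaussian embedding head; the class name, layer sizes, and reparameterized sampling are assumptions for illustration, not the authors' actual SEUL architecture or objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianEmbeddingHead(nn.Module):
    """Hypothetical sketch: maps a modality-specific feature vector to a
    Gaussian in the common embedding space. The mean plays the role of the
    salient embedding and the (log-)variance models embedding uncertainty,
    in the spirit of the abstract's description."""

    def __init__(self, feat_dim: int, embed_dim: int):
        super().__init__()
        self.mean_fc = nn.Linear(feat_dim, embed_dim)    # salient embedding (mean)
        self.logvar_fc = nn.Linear(feat_dim, embed_dim)  # uncertainty (log variance)

    def forward(self, feat: torch.Tensor, n_samples: int = 1):
        # Unit-normalized mean embedding for cosine-style matching.
        mean = F.normalize(self.mean_fc(feat), dim=-1)
        logvar = self.logvar_fc(feat)
        std = torch.exp(0.5 * logvar)
        # Reparameterization trick: draw samples from N(mean, diag(std^2))
        # so that sampling stays differentiable during training.
        eps = torch.randn(n_samples, *mean.shape, device=feat.device)
        samples = mean.unsqueeze(0) + eps * std.unsqueeze(0)
        return mean, logvar, samples

# Example usage with assumed feature dimensions (e.g., CNN image features
# and transformer text features projected into a shared 1024-d space):
# img_head = GaussianEmbeddingHead(feat_dim=2048, embed_dim=1024)
# txt_head = GaussianEmbeddingHead(feat_dim=768, embed_dim=1024)
```

Under this kind of formulation, a matching loss can be applied to the means to preserve discriminability, while a separate regularizer on the variances penalizes over- or under-confident uncertainty estimates; the paper's combined objective presumably balances terms of this nature, though its exact form is not given in the abstract.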