Cross-Modal Joint Embedding with Diverse Semantics

Zhongwei Xie, Ling Liu, Yanzhao Wu, Lin Li, Luo Zhong

2020 IEEE Second International Conference on Cognitive Machine Intelligence (CogMI), October 2020
DOI: 10.1109/CogMI50398.2020.00028
Textual-visual cross-modal retrieval has been an active research area in both the computer vision and natural language processing communities. Most existing works learn a joint embedding model that maps raw text-image pairs onto a joint latent representation space in which the similarity between textual embeddings and visual embeddings can be computed and compared, without leveraging diverse semantics. This paper presents a general framework to study and evaluate the impact of diverse semantics extracted from the multi-modal input data on the quality and performance of joint embedding learning. We identify different ways that conventional textual features, such as TF-IDF term frequency semantics and image category semantics, can be combined with neural features to further boost the efficiency of joint embedding learning. Experiments on the benchmark dataset Recipe1M demonstrate that existing representative cross-modal joint embedding approaches, when enhanced with diverse semantics in both the raw inputs and the joint embedding loss optimization, can effectively boost their cross-modal retrieval performance.
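For illustration only, the sketch below (not the authors' implementation) shows the general shape of a two-branch joint embedding of the kind the abstract describes: image and text features are projected into a shared space where cosine similarity is compared, and the text branch concatenates a TF-IDF vector with a neural sentence feature as one possible way to inject conventional textual semantics. All dimensions, the margin value, and the layer choices are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    """Minimal two-branch joint embedding sketch (illustrative, not the paper's model)."""
    def __init__(self, img_dim=2048, txt_dim=1024, tfidf_dim=4096, emb_dim=1024):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, emb_dim)              # visual branch (e.g. pooled CNN features)
        self.txt_proj = nn.Linear(txt_dim + tfidf_dim, emb_dim)  # textual branch: neural feature + TF-IDF vector

    def forward(self, img_feat, txt_feat, tfidf_feat):
        v = F.normalize(self.img_proj(img_feat), dim=-1)                                   # visual embedding
        t = F.normalize(self.txt_proj(torch.cat([txt_feat, tfidf_feat], dim=-1)), dim=-1)  # textual embedding
        return v, t

def bidirectional_triplet_loss(v, t, margin=0.3):
    """Ranking loss over in-batch negatives, a common objective for cross-modal retrieval."""
    sim = v @ t.t()                                  # cosine similarities (embeddings are L2-normalized)
    pos = sim.diag().unsqueeze(1)                    # similarities of matching text-image pairs
    cost_t = (margin + sim - pos).clamp(min=0)       # image-to-text direction
    cost_v = (margin + sim.t() - pos).clamp(min=0)   # text-to-image direction
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    return cost_t.masked_fill(mask, 0).mean() + cost_v.masked_fill(mask, 0).mean()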