Content-Based Music-Image Retrieval Using Self- and Cross-Modal Feature Embedding Memory

Takayuki Nakatsuka, Masahiro Hamasaki, Masataka Goto
2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)
DOI: 10.1109/WACV56688.2023.00221 · Published 2023-01-01
This paper describes a deep-metric-learning method for content-based cross-modal retrieval between a piece of music and its representative image (i.e., a music audio signal and its cover art image). We train music and image encoders so that, in a shared embedding space, the embeddings of a positive music-image pair lie close together while those of a random pair lie far apart. Furthermore, we propose a mechanism called self- and cross-modal feature embedding memory, which stores the music and image embeddings from previous training iterations and enables the encoders to mine informative pairs from them. To perform such training, we constructed a dataset of 78,325 music-image pairs. On this dataset, the proposed mechanism outperforms baseline methods by factors of 1.93–3.38 in mean reciprocal rank and 2.19–3.56 in recall@50, and improves the median rank by 528–891 positions.
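The core idea — a shared embedding space trained with positive/negative pairs, plus a memory that retains embeddings from past iterations so negatives can be mined beyond the current batch — can be illustrated with a simplified NumPy sketch. Everything here is a hypothetical toy: the random linear "encoders", the `EmbeddingMemory` class, the margin value, and the hinge-style loss are illustrative assumptions, not the paper's actual architecture or loss function.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x, axis=-1):
    # project embeddings onto the unit sphere so dot products are cosine similarities
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

class EmbeddingMemory:
    """Fixed-size FIFO memory holding embeddings from previous iterations (toy version)."""
    def __init__(self, size, dim):
        self.size = size
        self.feats = np.zeros((0, dim))

    def enqueue(self, batch):
        # append the newest batch and keep only the most recent `size` entries
        self.feats = np.concatenate([self.feats, batch])[-self.size:]

def contrastive_loss(anchors, positives, memory_feats, margin=0.5):
    """Hinge-style metric loss (an assumption, not the paper's exact objective):
    pull positive pairs together; push anchors away from memory negatives."""
    pos_sim = np.sum(anchors * positives, axis=1)        # (B,) cosine of positive pairs
    if memory_feats.shape[0] == 0:
        neg_term = 0.0                                    # memory is empty at iteration 0
    else:
        neg_sim = anchors @ memory_feats.T                # (B, M) similarities to memory
        neg_term = np.maximum(0.0, neg_sim - margin).mean()
    return (1.0 - pos_sim).mean() + neg_term

# Toy "encoders": random linear projections into a shared 16-d space.
dim_audio, dim_image, dim_embed = 32, 48, 16
W_music = rng.normal(size=(dim_audio, dim_embed))
W_image = rng.normal(size=(dim_image, dim_embed))

music_mem = EmbeddingMemory(size=256, dim=dim_embed)
image_mem = EmbeddingMemory(size=256, dim=dim_embed)

# One training iteration on a batch of 8 paired examples.
batch = 8
audio = rng.normal(size=(batch, dim_audio))
images = rng.normal(size=(batch, dim_image))
z_m = l2_normalize(audio @ W_music)   # music embeddings
z_i = l2_normalize(images @ W_image)  # image embeddings

# Cross-modal terms mine negatives from the other modality's memory;
# self-modal terms mine negatives from the same modality's memory.
loss = (contrastive_loss(z_m, z_i, image_mem.feats)
        + contrastive_loss(z_i, z_m, music_mem.feats)
        + contrastive_loss(z_m, z_m, music_mem.feats)
        + contrastive_loss(z_i, z_i, image_mem.feats))

# Store this iteration's embeddings so later iterations can mine them as negatives.
music_mem.enqueue(z_m)
image_mem.enqueue(z_i)
```

In a real implementation the memory entries belonging to the same track would be excluded from the negative set, and the encoders would be deep networks updated by backpropagation; the sketch only shows how a cross-batch memory widens the pool of candidate pairs beyond the current mini-batch.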