{"title":"C-SupConGAN: Using Contrastive Learning and Trained Data Features for Audio-to-Image Generation","authors":"Haechun Chung, Jong-Kook Kim","doi":"10.1145/3582099.3582121","DOIUrl":null,"url":null,"abstract":"In this paper, the audio-to-image generation problem is investigated, where appropriate images are generated from the audio input. A previous study, Cross-Modal Contrastive Representation Learning (CMCRL), trained using both audios and images to extract useful audio features for audio-to-image generation. The CMCRL upgraded the Generative Adversarial Networks (GAN) to achieve high performance in the generation learning phase, but the GAN showed training instability. In this paper, the C-SupConGAN that uses the conditional supervised contrastive loss (C-SupCon loss) is proposed. C-SupConGAN enhances the conditional contrastive loss (2C loss) of the Contrastive GAN (ContraGAN) that considers data-to-data relationships and data-to-class relationships in the discriminator. The audio and image embeddings extracted from the encoder pre-trained using CMCRL is used to further extend the C-SupCon loss. The extended C-SupCon loss additionally considers relations information between data embedding and the corresponding audio embedding (data-to-source relationships) or between data embedding and the corresponding image embedding (data-to-target relationships). Extensive experiments show that the proposed method improved performance, generates higher quality images for audio-to-image generation than previous research, and effectively alleviates the training collapse of GAN.","PeriodicalId":222372,"journal":{"name":"Proceedings of the 2022 5th Artificial Intelligence and Cloud Computing Conference","volume":"143 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2022 5th Artificial Intelligence and Cloud Computing Conference","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3582099.3582121","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
In this paper, the audio-to-image generation problem is investigated, where appropriate images are generated from audio input. A previous study, Cross-Modal Contrastive Representation Learning (CMCRL), was trained on both audio and images to extract audio features useful for audio-to-image generation. CMCRL improved upon Generative Adversarial Networks (GANs) to achieve high performance in the generation learning phase, but the GAN exhibited training instability. In this paper, C-SupConGAN, which uses a conditional supervised contrastive loss (C-SupCon loss), is proposed. C-SupConGAN enhances the conditional contrastive loss (2C loss) of the Contrastive GAN (ContraGAN), which considers data-to-data and data-to-class relationships in the discriminator. The audio and image embeddings extracted from the encoder pre-trained with CMCRL are used to further extend the C-SupCon loss. The extended C-SupCon loss additionally considers relational information between a data embedding and the corresponding audio embedding (data-to-source relationships) or between a data embedding and the corresponding image embedding (data-to-target relationships). Extensive experiments show that the proposed method improves performance, generates higher-quality images for audio-to-image generation than previous research, and effectively alleviates GAN training collapse.
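For context, the 2C loss that the abstract builds on takes the following form as published in the ContraGAN paper (Kang and Park, NeurIPS 2020); the notation below is reconstructed from that paper, and the C-SupCon extension sketched afterwards is only a plausible reading of the abstract's description, not the authors' exact formulation:

\[
\ell_{2C}(x_i, y_i; t) = -\log \frac{\exp\!\big(l(x_i)^{\top} e(y_i)/t\big) + \sum_{k=1}^{m} \mathbb{1}_{[y_k = y_i]} \exp\!\big(l(x_i)^{\top} l(x_k)/t\big)}{\exp\!\big(l(x_i)^{\top} e(y_i)/t\big) + \sum_{k=1}^{m} \mathbb{1}_{[k \neq i]} \exp\!\big(l(x_i)^{\top} l(x_k)/t\big)}
\]

where \(l(\cdot)\) is the discriminator's embedding of an image \(x_i\), \(e(y_i)\) is a learnable embedding of its class label \(y_i\), \(t\) is a temperature, and \(m\) is the batch size: the numerator pulls \(l(x_i)\) toward its class embedding and toward same-class samples, while the denominator pushes it away from all other samples. Following the abstract, the extended C-SupCon loss would plausibly add analogous positive terms \(\exp(l(x_i)^{\top} a_i / t)\) and \(\exp(l(x_i)^{\top} v_i / t)\) to both numerator and denominator, where \(a_i\) and \(v_i\) denote the CMCRL-pretrained audio embedding (data-to-source) and image embedding (data-to-target) of the same sample; the exact terms and their weighting in the paper may differ.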