Generating visual-adaptive audio representation for audio recognition
Jongsu Youn, Dae Ung Jo, Seungmo Seo, Sukhyun Kim, Jongwon Choi
Pattern Recognition Letters, Volume 192, Pages 65-71, published 2025-03-28
DOI: 10.1016/j.patrec.2025.03.020
URL: https://www.sciencedirect.com/science/article/pii/S0167865525001126
Citations: 0
Abstract
We propose “Visual-adaptive Audio Spectrogram Generation” (VASG), an audio feature generation method that preserves the structure of the Mel-spectrogram while enhancing its discriminability. VASG maintains the spatio-temporal information of the Mel-spectrogram without degrading the performance of existing audio recognition models, and improves intra-class discriminability by incorporating relational knowledge from images. Images are used only during training; once trained, VASG acts as a converter that takes a Mel-spectrogram as input and outputs an enhanced Mel-spectrogram, with no further training required at application time. To increase the discriminability of the encoded audio features, we introduce a novel audio-visual correlation learning loss, the “Batch-wise Correlation Transfer” loss, which aligns the inter-correlation between the audio and visual modalities. When we applied pre-trained VASG to convert environmental sound classification benchmarks, performance improved across various audio classification models. Replacing the original Mel-spectrogram input with the enhanced Mel-spectrograms produced by VASG improved recent state-of-the-art models, with accuracy gains of up to 4.27%.
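The abstract does not give the formula for the “Batch-wise Correlation Transfer” loss, so the following is only a minimal sketch of the general idea of aligning batch-level correlation structure across two modalities. It assumes (our assumption, not the paper's definition) that correlation is measured as pairwise cosine similarity within a batch and that the mismatch is penalized with a mean squared error. The function names `cosine_similarity_matrix` and `batch_correlation_alignment_loss`, the feature dimensions, and the random data are all hypothetical.

```python
import numpy as np


def cosine_similarity_matrix(features: np.ndarray) -> np.ndarray:
    """Return the B x B matrix of pairwise cosine similarities in a batch."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return unit @ unit.T


def batch_correlation_alignment_loss(audio_feats: np.ndarray,
                                     visual_feats: np.ndarray) -> float:
    """Hypothetical stand-in loss (assumption, not the paper's formula):
    mean squared gap between the audio and visual B x B similarity matrices.
    The two modalities may have different feature dimensions."""
    audio_corr = cosine_similarity_matrix(audio_feats)
    visual_corr = cosine_similarity_matrix(visual_feats)
    return float(np.mean((audio_corr - visual_corr) ** 2))


rng = np.random.default_rng(0)
visual = rng.normal(size=(8, 32))          # 8 paired samples, 32-dim image features
audio_random = rng.normal(size=(8, 16))    # unrelated 16-dim audio features

# An orthogonal transform preserves pairwise cosine similarity, so these
# "audio" features share the visual batch structure exactly.
rotation, _ = np.linalg.qr(rng.normal(size=(32, 32)))
audio_aligned = visual @ rotation

loss_random = batch_correlation_alignment_loss(audio_random, visual)
loss_aligned = batch_correlation_alignment_loss(audio_aligned, visual)
print(f"random: {loss_random:.4f}, aligned: {loss_aligned:.2e}")
```

Because the loss compares only the B x B relational structure, the audio and visual encoders never need a shared embedding space, which is consistent with the abstract's statement that images are used only during training. In an actual training setup this term would be written in a differentiable framework and combined with whatever reconstruction or classification objectives the method uses.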
About the journal:
Pattern Recognition Letters aims at rapid publication of concise articles of broad interest in pattern recognition.
Subject areas include all the current fields of interest represented by the Technical Committees of the International Association of Pattern Recognition, and other developing themes involving learning and recognition.