Improving audio embeddings with squeeze-and-excitation: Introducing SaEENet
Authors: Andrés Carofilis, Laura Fernández-Robles, Enrique Alegre, Eduardo Fidalgo
Knowledge-Based Systems, volume 324, Article 113875. Published 2025-06-16. DOI: 10.1016/j.knosys.2025.113875
https://www.sciencedirect.com/science/article/pii/S0950705125009219
In this paper, we propose SaEENet, a novel neural network architecture to generate richer embeddings based on those generated by a pre-trained WavLM-large model and a set of convolutional layers fed by MFCCs. We employ 1D depthwise separable convolutions and GRU layers, and, to the best of our knowledge, we introduce for the first time the use of squeeze-and-excitation (SE) blocks for audio embedding weighting. The use of SE allows the model to assign a higher or lower relevance to each embedding generated from small audio segments and thus discard information generated from voiceless segments or segments with non-relevant information. In addition, we evaluated three different approaches for SE blocks to determine the most useful for the selected tasks. SaEENet outperforms similar models, such as the MEWHEV model, in the language identification, accent identification, and speaker identification tasks, achieving an improvement of 0.9%, 1.41%, and 4.01%, respectively, using 31.73% fewer trainable parameters. The results presented show that individual embeddings have varying effects on performance and that the integration of weighting mechanisms in SaEENet enhances accuracy in several speech classification tasks, highlighting the value of this approach for future applications.
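The abstract does not specify how the SE blocks are wired, so the following is only a minimal NumPy sketch of the general idea: a standard squeeze-and-excitation gate applied along the segment axis, so that each segment embedding receives one relevance weight between 0 and 1. The function name `se_weight_embeddings`, the mean-based squeeze, the bottleneck size, and the fixed number of segments are all assumptions made for illustration. They are not taken from the paper, which evaluates three SE variants that are not described here.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def se_weight_embeddings(embeddings, w1, b1, w2, b2):
    """Weight a sequence of segment embeddings with an SE-style gate.

    embeddings: array of shape (T, D), one D-dimensional embedding per segment.
    w1, b1:     bottleneck layer, shapes (T, H) and (H,)  (hypothetical sizes).
    w2, b2:     expansion layer, shapes (H, T) and (T,).
    Returns (weighted, gates) with shapes (T, D) and (T,).
    """
    # Squeeze: summarise every segment embedding as a single scalar.
    squeezed = embeddings.mean(axis=1)                 # (T,)
    # Excitation: small bottleneck MLP, ReLU then sigmoid.
    hidden = np.maximum(0.0, squeezed @ w1 + b1)       # (H,)
    gates = sigmoid(hidden @ w2 + b2)                  # (T,), values in (0, 1)
    # Scale: multiply each segment embedding by its own gate.
    weighted = embeddings * gates[:, None]             # (T, D)
    return weighted, gates


# Toy usage with random (untrained) parameters: 6 segments, 8 features.
rng = np.random.default_rng(0)
T, D, H = 6, 8, 3
embeddings = rng.normal(size=(T, D))
w1, b1 = rng.normal(scale=0.1, size=(T, H)), np.zeros(H)
w2, b2 = rng.normal(scale=0.1, size=(H, T)), np.zeros(T)

weighted, gates = se_weight_embeddings(embeddings, w1, b1, w2, b2)
print(gates.shape, weighted.shape)
```

In a trained model, a gate close to 0 would suppress a segment (for example a voiceless one) and a gate close to 1 would keep it largely intact, which matches the role the abstract attributes to SE. Note that gating along the segment axis in this way assumes a fixed number of segments per input.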
About the journal:
Knowledge-Based Systems, an international and interdisciplinary journal in artificial intelligence, publishes original, innovative, and creative research results in the field. It focuses on knowledge-based and other artificial intelligence techniques-based systems. The journal aims to support human prediction and decision-making through data science and computation techniques, provide a balanced coverage of theory and practical study, and encourage the development and implementation of knowledge-based intelligence models, methods, systems, and software tools. Applications in business, government, education, engineering, and healthcare are emphasized.