CAs-Net: A Channel-Aware Speech Network for Uyghur Speech Recognition
Authors: Jiang Zhang, Miaomiao Xu, Lianghui Xu, Yajing Ma
Journal: Sensors, vol. 25, no. 12. Published 2025-06-17.
DOI: 10.3390/s25123783 (https://doi.org/10.3390/s25123783)
This paper proposes a Channel-Aware Speech Network (CAs-Net) for low-resource speech recognition tasks, aiming to improve recognition performance for languages such as Uyghur under complex noisy conditions. The proposed model consists of two key components: (1) the Channel Rotation Module (CIM), which reconstructs each frame's channel vector into a spatial structure and applies a rotation operation to explicitly model the local structural relationships within the channel dimension, thereby enhancing the encoder's contextual modeling capability; and (2) the Multi-Scale Depthwise Convolution Module (MSDCM), integrated within the Transformer framework, which leverages multi-branch depthwise separable convolutions and a lightweight self-attention mechanism to jointly capture multi-scale temporal patterns, thus improving the model's perception of compact articulation and complex rhythmic structures. Experiments conducted on a real Uyghur speech recognition dataset demonstrate that CAs-Net achieves the best performance across multiple subsets, with an average Word Error Rate (WER) of 5.72%, significantly outperforming existing approaches. These results validate the robustness and effectiveness of the proposed model under low-resource and noisy conditions.
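The headline result is reported as a Word Error Rate, so a small worked example may help readers unfamiliar with the metric. The sketch below uses the standard definition of WER: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. It is not the authors' evaluation code. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration, and the abstract does not say how the per-subset scores were aggregated into the 5.72% average.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level edit distance / number of reference words.

    Assumption for illustration: words are separated by whitespace.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Illustrative placeholder sentences, not data from the paper.
    print(word_error_rate("the cat sat down", "the dog sat"))
```

In the placeholder example, one substitution and one deletion against four reference words give a WER of 0.5, that is, 50%. Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.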
About the journal:
Sensors (ISSN 1424-8220) provides an advanced forum for the science and technology of sensors and biosensors. It publishes reviews (including comprehensive reviews of complete sensor products), regular research papers, and short notes. Our aim is to encourage scientists to publish their experimental and theoretical results in as much detail as possible. There is no restriction on the length of papers. Full experimental details must be provided so that the results can be reproduced.