Few-shot learning for E2E speech recognition: architectural variants for support set generation
Dhanya Eledath, Narasimha Rao Thurlapati, V. Pavithra, Tirthankar Banerjee, V. Ramasubramanian
2022 30th European Signal Processing Conference (EUSIPCO), published 29 August 2022. DOI: 10.23919/eusipco55093.2022.9909613
In this paper, we propose two architectural variants of our recent adaptation of the ‘few-shot learning’ (FSL) framework ‘Matching Networks’ (MN) to end-to-end (E2E) continuous speech recognition (CSR), in a formulation termed ‘MN-CTC’, which involves a CTC-loss based end-to-end episodic training of MN and an associated CTC-based decoding of continuous speech. An important component of the MN theory is the labelled support set used during training and inference. The architectural variants proposed and studied here for E2E CSR, namely the ‘Uncoupled MN-CTC’ and the ‘Coupled MN-CTC’, address the problem of generating supervised support sets from continuous speech. While the ‘Uncoupled MN-CTC’ generates the support sets ‘outside’ the MN architecture, the ‘Coupled MN-CTC’ is a derivative framework that generates the support set ‘within’ the MN architecture through a multi-task formulation coupling the support-set generation loss with the main MN-CTC loss, thereby jointly optimizing the support sets and the embedding functions of MN. On the TIMIT and Librispeech datasets, we establish the few-shot effectiveness of the proposed variants in terms of phone error rate (PER) and letter error rate (LER), and also demonstrate the cross-domain applicability of the MN-CTC formulation: a Librispeech-trained ‘Coupled MN-CTC’ variant inferencing on the low-resource TIMIT target corpus achieves an 8% (absolute) LER advantage over the single-domain (TIMIT-only) scenario.
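To make the coupling concrete, one plausible reading of the joint objective in the ‘Coupled MN-CTC’ variant is a weighted sum of the two losses; the interpolation weight and the symbols below are our own illustrative notation under that assumption, not taken from the paper:

\[
\mathcal{L}_{\text{coupled}} \;=\; \lambda\,\mathcal{L}_{\text{SSG}} \;+\; (1-\lambda)\,\mathcal{L}_{\text{MN-CTC}}, \qquad \lambda \in [0,1],
\]

where \(\mathcal{L}_{\text{SSG}}\) denotes the support-set generation loss, \(\mathcal{L}_{\text{MN-CTC}}\) the main Matching-Network CTC loss computed on query utterances against the generated support set, and \(\lambda\) an assumed task-weighting hyperparameter. Minimizing such a combined objective end-to-end is what allows the support sets and the MN embedding functions to be optimized jointly, as described in the abstract.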