Multi-target hybrid CTC-Attentional Decoder for joint phoneme-grapheme recognition
Authors: Shreekantha Nadig, V. Ramasubramanian, Sachit Rao
Venue: 2020 International Conference on Signal Processing and Communications (SPCOM), 2020-07-01
DOI: 10.1109/SPCOM50965.2020.9179603
Citations: 6
In traditional Automatic Speech Recognition (ASR) systems, such as HMM-based architectures, words are predicted using either phonemes or graphemes as sub-word units. In this paper, we explore joint phoneme-grapheme decoding using an Encoder-Decoder network with a hybrid Connectionist Temporal Classification (CTC) and Attention mechanism. The Encoder network is shared between two Attentional Decoders, which individually learn to predict phonemes and graphemes from a single, shared Encoder representation. This Encoder and multi-decoder network is trained in a multi-task setting to minimize the prediction error for both phoneme and grapheme sequences. We also implement the phoneme decoder at an intermediate layer of the Encoder and demonstrate the performance benefits of such an architecture. Through experiments on different architectural choices, using the TIMIT and Librispeech 100-hour datasets, we demonstrate that this approach improves performance over baseline independent phoneme and grapheme recognition systems.
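The core idea in the abstract is a shared encoder feeding two task-specific decoder heads, trained with a weighted multi-task objective. Below is a minimal NumPy sketch of that structure, not the paper's actual model: the linear layers, dimensions, frame-level targets, and the interpolation weight `lam` are all illustrative assumptions, and the real system uses recurrent encoders, attentional decoders, and a CTC loss rather than per-frame cross-entropy.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, targets):
    # Mean negative log-likelihood of the target symbols (stand-in for
    # the CTC / attention losses used in the actual paper).
    probs = softmax(logits)
    return -np.mean(np.log(probs[np.arange(len(targets)), targets] + 1e-9))

# Hypothetical dimensions, not taken from the paper.
T, d_in, d_enc = 8, 13, 16          # frames, feature dim, encoder dim
n_phonemes, n_graphemes = 40, 29    # illustrative inventory sizes

# Shared encoder: one linear layer stands in for the real recurrent encoder.
W_enc = rng.normal(size=(d_in, d_enc))
# Two decoder heads read the same encoder representation.
W_ph = rng.normal(size=(d_enc, n_phonemes))
W_gr = rng.normal(size=(d_enc, n_graphemes))

x = rng.normal(size=(T, d_in))      # dummy acoustic features
h = np.tanh(x @ W_enc)              # shared encoder representation

ph_targets = rng.integers(0, n_phonemes, size=T)
gr_targets = rng.integers(0, n_graphemes, size=T)

# Multi-task objective: weighted sum of the two decoders' losses, so
# gradients from both tasks shape the shared encoder.
lam = 0.5
loss_ph = cross_entropy(h @ W_ph, ph_targets)
loss_gr = cross_entropy(h @ W_gr, gr_targets)
loss = lam * loss_ph + (1 - lam) * loss_gr
print(f"phoneme loss={loss_ph:.3f}  grapheme loss={loss_gr:.3f}  joint loss={loss:.3f}")
```

In this sketch both heads backpropagate through `W_enc`, which is the mechanism by which multi-task training regularizes the shared representation; the paper's intermediate-layer phoneme decoder would correspond to attaching the phoneme head to an earlier encoder layer instead of the final one.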