{"title":"Importance of excitation source and sequence learning towards spoken language identification task","authors":"Jagabandhu Mishra, Soma Siddhartha, S. Prasanna","doi":"10.1109/NCC55593.2022.9806768","journal":{"name":"2022 National Conference on Communications (NCC)","volume":"46 1","pages":"0"},"publicationDate":"2022-05-24","publicationTypes":"Journal Article","isOpenAccess":false,"citationCount":"1","platform":"Semanticscholar","PeriodicalName":"2022 National Conference on Communications (NCC)","ListUrlMain":"https://doi.org/10.1109/NCC55593.2022.9806768"}
Importance of excitation source and sequence learning towards spoken language identification task
Spoken language identification (LID) systems generally capture the long-term temporal dynamics present in the speech signal. To do so, sequence modeling techniques are applied after feature extraction. However, the performance of spoken LID systems degrades in cross-channel and noisy scenarios. The literature shows that excitation source information is beneficial in such scenarios, and excitation features are also used alongside spectral features as complementary evidence in spoken LID systems. Motivated by this, this work proposes an excitation-based feature called the integrated residual linear frequency cepstral coefficient (IRLFCC). It also compares several deep-learning-based sequence modeling architectures in terms of how well they capture language-specific information. The experiments are performed on the OLR2020 dataset. In the cross-channel scenario, the proposed best system provides relative improvements of 70.5% and 57.2% over the baseline in terms of $EER_{avg}$ and $C_{avg}$, respectively. Similarly, in the noisy scenario, the proposed best system provides relative improvements of 37.8% and 45% over the baseline system.
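
The abstract reports its results as relative improvements without stating the formula. A minimal sketch of the usual definition for lower-is-better metrics such as $EER_{avg}$ and $C_{avg}$ follows. It assumes the improvement is measured as the reduction relative to the baseline value; the function name `relative_improvement` and the input numbers are hypothetical and chosen only for illustration, not taken from the paper.

```python
def relative_improvement(baseline: float, proposed: float) -> float:
    """Relative improvement (in percent) of a lower-is-better metric.

    Assumed definition: 100 * (baseline - proposed) / baseline.
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - proposed) / baseline


if __name__ == "__main__":
    # Hypothetical values, NOT results from the paper: a baseline error
    # rate of 20.0% reduced to 5.9% by a proposed system.
    print(f"{relative_improvement(20.0, 5.9):.1f}% relative improvement")
```

Under this definition, a value of 70.5% means the proposed system's error is 29.5% of the baseline's error, not that the error fell by 70.5 absolute percentage points.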