A Study on the Performance Evaluation of Machine Learning Models for Phoneme Classification

Ali Shariq Imran, Abdolreza Sabzi Shahrebabaki, Negar Olfati, T. Svendsen
{"title":"A Study on the Performance Evaluation of Machine Learning Models for Phoneme Classification","authors":"Ali Shariq Imran, Abdolreza Sabzi Shahrebabaki, Negar Olfati, T. Svendsen","doi":"10.1145/3318299.3318385","DOIUrl":null,"url":null,"abstract":"This paper provides a comparative performance analysis of both shallow and deep machine learning classifiers for speech recognition task using frame-level phoneme classification. Phoneme recognition is still a fundamental and equally crucial initial step toward automatic speech recognition (ASR) systems. Often conventional classifiers perform exceptionally well on domain-specific ASR systems having a limited set of vocabulary and training data in contrast to deep learning approaches. It is thus imperative to evaluate performance of a system using deep artificial networks in terms of correctly recognizing atomic speech units, i.e., phonemes in this case with conventional state-of-the-art machine learning classifiers. Two deep learning models - DNN and LSTM with multiple configuration architectures by varying the number of layers and the number of neurons in each layer on the OLLO speech corpora along with six shallow machine learning classifiers for Filterbank acoustic features are thoroughly studied.\n Additionally, features with three and ten frames temporal context are computed and compared with no-context features for different models. The classifier's performance is evaluated in terms of precision, recall, and F1 score for 14 consonants and 10 vowels classes for 10 speakers with 4 different dialects. High classification accuracy of 93% and 95% F1 score is obtained with DNN and LSTM networks respectively on context-dependent features for 3-hidden layers containing 1024 nodes each. SVM surprisingly obtained even a higher classification score of 96.13% and a misclassification error of less than 5% for consonants and 4% for vowels.","PeriodicalId":164987,"journal":{"name":"International Conference on Machine Learning and Computing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Conference on Machine Learning and Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3318299.3318385","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 3

Abstract

This paper provides a comparative performance analysis of shallow and deep machine learning classifiers for the speech recognition task using frame-level phoneme classification. Phoneme recognition remains a fundamental and crucial initial step toward automatic speech recognition (ASR) systems. Conventional classifiers often perform exceptionally well on domain-specific ASR systems with a limited vocabulary and training set, in contrast to deep learning approaches. It is thus imperative to evaluate how well systems based on deep artificial networks recognize atomic speech units, i.e., phonemes, against conventional state-of-the-art machine learning classifiers. Two deep learning models, DNN and LSTM, with multiple configurations obtained by varying the number of layers and the number of neurons per layer, are thoroughly studied on the OLLO speech corpus, along with six shallow machine learning classifiers, all using filterbank acoustic features. Additionally, features with three- and ten-frame temporal context are computed and compared with context-free features for the different models. Classifier performance is evaluated in terms of precision, recall, and F1 score for 14 consonant and 10 vowel classes from 10 speakers with 4 different dialects. High classification accuracy of 93% and a 95% F1 score are obtained with the DNN and LSTM networks, respectively, on context-dependent features using 3 hidden layers of 1024 nodes each. Surprisingly, the SVM obtained an even higher classification score of 96.13%, with a misclassification error of less than 5% for consonants and 4% for vowels.
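The abstract outlines the core experimental pipeline: frame-level filterbank features spliced with a temporal context, a DNN classifier with 3 hidden layers of 1024 nodes, and evaluation in terms of precision, recall, and F1. The following is a minimal sketch of such a pipeline using NumPy and scikit-learn; it is not the authors' code. The random arrays stand in for OLLO corpus features and phoneme labels, and the splice_context helper and all hyperparameters other than the 3x1024 layer layout are illustrative assumptions.

```python
# Minimal sketch (not the authors' implementation) of frame-level phoneme
# classification: filterbank frames spliced with symmetric temporal context,
# a 3-hidden-layer DNN (1024 units each), and precision/recall/F1 scoring.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import precision_recall_fscore_support

def splice_context(feats, context):
    """Stack each frame with `context` frames on either side.

    feats: (num_frames, num_filterbank_bins) array.
    Returns (num_frames, (2 * context + 1) * num_filterbank_bins).
    """
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + len(feats)] for i in range(2 * context + 1)])

# Hypothetical stand-in data: 40-dimensional filterbank frames with integer
# phoneme labels for 24 classes (14 consonants + 10 vowels, as in the paper).
rng = np.random.default_rng(0)
X_train = rng.standard_normal((2000, 40))
y_train = rng.integers(0, 24, size=2000)
X_test = rng.standard_normal((500, 40))
y_test = rng.integers(0, 24, size=500)

context = 3  # the paper also reports a 10-frame temporal context
X_train_ctx = splice_context(X_train, context)
X_test_ctx = splice_context(X_test, context)

# DNN analogous to the reported configuration: 3 hidden layers, 1024 nodes each.
# max_iter is kept small here only so the sketch finishes quickly.
dnn = MLPClassifier(hidden_layer_sizes=(1024, 1024, 1024),
                    activation="relu", max_iter=50)
dnn.fit(X_train_ctx, y_train)

y_pred = dnn.predict(X_test_ctx)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, y_pred, average="macro", zero_division=0)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```

A shallow baseline such as the SVM mentioned in the abstract could be slotted into the same sketch by replacing MLPClassifier with sklearn.svm.SVC on the identical spliced features.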