Model-Based Multiple Pitch Tracking Using Factorial HMMs: Model Adaptation and Inference

IEEE Transactions on Audio Speech and Language Processing Pub Date : 2013-08-01 DOI:10.1109/TASL.2013.2260744

Michael Wohlmayr, F. Pernkopf

{"title":"Model-Based Multiple Pitch Tracking Using Factorial HMMs: Model Adaptation and Inference","authors":"Michael Wohlmayr, F. Pernkopf","doi":"10.1109/TASL.2013.2260744","DOIUrl":null,"url":null,"abstract":"Robustness against noise and interfering audio signals is one of the challenges in speech recognition and audio analysis technology. One avenue to approach this challenge is single-channel multiple-source modeling. Factorial hidden Markov models (FHMMs) are capable of modeling acoustic scenes with multiple sources interacting over time. While these models reach good performance on specific tasks, there are still serious limitations restricting the applicability in many domains. In this paper, we generalize these models and enhance their applicability. In particular, we develop an EM-like iterative adaptation framework which is capable to adapt the model parameters to the specific situation (e.g. actual speakers, gain, acoustic channel, etc.) using only speech mixture data. Currently, source-specific data is required to learn the model. Inference in FHMMs is an essential ingredient for adaptation. We develop efficient approaches based on observation likelihood pruning. Both adaptation and efficient inference are empirically evaluated for the task of multipitch tracking using the GRID corpus.","PeriodicalId":55014,"journal":{"name":"IEEE Transactions on Audio Speech and Language Processing","volume":"48 1","pages":"1742-1754"},"PeriodicalIF":0.0000,"publicationDate":"2013-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TASL.2013.2260744","citationCount":"11","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Audio Speech and Language Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/TASL.2013.2260744","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 11

Abstract

Robustness against noise and interfering audio signals is one of the challenges in speech recognition and audio analysis technology. One avenue to approach this challenge is single-channel multiple-source modeling. Factorial hidden Markov models (FHMMs) are capable of modeling acoustic scenes with multiple sources interacting over time. While these models reach good performance on specific tasks, there are still serious limitations restricting the applicability in many domains. In this paper, we generalize these models and enhance their applicability. In particular, we develop an EM-like iterative adaptation framework which is capable to adapt the model parameters to the specific situation (e.g. actual speakers, gain, acoustic channel, etc.) using only speech mixture data. Currently, source-specific data is required to learn the model. Inference in FHMMs is an essential ingredient for adaptation. We develop efficient approaches based on observation likelihood pruning. Both adaptation and efficient inference are empirically evaluated for the task of multipitch tracking using the GRID corpus.

查看原文本刊更多论文

基于模型的多音高跟踪:模型自适应与推理

对噪声和干扰音频信号的鲁棒性是语音识别和音频分析技术面临的挑战之一。解决这一挑战的一个途径是单通道多源建模。阶乘隐马尔可夫模型(fhmm)能够对多个声源随时间相互作用的声学场景进行建模。虽然这些模型在特定任务上达到了良好的性能，但在许多领域的适用性仍然存在严重的局限性。本文对这些模型进行了推广，提高了它们的适用性。特别是，我们开发了一个类似于em的迭代自适应框架，该框架能够仅使用语音混合数据使模型参数适应特定情况(例如实际扬声器，增益，声学通道等)。目前，需要特定于源的数据来学习模型。fhmm中的推理是适应的重要组成部分。我们开发了基于观察似然修剪的有效方法。利用网格语料库对多音高跟踪任务的自适应和有效推理进行了实证评估。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

IEEE Transactions on Audio Speech and Language Processing 工程技术-工程：电子与电气

自引率

0.00%

发文量

审稿时长

24.0 months

期刊介绍： The IEEE Transactions on Audio, Speech and Language Processing covers the sciences, technologies and applications relating to the analysis, coding, enhancement, recognition and synthesis of audio, music, speech and language. In particular, audio processing also covers auditory modeling, acoustic modeling and source separation. Speech processing also covers speech production and perception, adaptation, lexical modeling and speaker recognition. Language processing also covers spoken language understanding, translation, summarization, mining, general language modeling, as well as spoken dialog systems.