Transfer Learning Using Raw Waveform SincNet for Robust Speaker Diarization

ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Pub Date: 2019-05-12. DOI: 10.1109/ICASSP.2019.8683023

Harishchandra Dubey, A. Sangwan, J. Hansen

Pages: 6296-6300

Citations: 11

Abstract

Speaker diarization determines who spoke and when in an audio stream. SincNet is a recently developed convolutional neural network (CNN) architecture whose first layer consists of parameterized sinc filters. Unlike conventional CNNs, SincNet takes the raw speech waveform as input. This paper uses SincNet in a vanilla transfer learning (VTL) setup. Out-of-domain data is used to train SincNet-VTL for frame-level speaker classification. The trained SincNet-VTL is then used as a feature extractor for in-domain data. We investigated pooling strategies (max, avg) for deriving utterance-level embeddings from the frame-level features extracted by the trained network. These utterance/segment-level embeddings serve as speaker models during the clustering stage of the diarization pipeline. We compared the proposed SincNet-VTL embeddings with baseline i-vector features and evaluated our approach on two corpora, CRSS-PLTL and AMI. Results show that the trained SincNet-VTL yields speaker-discriminative embeddings even when trained on a small amount of data. The proposed features achieved relative diarization error rate (DER) improvements of 19.12% and 52.07% on CRSS-PLTL and AMI data, respectively, over baseline i-vectors.
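The pooling step mentioned in the abstract can be illustrated with a minimal sketch. It assumes that frame-level features for one segment arrive as a list of equal-length vectors and that pooling is applied independently per feature dimension. The function name `pool_frames` and the toy values are hypothetical, not taken from the paper, and plain Python lists stand in for whatever tensor library the authors used.

```python
def pool_frames(frame_features, mode="avg"):
    """Collapse frame-level vectors into one segment-level embedding.

    frame_features: non-empty list of equal-length lists of floats,
                    one inner list per frame (hypothetical layout).
    mode: "avg" for average pooling, "max" for max pooling.
    """
    if not frame_features:
        raise ValueError("expected at least one frame")
    # Transpose so each tuple holds one feature dimension across all frames.
    columns = list(zip(*frame_features))
    if mode == "avg":
        return [sum(col) / len(col) for col in columns]
    if mode == "max":
        return [max(col) for col in columns]
    raise ValueError(f"unknown pooling mode: {mode!r}")


# Toy frame-level features for one segment: 4 frames, 3 dimensions.
frames = [
    [1.0, 0.0, 2.0],
    [3.0, 1.0, 0.0],
    [2.0, 5.0, 1.0],
    [2.0, 2.0, 1.0],
]
avg_embedding = pool_frames(frames, "avg")
max_embedding = pool_frames(frames, "max")
```

Either pooled vector has the same dimensionality as a single frame-level feature, so segments of different lengths map to embeddings of a fixed size that a clustering stage can compare directly.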

