JHU Diarization System Description

Zili Huang, Leibny Paola García-Perera, J. Villalba, Daniel Povey, N. Dehak
IberSPEECH Conference, published 2018-11-21. DOI: 10.21437/IBERSPEECH.2018-49 (https://doi.org/10.21437/IBERSPEECH.2018-49)
Citations: 5

Abstract

We present the JHU system for the IberSPEECH-RTVE Speaker Diarization Evaluation. The evaluation combines Spanish-language speech and broadcast audio in the same recordings, conditions under which our system had not been tested before. To tackle this problem, our general pipeline, developed entirely in Kaldi, consists of acoustic feature extraction, speech activity detection (SAD), an embedding extractor, a PLDA model and a clustering stage. The same pipeline was used for both the open and the closed conditions (described in the evaluation plan). All the proposed solutions use wide-band data (16 kHz) and MFCCs as input. For the closed condition, the system trains a DNN SAD on the Albayzin2016 data. Because little data was available, i-vector embedding extraction was the only approach explored for this condition. The PLDA is trained on Albayzin data, and Agglomerative Hierarchical Clustering (AHC) then produces the speaker segmentation. The open condition reuses the DNN SAD obtained in the closed condition. Four types of embeddings were extracted: x-vector-basic, x-vector-factored, i-vector-basic and BNF-i-vector. The x-vector-basic is a TDNN trained on augmented VoxCeleb1 and VoxCeleb2. The x-vector-factored is a factored TDNN (TDNN-F) trained on SRE12-micphn, MX6-micphn, VoxCeleb and SITW-dev-core. The i-vector-basic was trained on VoxCeleb1 and VoxCeleb2 data (no augmentation). The BNF-i-vector is a BNF-posterior i-vector trained on the same data as x-vector-factored. The PLDA training for the new scenario uses the Albayzin2016 data. The four systems were fused at the score level. Once again, AHC computed the final speaker segmentation. We tested our systems on the Albayzin2018 dev2 data and observed that the SAD is important for improving the results. Moreover, x-vectors outperformed i-vectors, as already observed in previous experiments.
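The last two stages named in the abstract, score-level fusion followed by AHC, can be illustrated with a small sketch. This is not the Kaldi implementation used by the authors. The abstract does not give the fusion weights, the linkage criterion, or the stopping threshold, so the equal weights, average linkage, threshold of 0.0, and the toy score matrices below are all assumptions made for illustration. The function names `fuse_scores` and `ahc` are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def fuse_scores(matrices: List[Matrix], weights: List[float]) -> Matrix:
    """Weighted sum of same-shaped pairwise score matrices (score-level fusion)."""
    n = len(matrices[0])
    fused = [[0.0] * n for _ in range(n)]
    for matrix, weight in zip(matrices, weights):
        for i in range(n):
            for j in range(n):
                fused[i][j] += weight * matrix[i][j]
    return fused


def ahc(scores: Matrix, threshold: float) -> List[int]:
    """Average-linkage AHC on a symmetric similarity matrix.

    Merges the two most similar clusters repeatedly and stops once the
    best remaining similarity falls below `threshold` (an assumed
    stopping rule). Returns one cluster label per segment.
    """
    clusters = [[i] for i in range(len(scores))]

    def similarity(a: List[int], b: List[int]) -> float:
        # Mean pairwise score between the members of two clusters.
        return sum(scores[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best, pair = None, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = similarity(clusters[x], clusters[y])
                if best is None or s > best:
                    best, pair = s, (x, y)
        if best < threshold:
            break
        x, y = pair
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]

    labels = [0] * len(scores)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Toy pairwise scores for four segments from two hypothetical systems.
# Higher means "more likely the same speaker"; the diagonal is unused.
sys_a = [[0.0, 2.0, -1.5, -2.0],
         [2.0, 0.0, -1.0, -1.8],
         [-1.5, -1.0, 0.0, 1.6],
         [-2.0, -1.8, 1.6, 0.0]]
sys_b = [[0.0, 1.0, -2.5, -1.0],
         [1.0, 0.0, -2.0, -1.2],
         [-2.5, -2.0, 0.0, 2.4],
         [-1.0, -1.2, 2.4, 0.0]]

fused = fuse_scores([sys_a, sys_b], [0.5, 0.5])
labels = ahc(fused, threshold=0.0)
print(labels)
```

With these toy scores, segments 0 and 1 end up in one cluster and segments 2 and 3 in another. In a real system the pairwise scores would come from PLDA scoring of the segment embeddings, and the threshold would be tuned on development data.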