ODESSA at Albayzin Speaker Diarization Challenge 2018

IberSPEECH Conference, Pub Date: 2018-11-21, DOI: 10.21437/IBERSPEECH.2018-43

Jose Patino, H. Delgado, Ruiqing Yin, H. Bredin, C. Barras, N. Evans

Citations: 7

Abstract

This paper describes the ODESSA submissions to the Albayzin Speaker Diarization Challenge 2018. The challenge addresses the diarization of TV shows. This work explores three different techniques to represent speech segments, namely binary key, x-vector and triplet-loss based embeddings. While training-free methods such as the binary key technique can be applied easily to a scenario where training data is limited, the training of robust neural-embedding extractors is considerably more challenging. However, when training data is plentiful (open-set condition), neural embeddings provide more robust segmentations, giving speaker representations which lead to better diarization performance. The paper also reports our efforts to improve speaker diarization performance through system combination. For systems with a common temporal resolution, fusion is performed at segment level during clustering. When the systems under fusion produce segmentations with an arbitrary resolution, they are combined at solution level. Both approaches to fusion are shown to improve diarization performance.
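The abstract does not spell out how segment-level fusion is carried out, so the following is only a minimal illustrative sketch of one common way to do it, not the ODESSA implementation. It assumes that two systems score the same set of segments, that each produces a segment-by-segment similarity matrix, and that the matrices are fused by a weighted average before a simple average-linkage agglomerative clustering. The function names, the `weight` and `threshold` parameters, and the toy similarity values are all hypothetical.

```python
def fuse_similarities(sim_a, sim_b, weight=0.5):
    """Weighted average of two segment-by-segment similarity matrices.

    Assumption: both systems share a temporal resolution, so entry [i][j]
    refers to the same pair of segments in both matrices.
    """
    n = len(sim_a)
    if len(sim_b) != n or any(len(row) != n for row in sim_a + sim_b):
        raise ValueError("both systems must score the same segments")
    return [
        [weight * sim_a[i][j] + (1.0 - weight) * sim_b[i][j] for j in range(n)]
        for i in range(n)
    ]


def cluster_segments(sim, threshold):
    """Greedy average-linkage agglomerative clustering.

    Merges the most similar pair of clusters until the best average
    similarity falls below `threshold`. Returns one label per segment.
    """
    clusters = [[i] for i in range(len(sim))]

    def linkage(c1, c2):
        return sum(sim[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        best_score, (a, b) = max(
            (linkage(clusters[a], clusters[b]), (a, b))
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if best_score < threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    labels = [0] * len(sim)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Toy similarities for four segments (made-up values, not from the paper).
# System A is confident about segments 0 and 1, system B about 2 and 3.
sim_a = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.4],
    [0.1, 0.2, 0.4, 1.0],
]
sim_b = [
    [1.0, 0.7, 0.1, 0.2],
    [0.7, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.9],
    [0.2, 0.1, 0.9, 1.0],
]

labels_a_only = cluster_segments(sim_a, threshold=0.5)
labels_fused = cluster_segments(fuse_similarities(sim_a, sim_b), threshold=0.5)
print(labels_a_only)  # [0, 0, 1, 2]
print(labels_fused)   # [0, 0, 1, 1]
```

In this toy case, system A alone leaves segments 2 and 3 in separate clusters, while the fused scores group them together. Solution-level fusion, which the abstract reserves for systems with arbitrary resolutions, would instead combine the final segmentations and is not covered by this sketch.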
