Unsupervised k-means clustering based out-of-set candidate selection for robust open-set language recognition

2016 IEEE Spoken Language Technology Workshop (SLT), Pub Date: 2016-12-01, DOI: 10.1109/SLT.2016.7846284

Qian Zhang, J. Hansen


Citations: 1

Abstract

Research in open-set language identification (LID) generally focuses more on accurate in-set language modeling than on improved out-of-set (OOS) language rejection. The main reason is the greater cost and resources needed to collect sufficient OOS data compared with data for the in-set languages of interest. Rejecting unknown or OOS languages therefore remains a challenge. To address this through efficient data collection, we propose a flexible OOS candidate selection method for universal OOS language coverage. Since a state-of-the-art i-vector system followed by a generative Gaussian back-end achieves effective performance for LID, the selected K candidates are expected to be general enough to represent the entire OOS language space. Therefore, an unsupervised k-means clustering approach is proposed for effective OOS candidate selection. The method is evaluated on a dataset derived from a large-scale corpus (LRE-09) that contains 40 languages. With the proposed selection method, the total OOS training diversity can be reduced by 89% while still achieving better performance on both OOS rejection and overall classification. The proposed method also shows clear benefits for greater data enhancement. The proposed solution therefore achieves sustained performance while efficiently employing a minimum number of OOS language candidates.
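The abstract does not say exactly what is clustered or how a candidate is drawn from each cluster, so the following is only an illustrative sketch under stated assumptions: each OOS language is summarized by one mean embedding (for example, a mean i-vector), plain k-means is run over those language-level vectors, and the language nearest to each centroid is kept as a candidate. The function name `select_oos_candidates`, its parameters, and the length-normalization step are hypothetical and are not taken from the paper.

```python
import numpy as np


def select_oos_candidates(lang_means, k, n_iter=50, seed=0):
    """Pick up to k representative languages via k-means (illustrative sketch).

    lang_means: dict mapping a language name to a 1-D embedding,
                e.g. a per-language mean i-vector (an assumption).
    Returns a sorted list of language names, one nearest to each centroid.
    """
    names = sorted(lang_means)
    X = np.stack([np.asarray(lang_means[n], dtype=float) for n in names])
    # Length-normalize each vector (assumed preprocessing, not from the paper).
    X = X / np.linalg.norm(X, axis=1, keepdims=True)

    # Initialize centroids from k distinct languages chosen at random.
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    for _ in range(n_iter):
        # Assign every language to its nearest centroid (squared Euclidean).
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centroids; keep the old centroid if a cluster is empty.
        new_centroids = np.stack([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    # Keep the actual language closest to each centroid as a candidate.
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    chosen = {names[i] for i in dists.argmin(axis=0)}
    return sorted(chosen)
```

Calling `select_oos_candidates(lang_means, k=4)` on a dictionary of per-language vectors returns at most four language names whose data would then be used to train the OOS model. Plain k-means depends on its initialization, so in practice one would likely compare several seeds; the result can also contain fewer than `k` names if two centroids share the same nearest language.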

