Title: Unsupervised k-means clustering based out-of-set candidate selection for robust open-set language recognition
Authors: Qian Zhang, J. Hansen
Venue: 2016 IEEE Spoken Language Technology Workshop (SLT)
Publication date: 2016-12-01
DOI: 10.1109/SLT.2016.7846284
Citations: 1
Abstract
Research in open-set language identification (LID) generally focuses more on accurate in-set language modeling than on improved out-of-set (OOS) language rejection, mainly because collecting sufficient OOS data is costlier than collecting data for the in-set languages of interest. Unknown or OOS language rejection therefore remains a challenge. To address this through efficient data collection, we propose a flexible OOS candidate selection method for universal OOS language coverage. Since a state-of-the-art i-vector system followed by a generative Gaussian back-end achieves effective LID performance, the selected K candidates are expected to be general enough to represent the entire OOS language space. We therefore propose an unsupervised k-means clustering approach for effective OOS candidate selection. The method is evaluated on a dataset derived from a large-scale 40-language corpus (LRE-09). With the proposed selection method, the total OOS training diversity can be reduced by 89% while still achieving better performance on both OOS rejection and overall classification. The proposed method also shows clear benefits as more data is added. The proposed solution thus achieves sustained performance while efficiently employing a minimum number of OOS language candidates.
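The selection idea in the abstract — cluster the OOS language space and keep one representative language per cluster — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the use of per-language mean vectors as cluster inputs, and the plain Lloyd's k-means are all assumptions; the actual system works on i-vectors from LRE-09 with a Gaussian back-end.

```python
import numpy as np

def select_oos_candidates(lang_vectors, k, n_iter=50, seed=0):
    """Pick k representative OOS candidate languages via k-means.

    lang_vectors: dict mapping language name -> mean vector (1-D array),
                  standing in for per-language mean i-vectors.
    Returns up to k language names, each the closest language to one
    cluster centroid (hypothetical helper, not from the paper).
    """
    names = list(lang_vectors)
    X = np.stack([lang_vectors[n] for n in names]).astype(float)
    rng = np.random.default_rng(seed)

    # Initialize centroids from k distinct languages (Lloyd's algorithm).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each language to its nearest centroid.
        dists = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned languages.
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)

    # For each centroid, the single closest language becomes a candidate.
    dists = np.linalg.norm(X[:, None] - centroids[None], axis=2)
    return sorted({names[i] for i in dists.argmin(axis=0)})
```

Under this sketch, training an OOS model on only the returned candidates, rather than on all available OOS languages, is what yields the reduction in training diversity the abstract reports.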