IberSPEECH Conference: Latest Publications

In-domain Adaptation Solutions for the RTVE 2018 Diarization Challenge
IberSPEECH Conference Pub Date: 2018-11-21 DOI: 10.21437/IBERSPEECH.2018-45
I. Viñals, Pablo Gimeno, A. Ortega, A. Miguel, Eduardo Lleida Solano
Abstract: This paper addresses domain-mismatch scenarios in the diarization task. The research was carried out in the context of the Radio Televisión Española (RTVE) 2018 Challenge at IberSPEECH 2018. This evaluation seeks to improve diarization on broadcast corpora, which are known to contain multiple unknown speakers appearing across different scenarios, genres, media and languages. The evaluation offers two conditions: a closed one with restrictions on the resources used to train and develop diarization systems, and an open condition without restrictions, meant to showcase the latest state-of-the-art improvements. Our proposal is centered on the closed condition, especially dealing with two important mismatches: media and language. The ViVoLab system for the challenge is based on the i-vector PLDA framework: i-vectors are extracted from the input audio according to a given segmentation, assuming that each segment represents one speaker intervention. The diarization hypotheses are obtained by clustering the estimated i-vectors with a fully Bayesian PLDA, a generative model whose latent variables act as speaker labels. The number of speakers is decided by comparing multiple hypotheses according to the Evidence Lower Bound (ELBO) provided by the PLDA, penalized by the number of hypothesized speakers to compensate for their different modeling capabilities.
Citations: 8
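The speaker-count selection described above reduces to maximizing a penalized ELBO over candidate clusterings. A minimal sketch of that rule, assuming precomputed per-hypothesis ELBO values and a simple linear penalty (the paper's exact penalty form is not given in the abstract):

```python
import numpy as np

def select_hypothesis(hypotheses, elbos, penalty=1.0):
    """Pick the clustering hypothesis with the best penalized ELBO.

    hypotheses: list of cluster-label arrays (one per candidate speaker count)
    elbos: ELBO value the PLDA assigns to each hypothesis
    penalty: weight on the number of distinct speakers (assumed linear form)
    """
    scores = []
    for labels, elbo in zip(hypotheses, elbos):
        n_speakers = len(set(labels))
        # Penalize richer models: more speakers means more modeling capability
        scores.append(elbo - penalty * n_speakers)
    return hypotheses[int(np.argmax(scores))]

# Toy usage: three hypotheses with 2, 3 and 4 speakers over 4 segments
hyps = [np.array([0, 0, 1, 1]), np.array([0, 1, 2, 2]), np.array([0, 1, 2, 3])]
elbos = [-120.4, -118.9, -118.5]
print(select_hypothesis(hyps, elbos))  # the 3-speaker hypothesis wins
```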
ODESSA at Albayzin Speaker Diarization Challenge 2018
IberSPEECH Conference Pub Date: 2018-11-21 DOI: 10.21437/IBERSPEECH.2018-43
Jose Patino, H. Delgado, Ruiqing Yin, H. Bredin, C. Barras, N. Evans
Abstract: This paper describes the ODESSA submissions to the Albayzin Speaker Diarization Challenge 2018, which addresses the diarization of TV shows. The work explores three different techniques to represent speech segments, namely binary key, x-vector and triplet-loss based embeddings. While training-free methods such as the binary key technique can be applied easily to a scenario where training data is limited, the training of robust neural-embedding extractors is considerably more challenging. However, when training data is plentiful (open-set condition), neural embeddings provide more robust segmentations, giving speaker representations which lead to better diarization performance. The paper also reports our efforts to improve speaker diarization through system combination. For systems with a common temporal resolution, fusion is performed at segment level during clustering. When the systems under fusion produce segmentations with arbitrary resolutions, they are combined at solution level. Both approaches to fusion are shown to improve diarization performance.
Citations: 7
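For the segment-level fusion described above, one plausible realization is to average per-system segment-similarity matrices before clustering. A sketch under that assumption (the equal weights, average linkage and similarity source are illustrative choices, not ODESSA's exact recipe):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def fuse_and_cluster(similarity_matrices, n_speakers, weights=None):
    """Average symmetric per-system similarity matrices (common temporal
    resolution assumed), then run agglomerative clustering on the result."""
    if weights is None:
        weights = [1.0 / len(similarity_matrices)] * len(similarity_matrices)
    fused = sum(w * np.asarray(s, dtype=float)
                for w, s in zip(weights, similarity_matrices))
    dist = 1.0 - fused                 # turn similarities into distances
    np.fill_diagonal(dist, 0.0)
    condensed = squareform(dist, checks=False)
    tree = linkage(condensed, method="average")
    return fcluster(tree, t=n_speakers, criterion="maxclust")

# Toy usage: two systems scoring the same 4 segments, fused with equal weights
s1 = np.array([[1, .9, .1, .2], [.9, 1, .2, .1], [.1, .2, 1, .8], [.2, .1, .8, 1]])
s2 = np.array([[1, .8, .2, .3], [.8, 1, .3, .2], [.2, .3, 1, .9], [.3, .2, .9, 1]])
print(fuse_and_cluster([s1, s2], n_speakers=2))  # e.g. [1 1 2 2]
```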
LSTM based voice conversion for laryngectomees
IberSPEECH Conference Pub Date: 2018-11-21 DOI: 10.21437/IberSPEECH.2018-26
Luis Serrano, David Tavarez, X. Sarasola, Sneha Raman, I. Saratxaga, E. Navas, I. Hernáez
Abstract: This work has been partially funded by the Spanish Ministry of Economy and Competitiveness with FEDER support (RESTORE project, TEC2015-67163-C2-1-R), the Basque Government (BerbaOla project, KK-2018/00014) and the European Union's H2020 research and innovation programme under the Marie Curie European Training Network ENRICH (675324).
Citations: 9
On the use of Phone-based Embeddings for Language Recognition
IberSPEECH Conference Pub Date: 2018-11-21 DOI: 10.21437/IBERSPEECH.2018-12
Christian Salamea, R. Córdoba, L. F. D'Haro, Rubén San-Segundo-Hernández, J. Ferreiros
Abstract: Language Identification (LID) can be defined as the process of automatically identifying the language of a given spoken utterance. We have focused on a phonotactic approach in which the system input is the phoneme sequence generated by a speech recognizer (ASR); instead of phonemes, however, we use phonetic units that contain context information, the so-called "phone-gram sequences". In this context, we propose the use of Neural Embeddings (NEs) as features for those phone-gram sequences, which are used as entries in a classical i-vector framework to train a multiclass logistic classifier. These NEs incorporate information from the neighbouring phone-grams in the sequence and implicitly model longer-context information. The NEs have been trained using both a Skip-Gram and a GloVe model. Experiments were carried out on the KALAKA-3 database, using Cavg as the metric to compare systems. As baseline we take the Cavg of 24.7% obtained using the NEs as features in the LID task. Our strategy of incorporating information from the neighbouring phone-grams to define the final sequences yields up to 24.3% relative improvement over the baseline with the Skip-Gram model and up to 32.4% with the GloVe model. Finally, the fusion of our best system with an MFCC-based acoustic i-vector system provides up to 34.1% improvement over the acoustic system alone.
Citations: 2
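The neural embeddings above are trained over phone-gram sequences in the same way word embeddings are trained over sentences. A minimal sketch using gensim's skip-gram Word2Vec (the phone-gram tokens, dimensionality and window are made-up stand-ins; the paper also trains GloVe vectors, which gensim does not implement):

```python
from gensim.models import Word2Vec  # gensim >= 4.x

# Each utterance is a sequence of phone-grams (context-dependent phonetic
# units) produced by the phone recognizer; toy data for illustration.
utterances = [
    ["sil_a_b", "a_b_c", "b_c_a", "c_a_sil"],
    ["sil_b_a", "b_a_c", "a_c_b", "c_b_sil"],
]

# Skip-gram (sg=1) embeddings learn from neighbouring phone-grams,
# implicitly capturing longer-context information.
model = Word2Vec(sentences=utterances, vector_size=50, window=3,
                 sg=1, min_count=1, epochs=20)

vec = model.wv["a_b_c"]  # one 50-dim neural embedding per phone-gram
print(vec.shape)         # (50,)
```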
EML Submission to Albayzin 2018 Speaker Diarization Challenge
IberSPEECH Conference Pub Date: 2018-11-21 DOI: 10.21437/iberspeech.2018-44
O. Ghahabi, V. Fischer
Abstract: Speaker diarization, determining who is speaking when, is one of the most challenging tasks in speaker recognition, as usually no prior information is available about the identity or the number of speakers in an audio recording. The task becomes more challenging when there is noise or music in the background and speakers change more frequently, as typically happens in broadcast news conversations. In this paper, we use the EML speaker diarization system as our submission to the recent Albayzin evaluation challenge. The EML system uses a real-time robust algorithm that makes a decision about the identity of the speakers approximately every 2 seconds. Experimental results on about 16 hours of the development data provided in the challenge show reasonable accuracy with a very low computational cost.
Citations: 4
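The EML algorithm itself is not described beyond making a robust speaker decision roughly every 2 seconds, so the following is only a generic online-labeling sketch under that constraint: each 2-second window embedding is matched against running speaker centroids by cosine similarity (the embeddings, threshold and update rule are all assumptions, not the EML system's actual algorithm):

```python
import numpy as np

def online_labels(window_embeddings, threshold=0.6):
    """Assign a speaker label to each ~2 s window embedding in stream order.

    A window joins the closest existing speaker centroid if its cosine
    similarity exceeds `threshold`; otherwise it opens a new speaker.
    """
    centroids, counts, labels = [], [], []
    for e in window_embeddings:
        e = e / np.linalg.norm(e)
        sims = [float(c @ e) for c in centroids]
        if sims and max(sims) > threshold:
            k = int(np.argmax(sims))
            # Running-mean update keeps the cost per decision constant
            centroids[k] = (centroids[k] * counts[k] + e) / (counts[k] + 1)
            centroids[k] /= np.linalg.norm(centroids[k])
            counts[k] += 1
        else:
            centroids.append(e)
            counts.append(1)
            k = len(centroids) - 1
        labels.append(k)
    return labels
```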
Emotion Detection from Speech and Text
IberSPEECH Conference Pub Date: 2018-11-21 DOI: 10.21437/IberSPEECH.2018-15
Mikel de Velasco, R. Justo, J. Antón, Mikel Carrilero, M. Inés Torres
Abstract: This work has been partially funded by the Spanish Government (TIN2014-54288-C4-4-R and TIN2017-85854-C4-3-R) and by the European Commission H2020 SC1-PM15 programme under RIA grant 769872.
Citations: 11
RESTORE Project: REpair, STOrage and REhabilitation of speech
IberSPEECH Conference Pub Date: 2018-11-21 DOI: 10.21437/IBERSPEECH.2018-34
I. Hernáez, E. Navas, J. Martín, J. Suárez
Abstract: This project has been funded by the Spanish Ministry of Economy and Competitiveness with FEDER support (RESTORE project, TEC2015-67163-C2-1-R and TEC2015-67163-C2-2-R).
Citations: 0
Performance evaluation of front- and back-end techniques for ASV spoofing detection systems based on deep features
IberSPEECH Conference Pub Date: 2018-11-21 DOI: 10.21437/IBERSPEECH.2018-10
A. Alanís, A. Peinado, José Andrés González López, A. Gómez
Abstract: As Automatic Speaker Verification (ASV) becomes more popular, so do the techniques impostors use to gain illegal access to speech-based biometric systems. For instance, impostors can use Text-to-Speech (TTS) and Voice Conversion (VC) techniques to generate speech acoustics resembling the voice of a genuine user and hence gain fraudulent access to the system. To prevent this, a number of anti-spoofing countermeasures have been developed for detecting these high-technology attacks. However, the detection of previously unforeseen spoofing attacks remains challenging. To address this issue, we perform an extensive empirical investigation of the speech features and back-end classifiers providing the best overall performance for an anti-spoofing system based on a deep learning framework. In this architecture, a deep neural network is used to extract a single identity spoofing vector per utterance from the speech features. The extracted vectors are then passed to a classifier to make the final detection decision. Experimental evaluation is carried out on the standard ASVspoof 2015 corpus. The results show that classical FBANK features and Linear Discriminant Analysis (LDA) obtain the best performance for the proposed system.
Citations: 12
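The back-end the paper singles out, LDA over per-utterance deep feature vectors, maps directly onto a standard scikit-learn pipeline. A sketch with random stand-in vectors (the dimensionality, labels and data are fabricated for illustration only):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Stand-in data: one deep "identity spoofing vector" per utterance, as the
# abstract describes; here random 256-dim vectors with genuine/spoof labels.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 256))
y_train = rng.integers(0, 2, size=200)   # 0 = genuine, 1 = spoof
X_test = rng.normal(size=(10, 256))

# LDA back-end, the best-performing classifier reported in the abstract
clf = LinearDiscriminantAnalysis()
clf.fit(X_train, y_train)
scores = clf.decision_function(X_test)   # signed detection scores
decisions = clf.predict(X_test)          # final genuine/spoof decisions
print(scores.shape, decisions)
```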
Wide Residual Networks 1D for Automatic Text Punctuation
IberSPEECH Conference Pub Date: 2018-11-21 DOI: 10.21437/IBERSPEECH.2018-62
Jorge Llombart, A. Miguel, A. Ortega, Eduardo Lleida Solano
Abstract: Documentation and analysis of multimedia resources usually requires a large pipeline with many stages. It is common to obtain texts without punctuation at some point, although later steps, such as those related to natural language processing, might need accurate punctuation. This paper focuses on the task of recovering pause punctuation from text without prosodic or acoustic information. We propose the use of Wide Residual Networks to predict which words in a text with removed punctuation should be followed by a comma or full stop. Wide Residual Networks are a well-known technique in image processing, but they are not commonly used in other areas such as speech or natural language processing. We propose them because they show great stability and the ability to model long and short contextual dependencies in deep structures. Unlike in image processing, we use 1-dimensional convolutions, because in text processing we only focus on the temporal dimension. Moreover, this architecture allows us to work with both past and future context. This paper compares the architecture with the Long Short-Term Memory cells commonly used in this task, and also combines the two architectures, obtaining better results than either of them separately.
Citations: 4
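As a rough illustration of a 1-D residual block that convolves only along the temporal (word) dimension, here is a PyTorch sketch; the widening placement, kernel size and pre-activation layout are assumptions, since the abstract does not give the exact architecture:

```python
import torch
import torch.nn as nn

class WideResBlock1D(nn.Module):
    """One 1-D wide-residual block: two temporal convolutions with a skip
    connection, widened internally by factor k."""
    def __init__(self, channels, k=4, kernel_size=5):
        super().__init__()
        wide = channels * k
        pad = kernel_size // 2  # same-length output keeps word alignment
        self.body = nn.Sequential(
            nn.BatchNorm1d(channels), nn.ReLU(),
            nn.Conv1d(channels, wide, kernel_size, padding=pad),
            nn.BatchNorm1d(wide), nn.ReLU(),
            nn.Conv1d(wide, channels, kernel_size, padding=pad),
        )

    def forward(self, x):           # x: (batch, channels, words)
        return x + self.body(x)     # residual sum over the temporal axis

# Toy usage: 300-dim word embeddings for a 20-word sentence, batch of 2
x = torch.randn(2, 300, 20)
y = WideResBlock1D(300, k=2)(x)
print(y.shape)  # torch.Size([2, 300, 20])
```

The symmetric padding is what lets the block see both past and future words, matching the abstract's point about bidirectional context.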
Restricted Boltzmann Machine Vectors for Speaker Clustering
IberSPEECH Conference Pub Date: 2018-11-21 DOI: 10.21437/IBERSPEECH.2018-3
Muhammad Umair Ahmed Khan, Pooyan Safari, J. Hernando
Abstract: Restricted Boltzmann Machines (RBMs) have been used both in the front-end and the back-end of speaker verification systems. In this work, we apply RBMs as a front-end in the context of speaker clustering. Speakers' utterances are transformed into a vector representation by means of RBMs. These vectors, referred to as RBM vectors, have been shown to preserve speaker-specific information and are used for the task of speaker clustering. We perform traditional bottom-up Agglomerative Hierarchical Clustering (AHC). Using the RBM vector representation of speakers, the performance of speaker clustering is improved. The evaluation has been performed on audio recordings of Catalan TV broadcast shows. The experimental results show that our proposed system outperforms the baseline i-vector system in terms of Equal Impurity (EI). Using cosine scoring, relative improvements of 11% and 12% are achieved for the average and single linkage clustering algorithms respectively. Using PLDA scoring, the RBM vectors achieve a relative improvement of 11% over i-vectors for the single linkage algorithm.
Citations: 5
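The clustering stage above, bottom-up AHC with cosine scoring under single or average linkage, is standard. A minimal sketch over per-utterance speaker vectors, with toy data standing in for actual RBM vectors:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def cluster_speaker_vectors(vectors, n_clusters, method="single"):
    """Bottom-up AHC over per-utterance speaker vectors with cosine scoring;
    method may be "single" or "average", the two linkages used in the paper."""
    tree = linkage(np.asarray(vectors, dtype=float),
                   method=method, metric="cosine")
    return fcluster(tree, t=n_clusters, criterion="maxclust")

# Toy usage: 6 utterance vectors drawn around 2 simulated speaker centers
rng = np.random.default_rng(1)
centers = rng.normal(size=(2, 32))
vecs = centers[[0, 0, 1, 1, 0, 1]] + 0.05 * rng.normal(size=(6, 32))
print(cluster_speaker_vectors(vecs, n_clusters=2))  # e.g. [1 1 2 2 1 2]
```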