{"title":"In-domain Adaptation Solutions for the RTVE 2018 Diarization Challenge","authors":"I. Viñals, Pablo Gimeno, A. Ortega, A. Miguel, Eduardo Lleida Solano","doi":"10.21437/IBERSPEECH.2018-45","DOIUrl":"https://doi.org/10.21437/IBERSPEECH.2018-45","url":null,"abstract":"This paper addresses domain-mismatch scenarios in the diarization task. This research has been carried out in the context of the Radio Televisión Española (RTVE) 2018 Challenge at IberSpeech 2018. This evaluation seeks to improve diarization in broadcast corpora, known to contain multiple unknown speakers. These speakers contribute across different scenarios, genres, media and languages. The evaluation offers two conditions: a closed condition with restrictions on the resources used to train and develop diarization systems, and an open condition without restrictions to showcase the latest improvements in the state of the art. Our proposal centers on the closed condition, especially dealing with two important mismatches: media and language. The ViVoLab system for the challenge is based on the i-vector PLDA framework: i-vectors are extracted from the input audio according to a given segmentation, assuming that each segment represents one speaker intervention. The diarization hypotheses are obtained by clustering the estimated i-vectors with a fully Bayesian PLDA, a generative model whose latent variables are the speaker labels. The number of speakers is decided by comparing multiple hypotheses according to the Evidence Lower Bound (ELBO) provided by the PLDA, penalized in terms of the number of hypothesized speakers to compensate for different modeling capabilities.","PeriodicalId":115963,"journal":{"name":"IberSPEECH Conference","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121124916","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
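The penalized-ELBO model selection described in the abstract can be sketched as follows; the additive per-speaker penalty and the numbers are illustrative assumptions, not the ViVoLab implementation:

```python
# Sketch: choose the number of speakers by penalizing each hypothesis'
# ELBO with a cost per hypothesized speaker (illustrative only).

def select_hypothesis(hypotheses, penalty=2.0):
    """hypotheses: list of (num_speakers, elbo) pairs; returns the pair
    maximizing the penalized ELBO."""
    return max(hypotheses, key=lambda h: h[1] - penalty * h[0])

hyps = [(1, -120.0), (2, -100.0), (3, -99.0), (4, -98.5)]
# Penalized scores: -122, -104, -105, -106.5 -> the 2-speaker hypothesis wins
print(select_hypothesis(hyps))  # (2, -100.0)
```

Without the penalty, the raw ELBO here would keep growing with the number of speakers; the per-speaker cost is what lets the comparison settle on a plausible count.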
{"title":"ODESSA at Albayzin Speaker Diarization Challenge 2018","authors":"Jose Patino, H. Delgado, Ruiqing Yin, H. Bredin, C. Barras, N. Evans","doi":"10.21437/IBERSPEECH.2018-43","DOIUrl":"https://doi.org/10.21437/IBERSPEECH.2018-43","url":null,"abstract":"This paper describes the ODESSA submissions to the Albayzin Speaker Diarization Challenge 2018. The challenge addresses the diarization of TV shows. This work explores three different techniques to represent speech segments, namely binary key, x-vector and triplet-loss based embeddings. While training-free methods such as the binary key technique can be applied easily to a scenario where training data is limited, the training of robust neural-embedding extractors is considerably more challenging. However, when training data is plentiful (open-set condition), neural embeddings provide more robust segmentations, giving speaker representations which lead to better diarization performance. The paper also reports our efforts to improve speaker diarization performance through system combination. For systems with a common temporal resolution, fusion is performed at segment level during clustering. When the systems under fusion produce segmentations with an arbitrary resolution, they are combined at solution level. Both approaches to fusion are shown to improve diarization performance.","PeriodicalId":115963,"journal":{"name":"IberSPEECH Conference","volume":"100 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126859299","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
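Segment-level fusion for systems sharing a common temporal resolution can be illustrated by combining per-pair similarity matrices before clustering; the matrices and equal weighting below are illustrative assumptions, not the ODESSA implementation:

```python
import numpy as np

# Sketch: fuse two systems that share the same segmentation by taking a
# weighted average of their segment-pair similarity matrices, then
# clustering on the fused matrix.

def fuse_similarities(sims, weights=None):
    sims = np.stack(sims)                      # (n_systems, n_seg, n_seg)
    if weights is None:
        weights = np.ones(len(sims)) / len(sims)
    return np.tensordot(weights, sims, axes=1)  # (n_seg, n_seg)

a = np.array([[1.0, 0.8], [0.8, 1.0]])  # e.g. binary-key similarities
b = np.array([[1.0, 0.4], [0.4, 1.0]])  # e.g. x-vector similarities
print(fuse_similarities([a, b]))        # off-diagonal becomes 0.6
```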
{"title":"LSTM based voice conversion for laryngectomees","authors":"Luis Serrano, David Tavarez, X. Sarasola, Sneha Raman, I. Saratxaga, E. Navas, I. Hernáez","doi":"10.21437/IberSPEECH.2018-26","DOIUrl":"https://doi.org/10.21437/IberSPEECH.2018-26","url":null,"abstract":"This work has been partially funded by the Spanish Ministry of Economy and Competitiveness with FEDER support (RESTORE project, TEC2015-67163-C2-1-R), the Basque Government (BerbaOla project, KK-2018/00014) and by the European Union's H2020 research and innovation programme under the Marie Curie European Training Network ENRICH (675324).","PeriodicalId":115963,"journal":{"name":"IberSPEECH Conference","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122108410","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On the use of Phone-based Embeddings for Language Recognition","authors":"Christian Salamea, R. Córdoba, L. F. D’Haro, Rubén San-Segundo-Hernández, J. Ferreiros","doi":"10.21437/IBERSPEECH.2018-12","DOIUrl":"https://doi.org/10.21437/IBERSPEECH.2018-12","url":null,"abstract":"Language Identification (LID) can be defined as the process of automatically identifying the language of a given spoken utterance. We have focused on a phonotactic approach in which the system input is the phoneme sequence generated by a speech recognizer (ASR), but instead of phonemes we use phonetic units that contain context information, the so-called “phone-gram sequences”. In this context, we propose the use of Neural Embeddings (NEs) as features for those phone-gram sequences, which are used as inputs to a classical i-vector framework to train a multi-class logistic classifier. These NEs incorporate information from the neighbouring phone-grams in the sequence and implicitly model longer-context information. The NEs have been trained using both a Skip-gram and a GloVe model. Experiments have been carried out on the KALAKA-3 database, using Cavg as the metric to compare systems. As a baseline we take the Cavg obtained using the NEs as features in the LID task, 24.7%. Our strategy of incorporating information from the neighbouring phone-grams to define the final sequences yields up to a 24.3% relative improvement over the baseline with the Skip-gram model and up to 32.4% with the GloVe model. Finally, the fusion of our best system with an MFCC-based acoustic i-vector system provides up to a 34.1% improvement over the acoustic system alone.","PeriodicalId":115963,"journal":{"name":"IberSPEECH Conference","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130664023","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
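The notion of phone-gram units — phonetic units carrying neighbouring context, built from an ASR phoneme sequence — can be sketched as follows; the joining convention is an assumption for illustration:

```python
# Sketch: turn a recognizer's phoneme sequence into overlapping
# "phone-gram" units of n consecutive phonemes (the separator is an
# illustrative choice, not the paper's exact convention).

def phone_grams(phonemes, n=2):
    return ["_".join(phonemes[i:i + n]) for i in range(len(phonemes) - n + 1)]

seq = ["k", "a", "s", "a"]
print(phone_grams(seq, n=2))  # ['k_a', 'a_s', 's_a']
```

Sequences of such units can then be fed to a Skip-gram or GloVe trainer exactly as word sequences would be, which is what makes the embedding machinery applicable.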
{"title":"EML Submission to Albayzin 2018 Speaker Diarization Challenge","authors":"O. Ghahabi, V. Fischer","doi":"10.21437/iberspeech.2018-44","DOIUrl":"https://doi.org/10.21437/iberspeech.2018-44","url":null,"abstract":"Speaker diarization, determining who is speaking when, is one of the most challenging tasks in speaker recognition, as usually no prior information is available about the identity or the number of speakers in an audio recording. The task becomes more challenging when there is noise or music in the background and the speakers change more frequently, as usually happens in broadcast news conversations. In this paper, we present the EML speaker diarization system as our submission to the recent Albayzin evaluation challenge. The EML system uses a robust real-time algorithm that makes a decision about the identity of the speakers approximately every 2 seconds. Experimental results on about 16 hours of development data provided in the challenge show reasonable accuracy with a very low computational cost.","PeriodicalId":115963,"journal":{"name":"IberSPEECH Conference","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121587157","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
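An online decision rule in the spirit described — one decision per incoming ~2-second segment — might look like the following sketch; the cosine threshold and centroid representation are assumptions, not the EML system's internals:

```python
import numpy as np

# Sketch: each new segment embedding is compared against running speaker
# centroids by cosine similarity; below the threshold a new speaker is
# opened. All parameters here are illustrative.

def assign(segment, centroids, threshold=0.8):
    segment = np.asarray(segment, dtype=float)
    if centroids:
        sims = [segment @ c / (np.linalg.norm(segment) * np.linalg.norm(c))
                for c in centroids]
        best = int(np.argmax(sims))
        if sims[best] >= threshold:
            return best                     # existing speaker
    centroids.append(segment)
    return len(centroids) - 1               # new speaker index

centroids = []
print(assign([1.0, 0.0], centroids))   # 0 (first speaker)
print(assign([0.99, 0.1], centroids))  # 0 (close to first centroid)
print(assign([0.0, 1.0], centroids))   # 1 (dissimilar -> new speaker)
```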
{"title":"Emotion Detection from Speech and Text","authors":"Mikel de Velasco, R. Justo, J. Antón, Mikel Carrilero, M. Inés Torres","doi":"10.21437/IberSPEECH.2018-15","DOIUrl":"https://doi.org/10.21437/IberSPEECH.2018-15","url":null,"abstract":"This work has been partially funded by the Spanish Government (TIN2014-54288-C4-4-R and TIN2017-85854-C4-3-R) and by the European Commission H2020 SC1-PM15 programme under RIA grant 769872.","PeriodicalId":115963,"journal":{"name":"IberSPEECH Conference","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127668357","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RESTORE Project: REpair, STOrage and REhabilitation of speech","authors":"I. Hernáez, E. Navas, J. Martín, J. Suárez","doi":"10.21437/IBERSPEECH.2018-34","DOIUrl":"https://doi.org/10.21437/IBERSPEECH.2018-34","url":null,"abstract":"This project has been funded by the Spanish Ministry of Economy and Competitiveness with FEDER support (RESTORE project, TEC2015-67163-C2-1-R and TEC2015-67163-C2-2-R).","PeriodicalId":115963,"journal":{"name":"IberSPEECH Conference","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127675111","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Performance evaluation of front- and back-end techniques for ASV spoofing detection systems based on deep features","authors":"A. Alanís, A. Peinado, José Andrés González López, A. Gómez","doi":"10.21437/IBERSPEECH.2018-10","DOIUrl":"https://doi.org/10.21437/IBERSPEECH.2018-10","url":null,"abstract":"As Automatic Speaker Verification (ASV) becomes more popular, so do the techniques impostors can use to gain illegitimate access to speech-based biometric systems. For instance, impostors can use Text-to-Speech (TTS) and Voice Conversion (VC) techniques to generate speech acoustics resembling the voice of a genuine user and, hence, gain fraudulent access to the system. To prevent this, a number of anti-spoofing countermeasures have been developed for detecting these high-technology attacks. However, the detection of previously unforeseen spoofing attacks remains challenging. To address this issue, in this work we perform an extensive empirical investigation of the speech features and back-end classifiers providing the best overall performance for an anti-spoofing system based on a deep learning framework. In this architecture, a deep neural network is used to extract a single identity spoofing vector per utterance from the speech features. Then, the extracted vectors are passed to a classifier in order to make the final detection decision. Experimental evaluation is carried out on the standard ASVspoof 2015 corpus. The results show that classical FBANK features and Linear Discriminant Analysis (LDA) obtain the best performance for the proposed system.","PeriodicalId":115963,"journal":{"name":"IberSPEECH Conference","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128152698","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
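A minimal two-class LDA back-end of the kind evaluated can be sketched in closed form (project utterance vectors onto w = S_w⁻¹(μ₁ − μ₀) and threshold at the midpoint of the projected means); the synthetic data and the small regularization constant are assumptions:

```python
import numpy as np

# Sketch: closed-form Fisher LDA over per-utterance deep-feature vectors.
# Genuine vs. spoofed classes are simulated with synthetic 2-D data.

def fit_lda(X0, X1):
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    Sw = np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False)
    # Small ridge term keeps Sw invertible for degenerate samples.
    w = np.linalg.solve(Sw + 1e-6 * np.eye(len(mu0)), mu1 - mu0)
    b = -0.5 * w @ (mu0 + mu1)        # threshold at projected midpoint
    return w, b

rng = np.random.default_rng(0)
genuine = rng.normal([0.0, 0.0], 0.3, size=(50, 2))
spoofed = rng.normal([2.0, 2.0], 0.3, size=(50, 2))
w, b = fit_lda(genuine, spoofed)
score = lambda x: x @ w + b           # > 0 -> spoofed side of the boundary
print(score(np.array([2.0, 2.0])) > 0, score(np.array([0.0, 0.0])) > 0)
```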
{"title":"Wide Residual Networks 1D for Automatic Text Punctuation","authors":"Jorge Llombart, A. Miguel, A. Ortega, Eduardo Lleida Solano","doi":"10.21437/IBERSPEECH.2018-62","DOIUrl":"https://doi.org/10.21437/IBERSPEECH.2018-62","url":null,"abstract":"Documentation and analysis of multimedia resources usually requires a large pipeline with many stages. It is common to obtain texts without punctuation at some point, although later steps, such as those related to natural language processing, may need accurate punctuation. This paper focuses on the task of recovering pause punctuation from text without prosodic or acoustic information. We propose the use of Wide Residual Networks to predict which words should be followed by a comma or full stop in a text whose punctuation has been removed. Wide Residual Networks are a well-known technique in image processing, but they are not commonly used in other areas such as speech or natural language processing. We choose them because they show great stability and the ability to model long and short contextual dependencies in deep structures. Unlike in image processing, we use 1-dimensional convolutions because in text processing we only focus on the temporal dimension. Moreover, this architecture allows us to work with both past and future context. This paper compares this architecture with the Long Short-Term Memory (LSTM) cells commonly used in this task, and also combines the two architectures to obtain better results than either achieves separately.","PeriodicalId":115963,"journal":{"name":"IberSPEECH Conference","volume":"81 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126012539","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
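A 1-dimensional residual block over a word-embedding sequence — the building unit of such a network — can be sketched as follows; the same-padded convolution is what gives each position both past and future context, and the widths and kernel size are illustrative:

```python
import numpy as np

# Sketch: a 1-D residual block (two same-padded temporal convolutions
# plus an identity shortcut). Shapes only; weights are random.

def conv1d_same(x, w):
    # x: (time, channels_in), w: (kernel, channels_in, channels_out)
    k = w.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    return np.stack([np.tensordot(xp[t:t + k], w, axes=([0, 1], [0, 1]))
                     for t in range(x.shape[0])])

def residual_block(x, w1, w2):
    h = np.maximum(conv1d_same(x, w1), 0)   # ReLU non-linearity
    return x + conv1d_same(h, w2)           # identity shortcut

T, C = 10, 8                                # sequence length, channels
x = np.random.randn(T, C)
w1 = np.random.randn(3, C, C) * 0.1         # kernel 3: one word each side
w2 = np.random.randn(3, C, C) * 0.1
print(residual_block(x, w1, w2).shape)      # (10, 8) -- length preserved
```

Because the temporal length is preserved, a per-word punctuation decision (none, comma, full stop) can be read off position by position at the output.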
{"title":"Restricted Boltzmann Machine Vectors for Speaker Clustering","authors":"Muhammad Umair Ahmed Khan, Pooyan Safari, J. Hernando","doi":"10.21437/IBERSPEECH.2018-3","DOIUrl":"https://doi.org/10.21437/IBERSPEECH.2018-3","url":null,"abstract":"Restricted Boltzmann Machines (RBMs) have been used in both the front-end and back-end of speaker verification systems. In this work, we apply RBMs as a front-end in the context of speaker clustering. Speakers' utterances are transformed into a vector representation by means of RBMs. These vectors, referred to as RBM vectors, have been shown to preserve speaker-specific information and are used for the task of speaker clustering. In this work, we perform traditional bottom-up Agglomerative Hierarchical Clustering (AHC). Using the RBM vector representation of speakers, the performance of speaker clustering is improved. The evaluation has been performed on audio recordings of Catalan TV broadcast shows. The experimental results show that our proposed system outperforms the baseline i-vector system in terms of Equal Impurity (EI). Using cosine scoring, relative improvements of 11% and 12% are achieved for the average and single linkage clustering algorithms, respectively. Using PLDA scoring, the RBM vectors achieve a relative improvement of 11% compared to i-vectors for the single linkage algorithm.","PeriodicalId":115963,"journal":{"name":"IberSPEECH Conference","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131172353","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
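Bottom-up AHC with cosine scoring and single linkage, as applied to the per-utterance vectors, can be sketched as follows; the stopping threshold is an assumption, and the RBM vector extraction itself is not reproduced:

```python
import numpy as np

# Sketch: agglomerative hierarchical clustering with cosine similarity.
# Single linkage: the similarity of two clusters is their best pair.

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def ahc_single_linkage(vectors, threshold=0.9):
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = max(cosine(vectors[a], vectors[b])
                        for a in clusters[i] for b in clusters[j])
                if s > best:
                    best, pair = s, (i, j)
        if best < threshold:     # no pair similar enough -> stop merging
            break
        i, j = pair
        clusters[i] += clusters.pop(j)
    return clusters

vecs = [np.array([1.0, 0.0]), np.array([0.99, 0.05]), np.array([0.0, 1.0])]
print(ahc_single_linkage(vecs))  # [[0, 1], [2]] -- first two utterances merge
```

Swapping `max` for `np.mean` over the cross-cluster pairs would give the average-linkage variant the abstract also reports.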