IberSPEECH Conference: Latest Publications

In-domain Adaptation Solutions for the RTVE 2018 Diarization Challenge
IberSPEECH Conference Pub Date: 2018-11-21 DOI: 10.21437/IBERSPEECH.2018-45
I. Viñals, Pablo Gimeno, A. Ortega, A. Miguel, Eduardo Lleida Solano
Abstract: This paper addresses domain-mismatch scenarios in the diarization task. The research was carried out in the context of the Radio Televisión Española (RTVE) 2018 Challenge at IberSPEECH 2018. This evaluation seeks to improve diarization on broadcast corpora, which are known to contain multiple unknown speakers appearing across different scenarios, genres, media and languages. The evaluation offers two conditions: a closed one with restrictions on the resources used to train and develop diarization systems, and an open condition without restrictions, meant to showcase the latest state-of-the-art improvements. Our proposal is centered on the closed condition, especially dealing with two important mismatches: media and language. The ViVoLab system for the challenge is based on the i-vector PLDA framework: i-vectors are extracted from the input audio according to a given segmentation, assuming that each segment represents one speaker intervention. The diarization hypotheses are obtained by clustering the estimated i-vectors with a fully Bayesian PLDA, a generative model whose latent variables act as speaker labels. The number of speakers is decided by comparing multiple hypotheses according to the Evidence Lower Bound (ELBO) provided by the PLDA, penalized by the number of hypothesized speakers to compensate for their different modeling capabilities.
Citations: 8
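The speaker-count selection described above reduces to maximizing a penalized ELBO over candidate clusterings. A minimal sketch of that rule, assuming precomputed per-hypothesis ELBO values and a simple linear penalty (the paper's exact penalty form is not given in the abstract):

```python
import numpy as np

def select_hypothesis(hypotheses, elbos, penalty=1.0):
    """Pick the clustering hypothesis with the best penalized ELBO.

    hypotheses: list of cluster-label arrays (one per candidate speaker count)
    elbos: ELBO value the PLDA assigns to each hypothesis
    penalty: weight on the number of distinct speakers (assumed linear form)
    """
    scores = []
    for labels, elbo in zip(hypotheses, elbos):
        n_speakers = len(set(labels))
        # Penalize richer models: more speakers means more modeling capability
        scores.append(elbo - penalty * n_speakers)
    return hypotheses[int(np.argmax(scores))]

# Toy usage: three hypotheses with 2, 3 and 4 speakers over 4 segments
hyps = [np.array([0, 0, 1, 1]), np.array([0, 1, 2, 2]), np.array([0, 1, 2, 3])]
elbos = [-120.4, -118.9, -118.5]
print(select_hypothesis(hyps, elbos))  # the 3-speaker hypothesis wins
```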
ODESSA at Albayzin Speaker Diarization Challenge 2018
IberSPEECH Conference Pub Date: 2018-11-21 DOI: 10.21437/IBERSPEECH.2018-43
Jose Patino, H. Delgado, Ruiqing Yin, H. Bredin, C. Barras, N. Evans
Abstract: This paper describes the ODESSA submissions to the Albayzin Speaker Diarization Challenge 2018, which addresses the diarization of TV shows. The work explores three different techniques to represent speech segments, namely binary key, x-vector and triplet-loss based embeddings. While training-free methods such as the binary key technique can be applied easily to a scenario where training data is limited, the training of robust neural-embedding extractors is considerably more challenging. However, when training data is plentiful (open-set condition), neural embeddings provide more robust segmentations, giving speaker representations which lead to better diarization performance. The paper also reports our efforts to improve speaker diarization through system combination. For systems with a common temporal resolution, fusion is performed at segment level during clustering. When the systems under fusion produce segmentations with arbitrary resolutions, they are combined at solution level. Both approaches to fusion are shown to improve diarization performance.
Citations: 7
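For the segment-level fusion described above, one plausible realization is to average per-system segment-similarity matrices before clustering. A sketch under that assumption (the equal weights, average linkage and similarity source are illustrative choices, not ODESSA's exact recipe):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def fuse_and_cluster(similarity_matrices, n_speakers, weights=None):
    """Average symmetric per-system similarity matrices (common temporal
    resolution assumed), then run agglomerative clustering on the result."""
    if weights is None:
        weights = [1.0 / len(similarity_matrices)] * len(similarity_matrices)
    fused = sum(w * np.asarray(s, dtype=float)
                for w, s in zip(weights, similarity_matrices))
    dist = 1.0 - fused                 # turn similarities into distances
    np.fill_diagonal(dist, 0.0)
    condensed = squareform(dist, checks=False)
    tree = linkage(condensed, method="average")
    return fcluster(tree, t=n_speakers, criterion="maxclust")

# Toy usage: two systems scoring the same 4 segments, fused with equal weights
s1 = np.array([[1, .9, .1, .2], [.9, 1, .2, .1], [.1, .2, 1, .8], [.2, .1, .8, 1]])
s2 = np.array([[1, .8, .2, .3], [.8, 1, .3, .2], [.2, .3, 1, .9], [.3, .2, .9, 1]])
print(fuse_and_cluster([s1, s2], n_speakers=2))  # e.g. [1 1 2 2]
```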
LSTM based voice conversion for laryngectomees
IberSPEECH Conference Pub Date: 2018-11-21 DOI: 10.21437/IberSPEECH.2018-26
Luis Serrano, David Tavarez, X. Sarasola, Sneha Raman, I. Saratxaga, E. Navas, I. Hernáez
Abstract: This work has been partially funded by the Spanish Ministry of Economy and Competitiveness with FEDER support (RESTORE project, TEC2015-67163-C2-1-R), the Basque Government (BerbaOla project, KK-2018/00014) and the European Union's H2020 research and innovation programme under the Marie Curie European Training Network ENRICH (675324).
Citations: 9
On the use of Phone-based Embeddings for Language Recognition
IberSPEECH Conference Pub Date: 2018-11-21 DOI: 10.21437/IBERSPEECH.2018-12
Christian Salamea, R. Córdoba, L. F. D'Haro, Rubén San-Segundo-Hernández, J. Ferreiros
Abstract: Language Identification (LID) can be defined as the process of automatically identifying the language of a given spoken utterance. We have focused on a phonotactic approach in which the system input is the phoneme sequence generated by a speech recognizer (ASR); instead of phonemes, however, we use phonetic units that contain context information, the so-called "phone-gram sequences". In this context, we propose the use of Neural Embeddings (NEs) as features for those phone-gram sequences, which are used as entries in a classical i-vector framework to train a multiclass logistic classifier. These NEs incorporate information from the neighbouring phone-grams in the sequence and implicitly model longer-context information. The NEs have been trained using both a Skip-Gram and a GloVe model. Experiments were carried out on the KALAKA-3 database, using Cavg as the metric to compare systems. As baseline we take the Cavg of 24.7% obtained using the NEs as features in the LID task. Our strategy of incorporating information from the neighbouring phone-grams to define the final sequences yields up to 24.3% relative improvement over the baseline with the Skip-Gram model and up to 32.4% with the GloVe model. Finally, the fusion of our best system with an MFCC-based acoustic i-vector system provides up to 34.1% improvement over the acoustic system alone.
Citations: 2
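The neural embeddings above are trained over phone-gram sequences in the same way word embeddings are trained over sentences. A minimal sketch using gensim's skip-gram Word2Vec (the phone-gram tokens, dimensionality and window are made-up stand-ins; the paper also trains GloVe vectors, which gensim does not implement):

```python
from gensim.models import Word2Vec  # gensim >= 4.x

# Each utterance is a sequence of phone-grams (context-dependent phonetic
# units) produced by the phone recognizer; toy data for illustration.
utterances = [
    ["sil_a_b", "a_b_c", "b_c_a", "c_a_sil"],
    ["sil_b_a", "b_a_c", "a_c_b", "c_b_sil"],
]

# Skip-gram (sg=1) embeddings learn from neighbouring phone-grams,
# implicitly capturing longer-context information.
model = Word2Vec(sentences=utterances, vector_size=50, window=3,
                 sg=1, min_count=1, epochs=20)

vec = model.wv["a_b_c"]  # one 50-dim neural embedding per phone-gram
print(vec.shape)         # (50,)
```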
EML Submission to Albayzin 2018 Speaker Diarization Challenge
IberSPEECH Conference Pub Date: 2018-11-21 DOI: 10.21437/iberspeech.2018-44
O. Ghahabi, V. Fischer
Abstract: Speaker diarization, determining who is speaking when, is one of the most challenging tasks in speaker recognition, as usually no prior information is available about the identity or the number of speakers in an audio recording. The task becomes more challenging when there is noise or music in the background and speakers change more frequently, as typically happens in broadcast news conversations. In this paper, we use the EML speaker diarization system as our submission to the recent Albayzin evaluation challenge. The EML system uses a real-time robust algorithm that makes a decision about the identity of the speakers approximately every 2 seconds. Experimental results on about 16 hours of the development data provided in the challenge show reasonable accuracy with a very low computational cost.
Citations: 4
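The EML algorithm itself is not described beyond making a robust speaker decision roughly every 2 seconds, so the following is only a generic online-labeling sketch under that constraint: each 2-second window embedding is matched against running speaker centroids by cosine similarity (the embeddings, threshold and update rule are all assumptions, not the EML system's actual algorithm):

```python
import numpy as np

def online_labels(window_embeddings, threshold=0.6):
    """Assign a speaker label to each ~2 s window embedding in stream order.

    A window joins the closest existing speaker centroid if its cosine
    similarity exceeds `threshold`; otherwise it opens a new speaker.
    """
    centroids, counts, labels = [], [], []
    for e in window_embeddings:
        e = e / np.linalg.norm(e)
        sims = [float(c @ e) for c in centroids]
        if sims and max(sims) > threshold:
            k = int(np.argmax(sims))
            # Running-mean update keeps the cost per decision constant
            centroids[k] = (centroids[k] * counts[k] + e) / (counts[k] + 1)
            centroids[k] /= np.linalg.norm(centroids[k])
            counts[k] += 1
        else:
            centroids.append(e)
            counts.append(1)
            k = len(centroids) - 1
        labels.append(k)
    return labels
```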
Emotion Detection from Speech and Text
IberSPEECH Conference Pub Date: 2018-11-21 DOI: 10.21437/IberSPEECH.2018-15
Mikel de Velasco, R. Justo, J. Antón, Mikel Carrilero, M. Inés Torres
Abstract: This work has been partially funded by the Spanish Government (TIN2014-54288-C4-4-R and TIN2017-85854-C4-3-R) and by the European Commission H2020 SC1-PM15 programme under RIA grant 769872.
Citations: 11
RESTORE Project: REpair, STOrage and REhabilitation of speech
IberSPEECH Conference Pub Date: 2018-11-21 DOI: 10.21437/IBERSPEECH.2018-34
I. Hernáez, E. Navas, J. Martín, J. Suárez
Abstract: This project has been funded by the Spanish Ministry of Economy and Competitiveness with FEDER support (RESTORE project, TEC2015-67163-C2-1-R and TEC2015-67163-C2-2-R).
Citations: 0
Performance evaluation of front- and back-end techniques for ASV spoofing detection systems based on deep features
IberSPEECH Conference Pub Date: 2018-11-21 DOI: 10.21437/IBERSPEECH.2018-10
A. Alanís, A. Peinado, José Andrés González López, A. Gómez
Abstract: As Automatic Speaker Verification (ASV) becomes more popular, so do the techniques impostors use to gain illegal access to speech-based biometric systems. For instance, impostors can use Text-to-Speech (TTS) and Voice Conversion (VC) techniques to generate speech acoustics resembling the voice of a genuine user and hence gain fraudulent access to the system. To prevent this, a number of anti-spoofing countermeasures have been developed for detecting these high-technology attacks. However, the detection of previously unforeseen spoofing attacks remains challenging. To address this issue, we perform an extensive empirical investigation of the speech features and back-end classifiers providing the best overall performance for an anti-spoofing system based on a deep learning framework. In this architecture, a deep neural network is used to extract a single identity spoofing vector per utterance from the speech features. The extracted vectors are then passed to a classifier to make the final detection decision. Experimental evaluation is carried out on the standard ASVspoof 2015 corpus. The results show that classical FBANK features and Linear Discriminant Analysis (LDA) obtain the best performance for the proposed system.
Citations: 12
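The back-end the paper singles out, LDA over per-utterance deep feature vectors, maps directly onto a standard scikit-learn pipeline. A sketch with random stand-in vectors (the dimensionality, labels and data are fabricated for illustration only):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Stand-in data: one deep "identity spoofing vector" per utterance, as the
# abstract describes; here random 256-dim vectors with genuine/spoof labels.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 256))
y_train = rng.integers(0, 2, size=200)   # 0 = genuine, 1 = spoof
X_test = rng.normal(size=(10, 256))

# LDA back-end, the best-performing classifier reported in the abstract
clf = LinearDiscriminantAnalysis()
clf.fit(X_train, y_train)
scores = clf.decision_function(X_test)   # signed detection scores
decisions = clf.predict(X_test)          # final genuine/spoof decisions
print(scores.shape, decisions)
```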
Wide Residual Networks 1D for Automatic Text Punctuation
IberSPEECH Conference Pub Date: 2018-11-21 DOI: 10.21437/IBERSPEECH.2018-62
Jorge Llombart, A. Miguel, A. Ortega, Eduardo Lleida Solano
Abstract: Documentation and analysis of multimedia resources usually requires a large pipeline with many stages. It is common to obtain texts without punctuation at some point, although later steps, such as those related to natural language processing, might need accurate punctuation. This paper focuses on the task of recovering pause punctuation from text without prosodic or acoustic information. We propose the use of Wide Residual Networks to predict which words in a text with removed punctuation should be followed by a comma or full stop. Wide Residual Networks are a well-known technique in image processing, but they are not commonly used in other areas such as speech or natural language processing. We propose them because they show great stability and the ability to model long and short contextual dependencies in deep structures. Unlike in image processing, we use 1-dimensional convolutions, because in text processing we only focus on the temporal dimension. Moreover, this architecture allows us to work with both past and future context. This paper compares the architecture with the Long Short-Term Memory cells commonly used in this task, and also combines the two architectures, obtaining better results than either of them separately.
Citations: 4
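As a rough illustration of a 1-D residual block that convolves only along the temporal (word) dimension, here is a PyTorch sketch; the widening placement, kernel size and pre-activation layout are assumptions, since the abstract does not give the exact architecture:

```python
import torch
import torch.nn as nn

class WideResBlock1D(nn.Module):
    """One 1-D wide-residual block: two temporal convolutions with a skip
    connection, widened internally by factor k."""
    def __init__(self, channels, k=4, kernel_size=5):
        super().__init__()
        wide = channels * k
        pad = kernel_size // 2  # same-length output keeps word alignment
        self.body = nn.Sequential(
            nn.BatchNorm1d(channels), nn.ReLU(),
            nn.Conv1d(channels, wide, kernel_size, padding=pad),
            nn.BatchNorm1d(wide), nn.ReLU(),
            nn.Conv1d(wide, channels, kernel_size, padding=pad),
        )

    def forward(self, x):           # x: (batch, channels, words)
        return x + self.body(x)     # residual sum over the temporal axis

# Toy usage: 300-dim word embeddings for a 20-word sentence, batch of 2
x = torch.randn(2, 300, 20)
y = WideResBlock1D(300, k=2)(x)
print(y.shape)  # torch.Size([2, 300, 20])
```

The symmetric padding is what lets the block see both past and future words, matching the abstract's point about bidirectional context.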
Restricted Boltzmann Machine Vectors for Speaker Clustering
IberSPEECH Conference Pub Date: 2018-11-21 DOI: 10.21437/IBERSPEECH.2018-3
Muhammad Umair Ahmed Khan, Pooyan Safari, J. Hernando
Abstract: Restricted Boltzmann Machines (RBMs) have been used both in the front-end and the back-end of speaker verification systems. In this work, we apply RBMs as a front-end in the context of speaker clustering. Speakers' utterances are transformed into a vector representation by means of RBMs. These vectors, referred to as RBM vectors, have been shown to preserve speaker-specific information and are used for the task of speaker clustering. We perform traditional bottom-up Agglomerative Hierarchical Clustering (AHC). Using the RBM vector representation of speakers, the performance of speaker clustering is improved. The evaluation has been performed on audio recordings of Catalan TV broadcast shows. The experimental results show that our proposed system outperforms the baseline i-vector system in terms of Equal Impurity (EI). Using cosine scoring, relative improvements of 11% and 12% are achieved for the average and single linkage clustering algorithms respectively. Using PLDA scoring, the RBM vectors achieve a relative improvement of 11% over i-vectors for the single linkage algorithm.
Citations: 5
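The clustering stage above, bottom-up AHC with cosine scoring under single or average linkage, is standard. A minimal sketch over per-utterance speaker vectors, with toy data standing in for actual RBM vectors:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def cluster_speaker_vectors(vectors, n_clusters, method="single"):
    """Bottom-up AHC over per-utterance speaker vectors with cosine scoring;
    method may be "single" or "average", the two linkages used in the paper."""
    tree = linkage(np.asarray(vectors, dtype=float),
                   method=method, metric="cosine")
    return fcluster(tree, t=n_clusters, criterion="maxclust")

# Toy usage: 6 utterance vectors drawn around 2 simulated speaker centers
rng = np.random.default_rng(1)
centers = rng.normal(size=(2, 32))
vecs = centers[[0, 0, 1, 1, 0, 1]] + 0.05 * rng.normal(size=(6, 32))
print(cluster_speaker_vectors(vecs, n_clusters=2))  # e.g. [1 1 2 2 1 2]
```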