{"title":"Getting Your Conversation on Track: Estimation of Residual Life for Conversations","authors":"Zexin Lu, Jing Li, Yingyi Zhang, Haisong Zhang","doi":"10.1109/SLT48900.2021.9383544","DOIUrl":"https://doi.org/10.1109/SLT48900.2021.9383544","url":null,"abstract":"This paper presents a predictive study on the progress of conversations. Specifically, we estimate the residual life for conversations, which is defined as the count of new turns to occur in a conversation thread. While most previous work focus on coarse-grained estimation that classifies the number of coming turns into two categories, we study fine-grained categorization for varying lengths of residual life. To this end, we propose a hierarchical neural model that jointly explores indicative representations from the content in turns and the structure of conversations in an end-to-end manner. Extensive experiments on both human-human and human-machine conversations demonstrate the superiority of our proposed model and its potential helpfulness in chatbot response selection.","PeriodicalId":243211,"journal":{"name":"2021 IEEE Spoken Language Technology Workshop (SLT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129414269","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Large-Context Conversational Representation Learning: Self-Supervised Learning For Conversational Documents","authors":"Ryo Masumura, Naoki Makishima, Mana Ihori, Akihiko Takashima, Tomohiro Tanaka, Shota Orihashi","doi":"10.1109/SLT48900.2021.9383584","DOIUrl":"https://doi.org/10.1109/SLT48900.2021.9383584","url":null,"abstract":"This paper presents a novel self-supervised learning method for handling conversational documents consisting of transcribed text of human-to-human conversations. One of the key technologies for understanding conversational documents is utterance-level sequential labeling, where labels are estimated from the documents in an utterance-by-utterance manner. The main issue with utterance-level sequential labeling is the difficulty of collecting labeled conversational documents, as manual annotations are very costly. To deal with this issue, we propose large-context conversational representation learning (LC-CRL), a self-supervised learning method specialized for conversational documents. A self-supervised learning task in LC-CRL involves the estimation of an utterance using all the surrounding utterances based on large-context language modeling. In this way, LC-CRL enables us to effectively utilize unlabeled conversational documents and thereby enhances the utterance-level sequential labeling. The results of experiments on scene segmentation tasks using contact center conversational datasets demonstrate the effectiveness of the proposed method.","PeriodicalId":243211,"journal":{"name":"2021 IEEE Spoken Language Technology Workshop (SLT)","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133875541","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improved Parallel Wavegan Vocoder with Perceptually Weighted Spectrogram Loss","authors":"Eunwoo Song, Ryuichi Yamamoto, Min-Jae Hwang, Jin-Seob Kim, Ohsung Kwon, Jae-Min Kim","doi":"10.1109/SLT48900.2021.9383549","DOIUrl":"https://doi.org/10.1109/SLT48900.2021.9383549","url":null,"abstract":"This paper proposes a spectral-domain perceptual weighting technique for Parallel WaveGAN-based text-to-speech (TTS) systems. The recently proposed Parallel WaveGAN vocoder successfully generates waveform sequences using a fast non-autoregressive WaveNet model. By employing multi-resolution short-time Fourier transform (MR-STFT) criteria with a generative adversarial network, the light-weight convolutional networks can be effectively trained without any distillation process. To further improve the vocoding performance, we propose the application of frequency-dependent weighting to the MR-STFT loss function. The proposed method penalizes perceptually-sensitive errors in the frequency domain; thus, the model is optimized toward reducing auditory noise in the synthesized speech. Subjective listening test results demonstrate that our proposed method achieves 4.21 and 4.26 TTS mean opinion scores for female and male Korean speakers, respectively.","PeriodicalId":243211,"journal":{"name":"2021 IEEE Spoken Language Technology Workshop (SLT)","volume":"43 3","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132870915","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Articulatory Comparison of L1 and L2 Speech for Mispronunciation Diagnosis","authors":"Subash Khanal, Michael T. Johnson, Narjes Bozorg","doi":"10.1109/SLT48900.2021.9383574","DOIUrl":"https://doi.org/10.1109/SLT48900.2021.9383574","url":null,"abstract":"This paper compares the difference in articulation patterns between native (L1) and non-native (L2) Mandarin speakers of English, for the purpose of providing an understanding of mispronunciation behaviors of L2 learners. Consensus transcriptions from the Electromagnetic Articulography Mandarin Accented English (EMA-MAE) corpus are used to identify commonly occurring substitution errors for consonants and vowels. Phoneme level alignments of the utterances produced by speech recognition models are used to extract articulatory feature vectors representing correct and substituted sounds from L1 and L2 speaker groups respectively. The articulatory features that are significantly different between the two groups are identified along with the direction of error for the L2 speaker group. Experimental results provide information about which types of substitutions are most common and which specific articulators are the most significant contributors to those errors.","PeriodicalId":243211,"journal":{"name":"2021 IEEE Spoken Language Technology Workshop (SLT)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124916147","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient Large Scale Semi-Supervised Learning for CTC Based Acoustic Models","authors":"Prakhar Swarup, D. Chakrabarty, A. Sapru, Hitesh Tulsiani, Harish Arsikere, S. Garimella","doi":"10.1109/SLT48900.2021.9383536","DOIUrl":"https://doi.org/10.1109/SLT48900.2021.9383536","url":null,"abstract":"Semi-supervised learning (SSL) is an active area of research which aims to utilize unlabeled data to improve the accuracy of speech recognition systems. While the previous studies have established the efficacy of various SSL methods on varying amounts of data, this paper presents largest ASR SSL experiment ever conducted till date where 75K hours of labeled and 1.2 million hours of unlabeled data is used for model training. In addition, the paper introduces couple of novel techniques to facilitate such a large scale experiment: 1) a simple scalable Teacher-Student based SSL method for connectionist temporal classification (CTC) objective and 2) effective data selection mechanisms for leveraging massive amounts of unlabeled data to boost the performance of student models. Further, we apply SSL in all stages of the acoustic model training, including final stage sequence discriminative training. Our experiments indicate encouraging word error rate (WER) gains up to 14% in such a large transcribed data regime due to the SSL training.","PeriodicalId":243211,"journal":{"name":"2021 IEEE Spoken Language Technology Workshop (SLT)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127445329","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Exploration of Log-Mel Spectrogram and MFCC Features for Alzheimer’s Dementia Recognition from Spontaneous Speech","authors":"Amit Meghanani, S. AnoopC., A. Ramakrishnan","doi":"10.1109/SLT48900.2021.9383491","DOIUrl":"https://doi.org/10.1109/SLT48900.2021.9383491","url":null,"abstract":"In this work, we explore the effectiveness of log-Mel spectrogram and MFCC features for Alzheimer’s dementia (AD) recognition on ADReSS challenge dataset. We use three different deep neural networks (DNN) for AD recognition and mini-mental state examination (MMSE) score prediction: (i) convolutional neural network followed by a long-short term memory network (CNN-LSTM), (ii) pre-trained ResNet18 network followed by LSTM (ResNet-LSTM), and (iii) pyramidal bidirectional LSTM followed by a CNN (pBLSTM-CNN). CNN-LSTM achieves an accuracy of 64.58% with MFCC features and ResNet-LSTM achieves an accuracy of 62.5% using log-Mel spectrograms. pBLSTM-CNN and ResNet-LSTM models achieve root mean square errors (RMSE) of 5.9 and 5.98 in the MMSE score prediction, using the log-Mel spectrograms. Our results beat the baseline accuracy (62.5%) and RMSE (6.14) reported for acoustic features on ADReSS challenge dataset. The results suggest that log-Mel spectrograms and MFCCs are effective features for AD recognition problem when used with DNN models.","PeriodicalId":243211,"journal":{"name":"2021 IEEE Spoken Language Technology Workshop (SLT)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129168583","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Personalized Extractive Summarization for a News Dialogue System","authors":"Hiroaki Takatsu, Mayu Okuda, Yoichi Matsuyama, Hiroshi Honda, S. Fujie, Tetsunori Kobayashi","doi":"10.1109/SLT48900.2021.9383568","DOIUrl":"https://doi.org/10.1109/SLT48900.2021.9383568","url":null,"abstract":"In modern society, people’s interests and preferences are diversifying. Along with this, the demand for personalized summarization technology is increasing. In this study, we propose a method for generating summaries tailored to each user’s interests using profile features obtained from questionnaires administered to users of our spoken-dialogue news delivery system. We propose a method that collects and uses the obtained user profile features to generate a summary tailored to each user’s interests, specifically, the sentence features obtained by BERT and user profile features obtained from the questionnaire result. In addition, we propose a method for extracting sentences by solving an integer linear programming problem that considers redundancy and context coherence, using the degree of interest in sentences estimated by the model. The results of our experiments confirmed that summaries generated based on the degree of interest in sentences estimated using user profile information can transmit information more efficiently than summaries based solely on the importance of sentences.","PeriodicalId":243211,"journal":{"name":"2021 IEEE Spoken Language Technology Workshop (SLT)","volume":"187 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115879224","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Transformer-Based Direct Speech-To-Speech Translation with Transcoder","authors":"Takatomo Kano, S. Sakti, Satoshi Nakamura","doi":"10.1109/SLT48900.2021.9383496","DOIUrl":"https://doi.org/10.1109/SLT48900.2021.9383496","url":null,"abstract":"Traditional speech translation systems use a cascade manner that concatenates speech recognition (ASR), machine translation (MT), and text-to-speech (TTS) synthesis to translate speech from one language to another language in a step-by-step manner. Unfortunately, since those components are trained separately, MT often struggles to handle ASR errors, resulting in unnatural translation results. Recently, one work attempted to construct direct speech translation in a single model. The model used a multi-task scheme that learns to predict not only the target speech spectrograms directly but also the source and target phoneme transcription as auxiliary tasks. However, that work was only evaluated Spanish-English language pairs with similar syntax and word order. With syntactically distant language pairs, speech translation requires distant word order, and thus direct speech frame-to-frame alignments become difficult. Another direction was to construct a single deep-learning framework while keeping the step-by-step translation process. However, such studies focused only on speech-to-text translation. Furthermore, all of these works were based on a recurrent neural net-work (RNN) model. In this work, we propose a step-by-step scheme to a complete end-to-end speech-to-speech translation and propose a Transformer-based speech translation using Transcoder. We compare our proposed and multi-task model using syntactically similar and distant language pairs.","PeriodicalId":243211,"journal":{"name":"2021 IEEE Spoken Language Technology Workshop (SLT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130703512","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Through the Words of Viewers: Using Comment-Content Entangled Network for Humor Impression Recognition","authors":"Huan-Yu Chen, Yun-Shao Lin, Chi-Chun Lee","doi":"10.1109/SLT48900.2021.9383564","DOIUrl":"https://doi.org/10.1109/SLT48900.2021.9383564","url":null,"abstract":"Research into understanding humor has been investigated over centuries. It has recently attracted various technical effort in computing humor automatically from data, especially for humor in speech. Comprehension on the same speech and the ability to realize a humor event vary depending on each individual audience’s background and experience. Most previous works on automatic humor detection or impression recognition mainly model the produced textual content only without considering audience responses. We collect a corpus of TED Talks including audience comments for each of the presented TED speech. We propose a novel network architecture that considers the natural entanglement between speech transcripts and user’s online feedbacks as an integrative graph structure, where the content speech and online feedbacks are nodes where the edges are connected though their common words. Our model achieves 61.2% of accuracy in a three-class classification on humor impression recognition on TED talks; our experiments further demonstrate viewers comments are essential in improving the recognition tasks, and a joint content-comment modeling achieves the best recognition.","PeriodicalId":243211,"journal":{"name":"2021 IEEE Spoken Language Technology Workshop (SLT)","volume":"19 5part1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113963900","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Analysis of Multimodal Features for Speaking Proficiency Scoring in an Interview Dialogue","authors":"Mao Saeki, Yoichi Matsuyama, Satoshi Kobashikawa, Tetsuji Ogawa, Tetsunori Kobayashi","doi":"10.1109/SLT48900.2021.9383590","DOIUrl":"https://doi.org/10.1109/SLT48900.2021.9383590","url":null,"abstract":"This paper analyzes the effectiveness of different modalities in automated speaking proficiency scoring in an online dialogue task of non-native speakers. Conversational competence of a language learner can be assessed through the use of multimodal behaviors such as speech content, prosody, and visual cues. Although lexical and acoustic features have been widely studied, there has been no study on the usage of visual features, such as facial expressions and eye gaze. To build an automated speaking proficiency scoring system using multi-modal features, we first constructed an online video interview dataset of 210 Japanese English-learners with annotations of their speaking proficiency. We then examined two approaches for incorporating visual features and compared the effectiveness of each modality. Results show the end-to-end approach with deep neural networks achieves a higher correlation with human scoring than one with handcrafted features. Modalities are effective in the order of lexical, acoustic, and visual features.","PeriodicalId":243211,"journal":{"name":"2021 IEEE Spoken Language Technology Workshop (SLT)","volume":"185 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114554111","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}