{"title":"Audio Caption in a Car Setting with a Sentence-Level Loss","authors":"Xuenan Xu, Heinrich Dinkel, Mengyue Wu, Kai Yu","doi":"10.1109/ISCSLP49672.2021.9362117","DOIUrl":"https://doi.org/10.1109/ISCSLP49672.2021.9362117","url":null,"abstract":"Captioning has attracted much attention in image and video understanding while a small amount of work examines audio captioning. This paper contributes a Mandarin-annotated dataset for audio captioning within a car scene. A sentence-level loss is proposed to be used in tandem with a GRU encoder-decoder model to generate captions with higher semantic similarity to human annotations. We evaluate the model on the newly-proposed Car dataset, a previously published Mandarin Hospital dataset and the Joint dataset, indicating its generalization capability across different scenes. An improvement in all metrics can be observed, including classical natural language generation (NLG) metrics, sentence richness and human evaluation ratings. However, though detailed audio captions can now be automatically generated, human annotations still outperform model captions on many aspects.","PeriodicalId":279828,"journal":{"name":"2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134090603","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards Realizing Sign Language to Emotional Speech Conversion by Deep Learning","authors":"Nan Song, Hongwu Yang, Pengpeng Zhi","doi":"10.1109/ISCSLP49672.2021.9362060","DOIUrl":"https://doi.org/10.1109/ISCSLP49672.2021.9362060","url":null,"abstract":"This paper proposes a framework of sign language to emotional speech conversion based on deep learning to solve communication disorders between people with language barriers and healthy people. We firstly trained a gesture recognition model and a facial expression recognition model by a deep convolutional generative adversarial network (DCGAN). Then we trained an emotional speech acoustic model with a hybrid long short-term memory (LSTM). We select the initials and the finals of Mandarin as the emotional speech synthesis units to train a speaker-independent average voice model (AVM). The speaker adaptation is applied to train a speaker-dependent hybrid LST-M model with one target speaker emotional corpus from AVM. Finally, we combine the gesture recognition model and facial expression recognition model with the emotional speech synthesis model to realize the sign language to emotional speech conversion. The experiments show that the recognition rate of gesture recognition is 93.96%, and the recognition rate of facial expression recognition in the CK+ database is 96.01%. The converted emotional speech not only has high quality but also can accurately express the facial expression.","PeriodicalId":279828,"journal":{"name":"2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126064079","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}