{"title":"Multimodal speech recognition using mouth images from depth camera","authors":"Y. Yasui, Nakamasa Inoue, K. Iwano, K. Shinoda","doi":"10.1109/APSIPA.2017.8282227","DOIUrl":"https://doi.org/10.1109/APSIPA.2017.8282227","url":null,"abstract":"Deep learning has proved effective in multimodal speech recognition using frontal face images. In this paper, we propose a new deep learning method, a trimodal deep autoencoder, which uses not only audio signals and face images, but also depth images of faces, as its inputs. We collected continuous speech data from 20 speakers with Kinect 2.0 and used them for our evaluation. The experimental results showed that, at 10 dB SNR, our method reduced the error rate by 30% relative, from 34.6% with audio-only speech recognition to 24.2%. In particular, it is effective for recognizing some consonants, including /k/ and /t/.","PeriodicalId":142091,"journal":{"name":"2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"65 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117067811","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deep learning-based speaking rate-dependent hierarchical prosodic model for Mandarin TTS","authors":"Yen-Ting Lin, Chen-Yu Chiang","doi":"10.1109/APSIPA.2017.8282228","DOIUrl":"https://doi.org/10.1109/APSIPA.2017.8282228","url":null,"abstract":"The Speaking Rate-dependent Hierarchical Prosodic Model (SR-HPM) is a syllable-based statistical prosodic model that has successfully served as the prosody generation model in a speaking rate-controlled text-to-speech system for Mandarin and two Chinese dialects: Taiwan Min and Si-Xian Hakka. Motivated by the success of deep learning (DL) techniques in parametric speech synthesis based on the HMM-based speech synthesis system, this study aims to improve the prosody generation performance of the SR-HPM by replacing the conventional cascaded statistical sub-models with DL-based models, i.e. the DL-based SR-HPM. Each sub-model is first independently realized by a specially designed DL-based model according to its input-output characteristics. Then, all sub-models are cascaded and unified into one deep neural structure whose parameters are obtained in an end-to-end (linguistic feature to prosodic acoustic feature) optimization manner. Subjective and objective tests show that the DL-based SR-HPM performs better than the conventional statistical SR-HPM in prosody generation.","PeriodicalId":142091,"journal":{"name":"2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117350345","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automatic identification of pathological voice quality based on the GRBAS categorization","authors":"A. Sasou","doi":"10.1109/APSIPA.2017.8282229","DOIUrl":"https://doi.org/10.1109/APSIPA.2017.8282229","url":null,"abstract":"Acoustic analysis-based automatic detection of voice pathologies enables non-invasive, low-cost, and objective assessment of the presence of disorders, which might help accelerate and improve the diagnosis and clinical treatment given to patients. In this paper, we focus on the automatic assessment of pathological voice quality by identifying the four attributes of Roughness, Breathiness, Asthenia, and Strain based on the GRBAS categorization. The proposed method adopts higher-order local auto-correlation (HLAC) features, which are calculated from the excitation source signal obtained by an automatic topology-generated AR-HMM analysis, and identifies the four attributes using a feed-forward neural network (FFNN)-based classifier. In the experiments, an average F-measure of 87.25% was obtained for a speaker-based identification task, which confirms the feasibility of the proposed method.","PeriodicalId":142091,"journal":{"name":"2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"320 3","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120838367","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dictionary design and disparity interpolation on distributed compressed sensing for light field image","authors":"Yusaku Akiyoshi, T. Sumi, Y. Kuroki","doi":"10.1109/APSIPA.2017.8282035","DOIUrl":"https://doi.org/10.1109/APSIPA.2017.8282035","url":null,"abstract":"This paper discusses two aspects of distributed compressed sensing for multi-view-point images generated with a light field camera. The first is the performance of a dictionary designed with ADMM (Alternating Direction Method of Multipliers) compared with that of one designed with K-SVD, since reconstructed image quality depends on the dictionary. The second is disparity interpolation of non-key frames; the interpolation accuracy contributes to reconstructed image quality and effective dictionary design. This paper therefore compares three disparity interpolation methods: overlapped block matching, conventional optical flow, and TVL1 optical flow. Experimental results show that dictionary design with ADMM is faster than with K-SVD, although PSNR values using ADMM are slightly lower than those of K-SVD, and that TVL1 optical flow is the fastest while holding the highest PSNR values.","PeriodicalId":142091,"journal":{"name":"2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"65 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127330333","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Identifying computer-generated text using statistical analysis","authors":"Hoang-Quoc Nguyen-Son, Ngoc-Dung T. Tieu, H. Nguyen, Junichi Yamagishi, Isao Echizen","doi":"10.1109/APSIPA.2017.8282270","DOIUrl":"https://doi.org/10.1109/APSIPA.2017.8282270","url":null,"abstract":"Computer-based automatically generated text is used in various applications (e.g., text summarization, machine translation) and has come to play an important role in daily life. However, computer-generated text may convey confusing information due to translation errors and inappropriate wording caused by faulty language processing, which could be a critical issue in presidential elections and product advertisements. Previous methods for detecting computer-generated text typically estimate text fluency, but this may not be useful in the near future due to the development of neural-network-based natural language generation that produces wording close to human-crafted wording. A different approach to detecting computer-generated text is thus needed. We hypothesize that human-crafted wording is more consistent than that of a computer. For instance, Zipf's law states that the most frequent word in human-written text has approximately twice the frequency of the second most frequent word, nearly three times that of the third most frequent word, and so on. We found that this is not true of computer-generated text. We hence propose a method to identify computer-generated text on the basis of statistics. First, the word frequency distributions are compared with the corresponding Zipfian distributions to extract frequency features. Next, complex phrase features are extracted because human-generated text contains more complex phrases than computer-generated text. Finally, the higher consistency of human-generated text is quantified both at the sentence level using phrasal verbs and at the paragraph level using coreference resolution relationships, which are integrated into consistency features. The combination of the frequency, complex phrase, and consistency features was evaluated on 100 English books written originally in English and 100 English books translated from Finnish. The results show that our method achieves better performance (accuracy = 98.0%; equal error rate = 2.9%) than the most suitable existing method for books, which uses parsing tree feature extraction. Evaluation on two other languages (French and Dutch) showed similar results. The proposed method thus works consistently across various languages.","PeriodicalId":142091,"journal":{"name":"2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127511393","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Feedforward sequential memory networks based encoder-decoder model for machine translation","authors":"Junfeng Hou, Shiliang Zhang, Lirong Dai, Hui Jiang","doi":"10.1109/APSIPA.2017.8282100","DOIUrl":"https://doi.org/10.1109/APSIPA.2017.8282100","url":null,"abstract":"The recurrent neural network based encoder-decoder model has recently become a popular approach to sequence-to-sequence mapping problems such as machine translation. However, the model is time-consuming to train, since symbols in a sequence cannot be processed in parallel by recurrent neural networks because of their temporal dependencies. In this paper we present a sequence-to-sequence model that replaces the recurrent neural networks with feedforward sequential memory networks in both the encoder and the decoder, which enables the new architecture to encode the entire source sentence simultaneously. We also modify the attention module so that the decoder generates outputs simultaneously during training. We achieve comparable results on the WMT'14 English-to-French translation task while training 1.4 to 2 times faster, owing to the temporal independence of the feedforward sequential memory network based encoder and decoder.","PeriodicalId":142091,"journal":{"name":"2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"122 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123713131","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A segmental DNN/i-vector approach for digit-prompted speaker verification","authors":"Jie Yan, Lei Xie, Guangsen Wang, Zhonghua Fu","doi":"10.1109/APSIPA.2017.8281992","DOIUrl":"https://doi.org/10.1109/APSIPA.2017.8281992","url":null,"abstract":"DNN/i-vectors have achieved state-of-the-art performance in text-independent speaker verification systems. For such systems, the UBM posteriors are replaced with the DNN posteriors when training the i-vector extractor to better model the phonetic space. However, DNN/i-vector systems have had limited success in text-dependent speaker verification, as the lexical variabilities, which are important for such applications, are suppressed in the utterance-level i-vectors. In this paper, we propose a segmental DNN/i-vector approach for the digit-prompted speaker verification task. Specifically, we segment the utterance into digits and model each digit using an individual DNN/i-vector system. By modeling the variability of each digit independently, we can focus more on the speaker characteristics of each digit. To take into account the uncertainties in the DNN posteriors, we propose a confidence measure based weighting method. On the RSR2015 dataset, the proposed approach yields an equal error rate of 3.44%, compared to 5.76% for the baseline utterance-level DNN/i-vector system and 4.54% for the joint factor analysis (JFA) system.","PeriodicalId":142091,"journal":{"name":"2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114920957","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-layer background sprite model for 2D-to-3D video conversion","authors":"W. Lie, Chih-Hao Hu, Yi-Kai Chen, J. Chiang","doi":"10.1109/APSIPA.2017.8282033","DOIUrl":"https://doi.org/10.1109/APSIPA.2017.8282033","url":null,"abstract":"This paper presents a technique for semi-automatic 2D-to-3D stereo video conversion, in which users intervene to segment foregrounds and assign the corresponding depth information for key frames, and depth maps for the remaining non-key frames are then obtained by automatic depth propagation. Our algorithm departs from the traditional depth propagation paradigm based on motion estimation and compensation. For foregrounds in non-key frames, object kernels standing for the most confident parts are identified first and then used as seeds for graph-cut segmentation. Since the graph-cut segmentation of foregrounds is performed independently for each non-key frame, the results are free of limitations imposed by objects' motion activity. For backgrounds, all video frames, after the foregrounds are removed, are integrated into a common multi-layer background sprite model (ML-BSM) based on an image registration algorithm. Users can then draw background depth profiles for the ML-BSM in a video-based manner (not frame-based), significantly reducing the human effort required. Our ML-BSM algorithm is an extension of our prior work, BSM [8], aiming to handle cases in which the foreground and the background have a large depth variation or the camera has a substantial panning/rotating motion. Experiments show that adopting the multi-layer BSM architecture and iterative foreground refinement based on BSM validation significantly improves depth image quality.","PeriodicalId":142091,"journal":{"name":"2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117075972","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Comparison analysis of ICA versus MCA-KSVD blind source separation on task-related fMRI data","authors":"Nam H. Le, Khang N. Nguyen, Hien M. Nguyen","doi":"10.1109/APSIPA.2017.8282196","DOIUrl":"https://doi.org/10.1109/APSIPA.2017.8282196","url":null,"abstract":"Decomposition of the working brain into meaningful clusters of regions is an important step toward understanding brain functionality. Blind source separation algorithms can achieve this, yielding diverse outcomes that depend on the underlying assumptions of the decomposition algorithm in use. The conventional data-driven method to detect brain functional networks is Independent Component Analysis (ICA), which assumes that the decomposed components are statistically independent of each other. However, such a mathematical assumption is physiologically uncertain with regard to its application in functional MRI (fMRI) studies. The recently proposed MCA-KSVD method, which stands for Morphological Component Analysis implemented using a K-SVD algorithm, relaxes the independence assumption imposed by ICA. In this study, a comprehensive comparison between the conventional ICA and MCA-KSVD methods was conducted under various simulated noise conditions. Experimental results showed that in a task-related fMRI experiment, the MCA-KSVD method successfully identified the same networks as those detected by ICA and offered better signal localization and spatial resolution. However, improper choices of the sparsity parameter and the number of trained atoms introduced undesirable phenomena, namely signal leakage, signal splitting, and signal ambiguity. The MCA-KSVD method could thus be used as an alternative to, or in parallel with, the ICA method, but with careful consideration of model parameter selection.","PeriodicalId":142091,"journal":{"name":"2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129574524","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Spatial multi-channel linear prediction for dereverberation of ad-hoc microphones","authors":"Shahab Pasha, C. Ritz, Y. Zou","doi":"10.1109/APSIPA.2017.8282306","DOIUrl":"https://doi.org/10.1109/APSIPA.2017.8282306","url":null,"abstract":"A spatially modified multi-channel linear prediction analysis is proposed and tested for the dereverberation of ad-hoc microphone arrays. The proposed spatial multi-channel linear prediction takes into account the estimated spatial distance between each microphone and the source, and is applied for short-term dereverberation (pre-whitening). Delayed linear prediction is then applied to suppress the late reverberation. Results suggest that the proposed method outperforms standard linear prediction based methods when applied to ad-hoc microphones. It is also concluded that the kurtosis of the linear prediction residual signal is a reliable distance feature when the microphone gains are inconsistent and the sources' energy levels vary.","PeriodicalId":142091,"journal":{"name":"2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128743303","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}