{"title":"Voichap: A standalone real-time voice change application on iOS platform","authors":"Xiaoling Wu, Shuhua Gao, Dong Huang, Cheng Xiang","doi":"10.1109/APSIPA.2017.8282129","DOIUrl":"https://doi.org/10.1109/APSIPA.2017.8282129","url":null,"abstract":"High-quality voice mimicry is appealing to everyone. However, only few vocal geniuses are endowed with the talent for vivid mimicry. Professional mimics have to be trained and practice over many years for various vocal skills, such as vocal control, precision in pitch, sense of rhythm and personal style, etc. To help achieve our dream for fascinating voice mimicry, such as speaking in a celebrity's voice, we have developed a real-time voice conversion technology for the general users. You can specify any target (like your friend or a celebrity) for your voice conversion as long as the target's training utterances are available. To facilitate easy use, we have implemented it efficiently as a mobile application on the iOS platform, called Voichap, which can generate a desired natural target voice. Notably, the complete training and conversion process is performed locally in a reasonable time, with no need for on-line server service, to improve the user experience. Just three steps are enough to use this application: choose a target, record your voice and then have fun listening to your converted voice.","PeriodicalId":142091,"journal":{"name":"2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"203 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132318018","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Data augmentation and feature extraction using variational autoencoder for acoustic modeling","authors":"H. Nishizaki","doi":"10.1109/APSIPA.2017.8282225","DOIUrl":"https://doi.org/10.1109/APSIPA.2017.8282225","url":null,"abstract":"A data augmentation and feature extraction method using a variational autoencoder (VAE) for acoustic modeling is described. A VAE is a generative model based on variational Bayesian learning using a deep learning framework. A VAE can extract latent values its input variables to generate new information. VAEs are widely used to generate pictures and sentences. In this paper, a VAE is applied to speech corpus data augmentation and feature vector extraction from speech for acoustic modeling. First, the size of a speech corpus is doubled by encoding latent variables extracted from original utterances using a VAE framework. The latent variables extracted from speech waveforms have latent \"meanings\" of the waveforms. Therefore, latent variables can be used as acoustic features for automatic speech recognition (ASR). This paper experimentally shows the effectiveness of data augmentation using a VAE framework and that latent variable-based features can be utilized in ASR.","PeriodicalId":142091,"journal":{"name":"2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"106 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132382884","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Emotion recognition by combining prosody and sentiment analysis for expressing reactive emotion by humanoid robot","authors":"Yuanchao Li, C. Ishi, Nigel G. Ward, K. Inoue, Shizuka Nakamura, K. Takanashi, Tatsuya Kawahara","doi":"10.1109/APSIPA.2017.8282243","DOIUrl":"https://doi.org/10.1109/APSIPA.2017.8282243","url":null,"abstract":"In order to achieve rapport in human-robot interaction, it is important to express a reactive emotion that matches with the user's mental state. This paper addresses an emotion recognition method which combines prosody and sentiment analysis for the system to properly express reactive emotion. In the user emotion recognition module, valence estimation from prosodic features is combined with sentiment analysis of text information. Combining the two information sources significantly improved the valence estimation accuracy. In the reactive emotion expression module, the system's emotion category and level are predicted using the parameters estimated in the recognition module, based on distributions inferred from human-human dialog data. Subjective evaluation results show that the proposed method is effective for expressing human-like reactive emotion.","PeriodicalId":142091,"journal":{"name":"2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131062947","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automatic vehicle classification using center strengthened convolutional neural network","authors":"Kuan-Chung Wang, Yoga Dwi Pranata, Jia-Ching Wang","doi":"10.1109/APSIPA.2017.8282187","DOIUrl":"https://doi.org/10.1109/APSIPA.2017.8282187","url":null,"abstract":"Vehicle classification is one of the major part for the smart road management system and traffic management system. The use of appropriate algorithms has a significant impact in the process of classification. In this paper, we propose a deep neural network, named center strengthened convolutional neural network (CS- CNN), for handling central part image feature enhancement with non-fixed size input. The main hallmark of this proposed architecture is center enhancement that extract additional feature from central of image by ROI pooling. Another, our CS-CNN, based on VGG network architecture, joint with ROI pooling layer to get elaborate feature maps. Our proposed method will be compared with other typical deep learning architecture like VGG-s and VGG-Verydeep-16. In the experiments, we show the outstanding performance which getting more than 97% accuracy on vehicle classification with only few training data from Caltech256 datasets.","PeriodicalId":142091,"journal":{"name":"2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"94 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132911473","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The acoustic characteristics of tone 3 in standard chinese produced by prelingually deaf adults","authors":"Yu Chen, Jie Hou, Yutong Xing, Yanting Chen, Hua Lin, J. Dang","doi":"10.1109/APSIPA.2017.8282105","DOIUrl":"https://doi.org/10.1109/APSIPA.2017.8282105","url":null,"abstract":"This paper studies the acoustic characteristics of Tone 3 produced by prelingually deaf adults and finds that the deaf females and males apply different strategies to realize this dipping-rising tone: for deaf females, they tend to use creaky voice in producing this tone; for deaf males, they adopt a longer duration and a slower turning to distinguish T3 from other tones in Standard Chinese. Moreover, results of this study support the viewpoint that the prelingually deaf adults could benefit from their longer experience of cochlear implant to improve their capability of tone's production.","PeriodicalId":142091,"journal":{"name":"2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133054959","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Compressed high dimensional features for speaker spoofing detection","authors":"Yuanjun Zhao, R. Togneri, V. Sreeram","doi":"10.1109/APSIPA.2017.8282108","DOIUrl":"https://doi.org/10.1109/APSIPA.2017.8282108","url":null,"abstract":"The vulnerability in Automatic Speaker Verification (ASV) systems to spoofing attacks such as speech synthesis (SS) and voice conversion (VC) has been recently proved. High- dimensional magnitude and phase based features possess outstanding spoofing detection performance but are not compatible with the Gaussian Mixture Model (GMM) classifiers which are commonly deployed in speaker recognition systems. In this paper, a Compressed Sensing (CS) framework is initially combined with high-dimensional (HD) features and a derived CS-HD based feature is proposed. A standalone spoofing detector assembled with the GMM classifier is evaluated on the ASVspoof 2015 database. Two ASV systems integrated with the spoofing detector are also tested. For the separate detector, an equal error rate (EER) of 0.01% and 5.35% are reached on the evaluation set for known attack and unknown attack, respectively. While for the ASV systems, the best EERs of 0.02% and 5.26% are achieved. The proposed CS-HD feature can obtain similar results with lower dimension than other systems. This suggests that the verification system can be made more computationally efficient.","PeriodicalId":142091,"journal":{"name":"2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"115 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133081358","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Lung sound classification based on Hilbert-Huang transform features and multilayer perceptron network","authors":"Yunxia Liu, Yang Yang, Yuehui Chen","doi":"10.1109/APSIPA.2017.8282137","DOIUrl":"https://doi.org/10.1109/APSIPA.2017.8282137","url":null,"abstract":"Accurate classification of lung sounds plays an important role in noninvasive diagnosis of pulmonary diseases. A novel lung sound classification algorithm based on Hilbert-Huang transform (HHT) features and multilayer perceptron network is proposed in this paper. Three types of HHT domain features, namely the instantaneous envelope amplitude of intrinsic mode functions (IMF), envelop of instantaneous amplitude of the first four layers IMFs, and max value of the marginal spectrum are proposed for jointly characterization of the time-frequency properties of lung sounds. These proposed features are feed into a multi-layer perceptron neural network for training and testing of lung sound signal classification. Abundant experimental work is carried out to verify the effectiveness of the proposed algorithm.","PeriodicalId":142091,"journal":{"name":"2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132728083","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fuzzy qualitative approach for micro-expression recognition","authors":"C. H. Lim, Kam Meng Goh","doi":"10.1109/APSIPA.2017.8282300","DOIUrl":"https://doi.org/10.1109/APSIPA.2017.8282300","url":null,"abstract":"Micro-expression recognition has received increasing attention in the field of computer vision nowadays. Many state-of-the-art approaches have been reported but it can be seen that most of the results are capped at a certain level of accuracy. This is due to the ambiguity that abounded during the extraction of extremely short period of facial movements. These ambiguities deteriorate the performance of the overall recognition rate if using crisp classifier. This paper proposed to study the micro-expression as a non-mutual exclusive classification problem and examine the effectiveness of multi-label classification in micro-expression recognition by using the Fuzzy Qualitative Rank Classifier (FQRC). In addition, the extension of FQRC with feature selection and part-based model is proposed which shows promising results after tested on CASME II dataset.","PeriodicalId":142091,"journal":{"name":"2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131321012","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A fast and energy efficient FPGA-based system for real-time object tracking","authors":"Xiaobai Chen, Jinlong Xu, Zhiyi Yu","doi":"10.1109/APSIPA.2017.8282162","DOIUrl":"https://doi.org/10.1109/APSIPA.2017.8282162","url":null,"abstract":"Visual object tracking has achieved great advances in the past decades and has been widely applied in vision-based applications. Due to the popularization of the power-sensitive mobile platform, robust and low power real-time tracking solution is strongly required. An energy efficient real-time object tracking system on both static and moving camera is proposed in this paper. The system reduces the computational cost and explores data reuse by optimizing the tracking algorithm, the data flow, and the parallelism strategies. The architecture is implemented on a Xilinx ZC706 FPGA, and the experimental data shows that the system obtains 41 frame/s throughput for the 640×480 video and achieves higher energy efficiency comparing to other similar works.","PeriodicalId":142091,"journal":{"name":"2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114820305","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"End-to-end speech recognition for languages with ideographic characters","authors":"Hitoshi Ito, Aiko Hagiwara, Manon Ichiki, T. Mishima, Shoei Sato, A. Kobayashi","doi":"10.1109/APSIPA.2017.8282226","DOIUrl":"https://doi.org/10.1109/APSIPA.2017.8282226","url":null,"abstract":"This paper describes a novel training method for acoustic models using connectionist temporal classification (CTC) for Japanese end-to-end automatic speech recognition (ASR). End-to-end ASR can estimate characters directly without using a pronunciation dictionary; however, this approach was conducted mostly in the English research area. When dealing with languages such as Japanese, we confront difficulties with robust acoustic modeling. One of the issues is caused by a large number of characters, including Japanese kanji, which leads to an increase in the number of model parameters. Additionally, multiple pronunciations of kanji increase the variance of acoustic features for corresponding characters. Therefore, we propose end-to-end ASR based on bi-directional long short-term memory (BLSTM) networks to solve these problems. Our proposal involves two approaches: reducing the number of dimensions of BLSTM and adding character strings to output layer labels. Dimensional compression decreases the number of parameters, while output label expansion reduces the variance of acoustic features. Consequently, we could obtain a robust model with a small number of parameters. Our experimental results with Japanese broadcast programs show the combined method of these two approaches improved the word error rate significantly compared with the conventional character-based end-to-end approach.","PeriodicalId":142091,"journal":{"name":"2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117254434","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}