{"title":"Perceptual evaluation of singing quality","authors":"Chitralekha Gupta, Haizhou Li, Ye Wang","doi":"10.1109/APSIPA.2017.8282110","DOIUrl":"https://doi.org/10.1109/APSIPA.2017.8282110","url":null,"abstract":"A perceptually valid automatic singing evaluation score could serve as a complement to singing lessons, and make singing training more accessible to the masses. In this study, we adopt the idea behind the PESQ (Perceptual Evaluation of Speech Quality) scoring metric, and propose various perceptually relevant features to evaluate singing quality. We correlate the obtained singing quality score, which we term the Perceptual Evaluation of Singing Quality (PESnQ) score, with that given by music-expert human judges, and compare the results with known baseline systems. It is shown that the proposed PESnQ has a correlation of 0.59 with human ratings, an improvement of ∼96% over baseline systems.","PeriodicalId":142091,"journal":{"name":"2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122039966","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Word level prosody prediction using large audiobook dataset","authors":"Yanfeng Lu, Chenyu Yang, M. Dong","doi":"10.1109/APSIPA.2017.8282218","DOIUrl":"https://doi.org/10.1109/APSIPA.2017.8282218","url":null,"abstract":"Prosody modelling is an essential part of the text-to-speech synthesis system. In this paper, we propose and investigate a way to leverage public domain audiobook data for word level prosody modelling. Specifically, we base our work on the LibriSpeech project, in which a large quantity of public domain audiobook data from LibriVox was processed, selected and aligned with text. We choose a long short-term memory (LSTM) recurrent neural network as the modelling tool. The input word features span phonetic, syntactic, and semantic layers. The word prosody features include log F0, energy and after-word break. A way of incorporating the word prosody model into the speech synthesis system is also proposed. Experiments show that this is an effective way to leverage a large quantity and variety of speech data for prosody modelling in speech synthesis.","PeriodicalId":142091,"journal":{"name":"2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123972952","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fast locally linear embedding algorithm for exemplar-based voice conversion","authors":"Yu-Huai Peng, Chin-Cheng Hsu, Yi-Chiao Wu, Hsin-Te Hwang, Yi-Wen Liu, Yu Tsao, H. Wang","doi":"10.1109/APSIPA.2017.8282112","DOIUrl":"https://doi.org/10.1109/APSIPA.2017.8282112","url":null,"abstract":"The locally linear embedding (LLE) algorithm has been proven to have high output quality and applicability for voice conversion (VC) tasks. However, the major shortcoming of the LLE-based VC approach is the time complexity (especially in the matrix inversion process) during the conversion phase. In this paper, we propose a fast version of the LLE algorithm that significantly reduces the complexity. In the proposed method, each locally linear patch on the data manifold is described by a pre-computed cluster of exemplars, and thus the major part of on-line computation can be carried out beforehand in the off-line phase. Experimental results demonstrate that the VC performance of the proposed fast LLE algorithm is comparable to that of the original LLE algorithm and that a real-time VC system becomes possible because of the highly reduced time complexity.","PeriodicalId":142091,"journal":{"name":"2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"184 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124676808","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SIMD acceleration for HEVC encoding on DSP","authors":"Yongfei Zhang, Rui Fan, Chao Zhang, G. Wang, Zhe Li","doi":"10.1109/APSIPA.2017.8282310","DOIUrl":"https://doi.org/10.1109/APSIPA.2017.8282310","url":null,"abstract":"As the new generation video coding standard, High Efficiency Video Coding (HEVC) significantly improves video compression efficiency, which, however, comes at the cost of a computational load far exceeding the real-time capacity of general-purpose processors and real-time video applications. In this paper, we focus on the SIMD-based fast implementation of the HEVC encoder on modern TI Digital Signal Processors (DSPs). We first profile the DSP-based HEVC encoder and identify the most time-consuming encoding modules. Then SIMD instructions are exploited to improve the parallel computing capacity of these modules and thus speed up the encoder. The experimental results show that the proposed implementations can significantly improve the encoding speed of the DSP-based HEVC encoder, with a speedup ratio of 8.38–87.32 over the original C-based encoder and 1.59–6.56 over the O3-optimization-enabled encoder.","PeriodicalId":142091,"journal":{"name":"2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"71 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128385304","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A comparison study of information contributions of phonemic contrasts in Mandarin","authors":"Yue Chen, Yanlu Xie, Jinsong Zhang","doi":"10.1109/APSIPA.2017.8282275","DOIUrl":"https://doi.org/10.1109/APSIPA.2017.8282275","url":null,"abstract":"Phonemic contrasts are the basis of speech communication. Previous studies have indicated that different phonemic contrasts make different information contributions. The inherent relationships between phonemes in information transmission can interpret various phenomena in speech and provide guidance for linguistic studies such as diachronic linguistics. To reveal the distribution structure of phonemes in Chinese, this paper used multidimensional scaling to comparatively analyze the information contributions of Initials and Finals (Chinese sub-syllabic units) in Mandarin. The contributions can be quantitatively measured by functional loads (FLs). The experimental results showed that: a) Initials at the same articulation place with different manners are more likely to have higher values of FLs, while Initials with the same manner at different places have lower values of FLs. b) Finals sharing the same onset vowels but different main vowels tend to have higher values of FLs. c) For both Initials and Finals, the closer the articulation places or the tongue positions of their onset vowels, the higher their values of FLs.","PeriodicalId":142091,"journal":{"name":"2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127269520","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Detection of various image operations based on CNN","authors":"Hongshen Tang, R. Ni, Yao Zhao, Xiaolong Li","doi":"10.1109/APSIPA.2017.8282267","DOIUrl":"https://doi.org/10.1109/APSIPA.2017.8282267","url":null,"abstract":"Over the past years, a number of effective digital image forensic techniques have been proposed. However, most of them design features for a specific image operation and perform binary classification, which is impractical and fails to detect other operations. To detect various image operations, in this paper, we propose a carefully crafted CNN model that learns features from magnified images and performs multi-classification automatically. First, the images are magnified by nearest neighbor interpolation in the preprocessing layer; the properties of image operations are well preserved by this nearest up-sampling. Then, hierarchical representations of different operations are learned via two multi-scale convolutional layers. After that, the well-known mlpconv layers are used to enhance the whole architecture's nonlinear modeling ability and finally derive the feature map. Furthermore, shortcut connections between mlpconv layers allow increasing the depth of the network while reducing information loss. We present comprehensive experiments on 6 typical image operations. The results show that the proposed method performs well in both binary and multi-class detection.","PeriodicalId":142091,"journal":{"name":"2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129151220","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Min-max IIR filter design for feedback quantizers","authors":"S. Ohno, M. Tariq, M. Nagahara","doi":"10.1109/APSIPA.2017.8282157","DOIUrl":"https://doi.org/10.1109/APSIPA.2017.8282157","url":null,"abstract":"In networked control systems, transmitted data should be quantized into a relatively small number of bits if the rate of the communication channel is not sufficiently high. We propose a feedback quantizer for an implementable simple quantizer with high precision for networked control. The infinite impulse response (IIR) feedback filter is designed to mitigate the effect of the quantization error under a performance constraint of the feedback control system. Then, the minimum rate that achieves the constraint is numerically obtained. Simulations are provided to show the effectiveness of the proposed quantizer in a networked feedback control system.","PeriodicalId":142091,"journal":{"name":"2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125660115","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Emotional statistical parametric speech synthesis using LSTM-RNNs","authors":"Shumin An, Zhenhua Ling, Lirong Dai","doi":"10.1109/APSIPA.2017.8282282","DOIUrl":"https://doi.org/10.1109/APSIPA.2017.8282282","url":null,"abstract":"This paper studies methods for emotional statistical parametric speech synthesis (SPSS) using recurrent neural networks (RNNs) with long short-term memory (LSTM) units. Two modeling approaches, i.e., emotion-dependent modeling and unified modeling with emotion codes, are implemented and compared experimentally. In the first approach, LSTM-RNN-based acoustic models are built separately for each emotion type. A speaker-independent acoustic model estimated using speech data from multiple speakers is adopted to initialize the emotion-dependent LSTM-RNNs. Inspired by the speaker code techniques developed for speech recognition and speech synthesis, the second approach builds a unified LSTM-RNN-based acoustic model using the training data of a variety of emotion types. In the unified LSTM-RNN model, an emotion code vector is input to all model layers to indicate the emotion characteristics of the current utterance. Experimental results on an emotional speech synthesis database with four emotion types (neutral style, happiness, anger, and sadness) show that both approaches achieve significantly better naturalness of synthetic speech than HMM-based emotion-dependent modeling. The emotion-dependent modeling approach outperforms the unified modeling approach and the HMM-based emotion-dependent modeling in terms of the subjective emotion classification rates for synthetic speech. Furthermore, the emotion codes used by the unified modeling approach are capable of controlling the emotion type and intensity of synthetic speech effectively by interpolating and extrapolating the codes in the training set.","PeriodicalId":142091,"journal":{"name":"2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"274 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132958509","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Searchable encryption of image based on secret sharing scheme","authors":"A. Kamal, Keiichi Iwamura, Hyunho Kang","doi":"10.1109/APSIPA.2017.8282269","DOIUrl":"https://doi.org/10.1109/APSIPA.2017.8282269","url":null,"abstract":"Searchable encryption is a technique applied in cryptography that allows specific information in an encrypted content to be searched. The implementation of searchable encryption of images in cloud-based systems with multiple users allows each user to benefit from cloud computing, while the privacy and security of each content of a user cannot be breached by the other users. This is realized by distributing each image using our proposed secret sharing scheme to ensure that only the owner of the encrypted content is able to access it. In this paper, we describe the implementation method and the realization of searchable image encryption in a real-world application.","PeriodicalId":142091,"journal":{"name":"2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130888991","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Grid-free compressive beamforming using a single moving sensor of known trajectory","authors":"Y. Ang, Nam Nguyen, J. P. Lie, W. Gan","doi":"10.1109/APSIPA.2017.8282046","DOIUrl":"https://doi.org/10.1109/APSIPA.2017.8282046","url":null,"abstract":"Recently, the grid-free compressive sensing (GFCS) approach was proposed to perform direction of arrival (DOA) estimation of sources. With the advancement of estimation techniques using a single sensor with a known trajectory, it is proposed that a GFCS method can be extended to achieve grid- free two-dimensional localization. Through the trajectory of the sensor, the proposed approach extracts the spatial information by first reformulating the single-channel signal into multiple waveforms, where each group of consecutive waveforms satisfying the quasi-stationary condition can be constructed into a virtual array called the sub one sensor array (SOSA). The DOA of the source with respect to each SOSA is then estimated with GFCS. Accordingly, the final location of the source is computed as the point that minimizes the mean square distance to all DOA lines. Numerical and experimental results demonstrate that the proposed approach is able to perform grid-free localization of a sound source.","PeriodicalId":142091,"journal":{"name":"2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"86 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127916761","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}