{"title":"Double-DCCCAE: Estimation of Body Gestures From Speech Waveform","authors":"Jinhong Lu, Tianhang Liu, Shuzhuang Xu, H. Shimodaira","doi":"10.1109/ICASSP39728.2021.9414660","DOIUrl":"https://doi.org/10.1109/ICASSP39728.2021.9414660","url":null,"abstract":"This paper presents an approach for body-motion estimation from audio-speech waveform, where context information in both input and output streams is taken in to account without using recurrent models. Previous works commonly use multiple frames of input to estimate one frame of motion data, where the temporal information of the generated motion is little considered. To resolve the problems, we extend our previous work and propose a system, double deep canonical-correlation-constrained autoencoder (D-DCCCAE), which encodes each of speech and motion segments into fixed-length embedded features that are well correlated with the segments of the other modality. The learnt motion embedded feature is estimated from the learnt speech-embedded feature through a simple neural network and further decoded back to the sequential motion. The proposed pair of embedded features showed higher correlation than spectral features with motion data, and our model was more preferred than the baseline model (BA) in terms of human-likeness and comparable in terms of similar appropriateness.","PeriodicalId":347060,"journal":{"name":"ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122187130","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Non-Convex Sparse Deviation Modeling Via Generative Models","authors":"Yaxi Yang, Hailin Wang, Haiquan Qiu, Jianjun Wang, Yao Wang","doi":"10.1109/ICASSP39728.2021.9414170","DOIUrl":"https://doi.org/10.1109/ICASSP39728.2021.9414170","url":null,"abstract":"In this paper, the generative model is used to introduce the structural properties of the signal to replace the common sparse hypothesis, and a non-convex compressed sensing sparse deviation model based on the generative model (ℓq-Gen) is proposed. By establishing ℓq variant of the restricted isometry property (q-RIP) and Set-Restricted Eigenvalue Condition (q-S-REC), the error upper bound of the optimal decoder is derived when the recovered signal is within the sparse deviation range of the generator. Furthermore, it is proved that the Gaussian matrix satisfying a certain number of measurements is sufficient to ensure a good recovery for the generating function with high probability. Finally, a series of experiments are carried out to verify the effectiveness and superiority of the ℓq-Gen model.","PeriodicalId":347060,"journal":{"name":"ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117016966","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Adaptive Pyramid Single-View Depth Lookup Table Coding Method","authors":"Yangang Cai, Ronggang Wang, Song Gu, Jian Zhang, Wen Gao","doi":"10.1109/ICASSP39728.2021.9414584","DOIUrl":"https://doi.org/10.1109/ICASSP39728.2021.9414584","url":null,"abstract":"As depth maps show unique characteristics like piecewise smooth regions bounded by sharp edges at depth discontinuities, new coding tools are required to approximate these signal characteristics. Moreover, the number of bits to signal the residual values for each segment can be further reduced by integrating a Depth Lookup Table (DLT), which maps depth values to valid depth values of the original depth map. The DLT is constructed based on an initial analysis of the input depth map and is then coded in the sequence header. In this paper, an adaptive pyramid single-view depth lookup table coding method is proposed, with the purpose of designing a clean syntax structure in the sequence header with reasonably good performance. Experiments show that the proposed method can reduce about 84.97% coding bits on average.","PeriodicalId":347060,"journal":{"name":"ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117270011","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Instrument Classification of Solo Sheet Music Images","authors":"Kevin Ji, Daniel Yang, T. Tsai","doi":"10.1109/ICASSP39728.2021.9413732","DOIUrl":"https://doi.org/10.1109/ICASSP39728.2021.9413732","url":null,"abstract":"This paper studies instrument classification of solo sheet music. Whereas previous work has focused on instrument recognition in audio data, we instead approach the instrument classification problem using raw sheet music images. Our approach first converts the sheet music image into a sequence of musical words based on the bootleg score representation, and then treats the problem as a text classification task. We show that it is possible to significantly improve classifier performance by training a language model on unlabeled data, initializing a classifier with the pretrained language model weights, and then finetuning the classifier on labeled data. In this work, we train AWD-LSTM, GPT-2, and RoBERTa models on solo sheet music images from IMSLP for eight different instruments. We find that GPT-2 and RoBERTa slightly outperform AWD-LSTM, and that pretraining increases classification accuracy for RoBERTa from 34.5% to 42.9%. Furthermore, we propose two data augmentation methods that increase classification accuracy for RoBERTa by an additional 15%.","PeriodicalId":347060,"journal":{"name":"ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128638017","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Real-Time Radio Modulation Classification With An LSTM Auto-Encoder","authors":"Ziqi Ke, H. Vikalo","doi":"10.1109/ICASSP39728.2021.9414351","DOIUrl":"https://doi.org/10.1109/ICASSP39728.2021.9414351","url":null,"abstract":"Identifying modulation type of a received radio signal is a challenging problem encountered in many applications including radio interference mitigation and spectrum allocation. This problem is rendered challenging by the existence of a large number of modulation schemes and numerous sources of interference. Existing methods for monitoring spectrum readily collect large amounts of radio signals. However, existing state-of-the-art approaches to modulation classification struggle to reach desired levels of accuracy with computational efficiency practically feasible for implementation on low-cost computational platforms. To this end, we propose a learning framework based on an LSTM denoising autoencoder designed to extract robust and stable features from the noisy received signals, and detect the underlying modulation scheme. The method uses a compact architecture that may be implemented on low-cost computational devices while achieving or exceeding state-of-the-art classification accuracy. Experimental results on realistic synthetic and over-the-air radio data show that the proposed framework reliably and efficiently classifies radio signals, and often significantly outperform state-of-the-art approaches.","PeriodicalId":347060,"journal":{"name":"ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"84 Pt 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129006633","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Applied Methods for Sparse Sampling of Head-Related Transfer Functions","authors":"Lior Arbel, Z. Ben-Hur, D. Alon, B. Rafaely","doi":"10.1109/ICASSP39728.2021.9413976","DOIUrl":"https://doi.org/10.1109/ICASSP39728.2021.9413976","url":null,"abstract":"Production of high fidelity spatial audio applications requires individual head-related transfer functions (HRTFs). As the acquisition of HRTF is an elaborate process, interest lies in interpolating full length HRTF from sparse samples. Ear-alignment is a recently developed pre-processing technique, shown to reduce an HRTF’s spherical harmonics order, thus permitting sparse sampling over fewer directions. This paper describes the application of two methods for ear-aligned HRTF interpolation by sparse sampling: Orthogonal Matching Pursuit and Principal Component Analysis. These methods consist of generating unique vector sets for HRTF representation. The methods were tested over an HRTF dataset, indicating that interpolation errors using small sampling schemes may be further reduced by up to 5 dB in comparison with spherical harmonics interpolation.","PeriodicalId":347060,"journal":{"name":"ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"94 2 Pt 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129454743","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-Scale Residual Network for Covid-19 Diagnosis Using Ct-Scans","authors":"Pratyush Garg, R. Ranjan, Kamini Upadhyay, M. Agrawal, D. Deepak","doi":"10.1109/ICASSP39728.2021.9414426","DOIUrl":"https://doi.org/10.1109/ICASSP39728.2021.9414426","url":null,"abstract":"To mitigate the outbreak of highly contagious COVID-19, we need a sensitive, robust automated diagnostic tool. This paper proposes a three-level approach to separate the cases of COVID-19, pneumonia from normal patients using chest CT scans. At the first level, we fine tune a multi-scale ResNet50 model for feature extraction from all the slices of CT scan for each patient. By using multi-scale residual network, we can learn different sizes of infection, thereby making the detection possible at early stages too. These extracted features are used to train a patient-level classifier, at the second level. Four different classifiers are trained at this stage. Finally, predictions of patient level classifiers are combined by training an ensemble classifier. We test the proposed method on three sets of data released by ICASSP, COVID-19 Signal Processing Grand Challenge (SPGC). The proposed method has been successful in classifying the three classes with a validation accuracy of 94.9% and testing accuracy of 88.89%.","PeriodicalId":347060,"journal":{"name":"ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128991469","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving Dialogue Response Generation Via Knowledge Graph Filter","authors":"Yanmeng Wang, Ye Wang, Xingyu Lou, Wenge Rong, Zhenghong Hao, Shaojun Wang","doi":"10.1109/ICASSP39728.2021.9414324","DOIUrl":"https://doi.org/10.1109/ICASSP39728.2021.9414324","url":null,"abstract":"Current generative dialogue systems tend to produce generic dialog responses, which lack useful information and semantic coherence. An promising method to alleviate this problem is to integrate knowledge triples from knowledge base. However, current approaches mainly augment Seq2Seq framework with knowledge-aware mechanism to retrieve a large number of knowledge triples without considering specific dialogue context, which probably results in knowledge redundancy and incomplete knowledge comprehension. In this paper, we propose to leverage the contextual word representation of dialog post to filter out irrelevant knowledge with an attention-based triple filter network. We introduce a novel knowledge-enriched framework to integrate the filtered knowledge into the dialogue representation. Entity copy is further proposed to facilitate the integration of the knowledge during generation. Experiments on dialogue generation tasks have shown the proposed framework’s promising potential.","PeriodicalId":347060,"journal":{"name":"ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123827616","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Perceptual Quality Assessment for Recognizing True and Pseudo 4k Content","authors":"Wenhan Zhu, Guangtao Zhai, Xiongkuo Min, Xiaokang Yang, Xiao-Ping Zhang","doi":"10.1109/ICASSP39728.2021.9414932","DOIUrl":"https://doi.org/10.1109/ICASSP39728.2021.9414932","url":null,"abstract":"To meet the imperative demand for monitoring the quality of Ultra High-Definition (UHD) content in multimedia industries, we propose an efficient no-reference (NR) image quality assessment (IQA) metric to distinguish original and pseudo 4K contents and measure the quality of their quality in this paper. First, we establish a database including more than 3000 4K images composed of natural 4K images together with upscaled versions interpolated from 1080p and 720p images by fourteen algorithms. To improve computing efficiency, our model segments the input image and selects three representative patches by local variances. Then, we extract the histogram features and cut-off frequency features in the frequency domain as well as the natural scenes statistic (NSS) based features from the representative patches. Finally, we employ support vector regressor (SVR) to aggregate these extracted features as an overall quality metric to predict the quality score of the target image. Extensive experimental comparisons using seven common evaluation indicators demonstrate that the proposed model outperforms the competitive NR IQA methods and has a great ability to distinguish true and pseudo 4K images.","PeriodicalId":347060,"journal":{"name":"ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124224000","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Checking PRNU Usability on Modern Devices","authors":"C. Albisani, Massimo Iuliani, Alessandro Piva","doi":"10.1109/ICASSP39728.2021.9413611","DOIUrl":"https://doi.org/10.1109/ICASSP39728.2021.9413611","url":null,"abstract":"The image source identification task is mainly addressed by exploiting the unique traces of the sensor pattern noise, that ensure a negligible false alarm rate when comparing patterns extracted from different devices, even of the same brand or model. However, most recent smartphones are equipped with proprietary in-camera processing that can possibly expose unexpected correlated patterns within images belonging to different sensors.In this paper, we first highlight that wrong source attribution can happen on smartphones belonging to the same brand when images are acquired both in default and in bokeh mode. While the bokeh mode is proved to introduce a correlated pattern due to the specific in-camera post-processing, we also show that natural images also expose such issue, even when a reference from flat images is available. Furthermore, different camera models expose different correlation patterns since they are reasonably related to developers’ choices. Then, we propose a general strategy that allows the forensic practitioner to determine whether a questioned device may suffer from these correlated patterns, thus avoiding the risk of false image attribution.","PeriodicalId":347060,"journal":{"name":"ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121192517","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}