{"title":"Domain Adaptation and Language Conditioning to Improve Phonetic Posteriorgram Based Cross-Lingual Voice Conversion","authors":"Pin-Chieh Hsu, N. Minematsu, D. Saito","doi":"10.23919/APSIPAASC55919.2022.9979918","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9979918","url":null,"abstract":"In this work, we examine two methods for improving phonetic posteriorgram (PPG) based cross-lingual voice conversion (CLVC). Previous research usually utilized a speaker encoder to characterize speakers' identity; however, the speaker embedding learned by the previous model tends to be language-dependent, degrading the performance of converted speeches. Therefore, we propose using the technique of domain-adversarial training. With this approach, the speaker embedding in different languages can be adapted into the same distribution to form a language-independent speaker embedding space. The other approach we propose is to employ external language conditioning to support our model to disentangle the language information from the speaker embedding. In our experiments, both methods are evaluated on a Japanese-English bilingual database. Besides subjective evaluation, two automatic objective assessment systems are adopted to assess the quality and speaker similarity of converted utterances. 
According to the experimental results, the two proposed methods can generate speaker embedding with reduced language dependency and improve the naturalness and speaker similarity of converted speeches.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"103 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114278561","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Physiological study on the effect of game events in response to player's laughter","authors":"Mikito Fukuda, Y. Arimoto","doi":"10.23919/APSIPAASC55919.2022.9979868","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9979868","url":null,"abstract":"To investigate whether a computer's automatic responses to our emotional expressions influence our cognitive and emotional involvement in a virtual world, this study measured the player's physiological reactions to game events presented in response to the player's spontaneous laughter. Participants played two conditional virtual games in our experiments, and their electrocardiogram, electrodermal activity, and facial electromyography (corrugator supercilii muscle and zygomaticus major muscle) were recorded during the games. The experiment consisted of two conditions, namely an advantageous event condition and a disadvantageous event condition. In the advantageous event condition, the system responded to the player's laughter with an event that benefitted the player. In the disadvantageous event condition, the system responded to the player's laughter with an event that annoyed the player. A three-way analysis of variance was performed using these physiological signals to test the hypothesis that there is time-series variation in physiological responses between both event types and event durations. As a result, a significantly slower heart rate was observed after the presentation of an event in both the advantageous and disadvantageous event conditions. This result suggests that the players paid more attention to the game when any event was generated in response to their laughter. Moreover, both types of events in response to the player's laughter more strongly activated electrodermal activity and the corrugator supercilii muscle. In particular, the disadvantageous events in response to the player's laughter activated the corrugator supercilii muscle more than the advantageous events. 
These results suggest that players were more emotionally engaged in the game when they encountered troublesome or fortunate situations while laughing.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128655295","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Leveraging Pre-Trained Acoustic Feature Extractor For Affective Vocal Bursts Tasks","authors":"Bagus Tris Atmaja, A. Sasou","doi":"10.23919/APSIPAASC55919.2022.9980083","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9980083","url":null,"abstract":"Understanding humans' emotions is a challenge for computers. Research on speech emotion recognition has been progressing steadily. Instead of speech, affective information may lie in short vocal bursts (e.g., crying when sad). In this study, we evaluated a recent self-supervised learning model to extract acoustic embeddings for affective vocal bursts tasks. Four tasks are investigated, covering both regression and classification problems. Using similar architectures, we found that a pre-trained model is more effective than the baseline methods. The study is further extended to evaluate the effect of different numbers of seeds, patience values, and batch sizes on the performance of the four tasks.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"326 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129445227","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimal Deep Multi-Route Self-Attention for Single Image Super-Resolution","authors":"Nisawan Ngambenjavichaikul, Sovann Chen, S. Aramvith","doi":"10.23919/APSIPAASC55919.2022.9979962","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9979962","url":null,"abstract":"Image restoration, such as single image super-resolution (SISR), is a long-established low-level vision problem that aims to regenerate high-resolution (HR) images from low-resolution (LR) input counterparts. While state-of-the-art image super-resolution models are based on the well-known convolutional neural network (CNN), many self-attention-based or transformer-based approaches have been tried and have shown promising performance on vision problems. A powerful baseline model based on the Swin Transformer adopts the shifted window approach. It enhances the capability by restricting the model to compute the self-attention function only on non-overlapping local windows while enabling cross-window relations. However, the architecture design is manually fixed. Therefore, the results do not achieve optimal performance. This paper presents an optimal deep multi-route self-attention network for single image super-resolution (ODMR-SASR). The genetic algorithm (GA) is introduced to discover the optimal number of filters and layers. 
Experimental results demonstrate that the proposed optimization technique can produce a progressive SR image quality.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"46 24","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113974158","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Clustering of advertising images using electroencephalogram","authors":"Ingon Chanpornpakdi, Motoi Noda, Toshihisa Tanaka, Yuval Harpaz, A. Geva","doi":"10.23919/APSIPAASC55919.2022.9980161","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9980161","url":null,"abstract":"Packaging and advertisements of brands affect customers' decision-making on purchasing products and could lead to business loss. Hence, neuromarketing, the application of neuroscience in the marketing field, is introduced with the aim of understanding customers' cognitive functions toward advertisements or products. Our study focused on identifying how different types of advertising images of the same brand are perceived, using electroencephalogram (EEG). We performed an experiment using 33 different Coca-Cola advertising images in an RSVP (rapid serial visual presentation) task on 23 participants. A seven-channel dry EEG headset was used to record the visual event-related potential (ERP), specifically the P300, the positive peak found 300 to 700 ms after image onset, to compare the perception response. We applied k-means and hierarchical clustering to the obtained EEG data, and achieved the best clustering for three clusters, yielding different P300 amplitudes and latencies. The typical Coca-Cola ads, red in color with Coca-Cola text on the ads, induced a faster and larger response, implying better perception than the unconventional or black ads. We conclude that ERP clustering may be a useful tool for neuromarketing. 
However, the relationship between the EEG-based cluster and the image-based cluster should be further investigated to confirm the suggestion.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132213783","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Design and Control of a Muscle-skeleton Robot Elbow based on Reinforcement Learning","authors":"Jianyin Fan, Haoran Xu, Yuwei Du, Jing Jin, Qiang Wang","doi":"10.23919/APSIPAASC55919.2022.9980219","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9980219","url":null,"abstract":"The muscle-skeleton body structure and learning ability allow natural creatures to adapt to the complex environment. These can also make robots more adaptive in human-robot interaction scenarios. In this work, we implement a humanoid muscle-skeleton robot elbow joint actuated by two antagonistic pneumatic artificial muscles (PAMs). A reinforcement learning algorithm based on soft actor-critic (SAC) is adopted to learn the control policy of the proposed elbow joint. Lower action space and hindsight experience replay (HER) further reduce training time, and the temperature factor is fixed during the training process for small steady-state error. An elbow model is implemented in the simulation to verify the training procedure for our real robot elbow platform. The experimental results show that the RL learning procedure can learn control policies in the robot elbow prototype, and the steady-state error is within 0.64% after 1 s of control time.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"343 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134202147","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-Branch Network for Few-shot Learning","authors":"Kai Ren, Zijie Guo, Zhimin Zhang, Rui Zhu, Xiaoxu Li","doi":"10.23919/APSIPAASC55919.2022.9980160","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9980160","url":null,"abstract":"Few-shot learning aims to provide precise predictions for unseen data through learning from only one or a few labelled samples of each class. However, it often suffers from the overfitting problem because of insufficient training data. In this paper, we propose a novel metric-based few-shot learning method, multi-branch network (MBN), with a new data augmentation module to improve the generalization ability of the model. Specifically, we generate different types of noise-contaminated data through multiple branches in the network to simulate the real-world scenarios when noisy images are obtained. Following this novel data augmentation module, the feature embedding and similarities between the support and query samples are learned simultaneously through the embedding and metric modules, respectively. Moreover, to consider more details in the feature maps, we propose to utilize the average-pooling layer in the metric module rather than the commonly adopted max-pooling layer. The network is trained from end to end by the Kullback-Leibler (KL) divergence, to minimize the difference between the distributions of the ground truths and predictions. 
Extensive experiments on Stanford-Dogs, Stanford-Cars, CUB-200-2011 and mini-ImageNet in the 1-shot and 5-shot tasks demonstrate the superior classification performance of MBN.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131498888","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Sound Reproduction with a Circular Loudspeaker Array Using Differential Beamforming Method","authors":"Yankai Zhang, Jiayi Mao, Yefeng Cai, C. Ye","doi":"10.23919/APSIPAASC55919.2022.9980128","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9980128","url":null,"abstract":"This paper proposes an approach to obtain a frequency-invariant, symmetric beampattern using a compact circular loudspeaker array. The Jacobi-Anger expansion method is used to approximate the target beampattern. The simulated performance of the same circular loudspeaker array is compared with and without a rigid baffle. The analytical solution of the weight and the simulation results show that the circular loudspeaker array with a rigid baffle can overcome the null problem confronting the array without a rigid baffle. The minimum-norm filter is used to improve the robustness of the system and maintain the frequency-invariant beampattern over the frequency range of interest.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"20 3","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131775067","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Multiframe Super-resolution Pipeline for Sub-image-typed Light Field Data","authors":"Chien-Han Hsu, Yi-Hsien Lin, Yen-Po Lin, Yi-Chang Lu","doi":"10.23919/APSIPAASC55919.2022.9980305","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9980305","url":null,"abstract":"Due to the trade-off between spatial and angular resolutions in light field cameras, the obtained resolutions of synthesized 2D images are often far less than those captured by conventional digital cameras using the same image sensor. This work proposes a complete digital image processing pipeline for hand-held light field cameras to generate high-resolution all-in-focus 2D images. The flow contains refined disparity estimation, digital refocusing, and super-resolution stages in which the characteristics of light fields are considered. We adopt the efficient first-order primal-dual algorithm as our optimization tool. The results show that the proposed approach gives better image quality when compared to other existing super-resolution methods.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"60 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129393268","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Table Structure Recognition Based on Grid Shape Graph","authors":"Eunji Lee, Junhyeong Kwon, Haeyoon Yang, Jaewoo Park, Soonyoung Lee, H. Koo, N. Cho","doi":"10.23919/APSIPAASC55919.2022.9980172","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9980172","url":null,"abstract":"Since tables in documents provide important information in compact form, table understanding has been an essential topic in document image processing. Researchers have represented table structures in various formats for table understanding, such as a simple grid structure, a graph with text/cell boxes as nodes, or a sequence of HTML tokens. However, these approaches have difficulties in handling regularities, e.g., global row and column information, and spanning cells simultaneously. In this paper, we propose a new table recognition method based on a grid shape graph and present grid localization and grid element grouping networks. This approach is designed to exploit the grid structure and deal with spanning cells. To convert the grid structure into a cell structure, we only have to test adjacent pairs of grid elements, enabling efficient inference. In addition, we have discovered that predicting row/column-based relationships between grid elements improves cell-based connectivity estimation performance. 
We demonstrate the effectiveness of the proposed method through experiments on three benchmark datasets.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130872479","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}