{"title":"Towards Scalable Retrieval of Human Motion Episodes","authors":"Petra Budíková, J. Sedmidubský, J. Horvath, P. Zezula","doi":"10.1109/ISM.2020.00015","DOIUrl":"https://doi.org/10.1109/ISM.2020.00015","url":null,"abstract":"With the increasing availability of human motion data captured in the form of 2D/3D skeleton sequences, more complex motion recordings need to be processed. In this paper, we focus on the similarity-based retrieval of motion episodes - medium-sized skeleton sequences that consist of multiple semantic actions and correspond to some logical motion unit (e.g., a figure skating performance). We examine two orthogonal approaches to the episode-matching task: (1) the deep learning approach that is traditionally used for processing short motion actions, and (2) the motion-word technique that transforms skeleton sequences into a text-like representation. Since the second approach is more promising, we propose a two-phase retrieval scheme that combines mature text-processing techniques with application-specific refinement methods. We demonstrate that this solution achieves promising results in both effectiveness and efficiency, and can be further indexed to implement scalable episode retrieval.","PeriodicalId":120972,"journal":{"name":"2020 IEEE International Symposium on Multimedia (ISM)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121850938","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MPEG-DASH users quality of experience enhancement for MOOC videos","authors":"D. Sebai, Emna Mani","doi":"10.1109/ISM.2020.00036","DOIUrl":"https://doi.org/10.1109/ISM.2020.00036","url":null,"abstract":"The Dynamic Adaptive Streaming over HTTP (MPEG-DASH) ensures online videos display of good quality and without interruption. It provides an adequate streaming for each display device and network transmission. This can be very useful for the specific field of Massive Open Online Courses (MOOCs) where learners profit from an exceptional visual experience that improves their commitment level and eases the course assimilation. These MPEG-DASH assets can become more and more advantageous if a good choice of its parameters is made. Being a recent branch, the MPEG-DASH adaptive diffusion presents a research field where the efforts are still limited, even more for MOOC videos. Most of the work published in this sense focus on the Quality of Service (QoS) and the technical specifications of the network transmission. In this paper, we aim to consider the quality of the streamed content that directly impacts the learners quality of Experience (QoE). For this, we develop a content-aware dataset that includes several dashified MOOC videos. These latter are then exploited to study the most appropriate bitrates and segment durations for each type of MOOC videos.","PeriodicalId":120972,"journal":{"name":"2020 IEEE International Symposium on Multimedia (ISM)","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117288225","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FID: Frame Interpolation and DCT-based Video Compression","authors":"Yeganeh Jalalpour, Li-Yun Wang, W. Feng, Feng Liu","doi":"10.1109/ISM.2020.00045","DOIUrl":"https://doi.org/10.1109/ISM.2020.00045","url":null,"abstract":"In this paper, we present a hybrid video compression technique that combines the advantages of residual coding techniques found in traditional DCT-based video compression and learning-based video frame interpolation to reduce the amount of residual data that needs to be compressed. Learning-based frame interpolation techniques use machine learning algorithms to predict frames but have difficulty with uncovered areas and non-linear motion. This approach uses DCT-based residual coding only on areas that are difficult for video interpolation and provides tunable compression for such areas through an adaptive selection of data to be encoded. Experimental data for both PSNR and the newer video multi-method assessment fusion (VMAF) metrics are provided. Our results show that we can reduce the amount of data required to represent a video stream compared with traditional video coding while outperforming video frame interpolation techniques in quality.","PeriodicalId":120972,"journal":{"name":"2020 IEEE International Symposium on Multimedia (ISM)","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116757427","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Audio Captioning Based on Combined Audio and Semantic Embeddings","authors":"Aysegül Özkaya Eren, M. Sert","doi":"10.1109/ISM.2020.00014","DOIUrl":"https://doi.org/10.1109/ISM.2020.00014","url":null,"abstract":"Audio captioning is a recently proposed task for automatically generating a textual description of a given audio clip. Most existing approaches use the encoder-decoder model without using semantic information. In this study, we propose a bi-directional Gated Recurrent Unit (BiGRU) model based on encoder-decoder architecture using audio and semantic embed-dings. To obtain semantic embeddings, we extract subject-verb embeddings using the subjects and verbs from the audio captions. We use a Multilayer Perceptron classifier to predict subject-verb embeddings of test audio clips for the testing stage. Within the aim of extracting audio features, in addition to log Mel energies, we use a pretrained audio neural network (PANN) as a feature extractor which is used for the first time in the audio captioning task to explore the usability of audio embeddings in the audio captioning task. We combine audio embeddings and semantic embeddings to feed the BiGRU-based encoder-decoder model. Following this, we evaluate our model on two audio captioning datasets: Clotho and AudioCaps. Experimental results show that the proposed BiGRU-based deep model significantly outperforms the state of the art results across different evaluation metrics and inclusion of semantic information enhance the captioning performance.","PeriodicalId":120972,"journal":{"name":"2020 IEEE International Symposium on Multimedia (ISM)","volume":"348 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115231046","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-view Neural Networks for Raw Audio-based Music Emotion Recognition","authors":"Na He, Sam Ferguson","doi":"10.1109/ISM.2020.00037","DOIUrl":"https://doi.org/10.1109/ISM.2020.00037","url":null,"abstract":"In Music Emotion Recognition (MER) research, most existing research uses human engineered audio features as learning model inputs, which require domain knowledge and much effort for feature extraction. We propose a novel end-to-end deep learning approach to address music emotion recognition as a regression problem, using the raw audio signal as input. We adopt multi-view convolutional neural networks as feature extractors to learn feature representations automatically. Then the extracted feature vectors are merged and fed into two layers of Bidirectional Long Short-Term Memory to capture temporal context sufficiently. In this way, our model is capable of recognizing dynamic music emotion without requiring too much workload on domain knowledge learning and audio feature processing. Combined with data augmentation strategies, the experimental results show that our model outperforms the state-of-the-art baseline with a significant margin in terms of R2 score (approximately 16%) on the Emotion in Music Database.","PeriodicalId":120972,"journal":{"name":"2020 IEEE International Symposium on Multimedia (ISM)","volume":"59 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130236258","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Vid2Pix - A Framework for Generating High-Quality Synthetic Videos","authors":"O. O. Nedrejord, Vajira Lasantha Thambawita, S. Hicks, P. Halvorsen, M. Riegler","doi":"10.1109/ISM.2020.00010","DOIUrl":"https://doi.org/10.1109/ISM.2020.00010","url":null,"abstract":"Data is arguably the most important resource today as it fuels the algorithms powering services we use every day. However, in fields like medicine, publicly available datasets are few, and labeling medical datasets require tedious efforts from trained specialists. Generated synthetic data can be to future successful healthcare clinical intelligence. Here, we present a GAN-based video generator demonstrating promising results.","PeriodicalId":120972,"journal":{"name":"2020 IEEE International Symposium on Multimedia (ISM)","volume":"214 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124212021","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dynamic Segment Repackaging at the Edge for HTTP Adaptive Streaming","authors":"Jesús Aguilar Armijo, Babak Taraghi, C. Timmerer, H. Hellwagner","doi":"10.1109/ISM.2020.00009","DOIUrl":"https://doi.org/10.1109/ISM.2020.00009","url":null,"abstract":"Adaptive video streaming systems typically support different media delivery formats, e.g., MPEG-DASH and HLS, replicating the same content multiple times into the network. Such a diversified system results in inefficient use of storage, caching, and bandwidth resources. The Common Media Application Format (CMAF) emerges to simplify HTTP Adaptive Streaming (HAS), providing a single encoding and packaging format of segmented media content and offering the opportunities of bandwidth savings, more cache hits, and less storage needed. However, CMAF is not yet supported by most devices. To solve this issue, we present a solution where we maintain the main advantages of CMAF while supporting heterogeneous devices using different media delivery formats. For that purpose, we propose to dynamically convert the content from CMAF to the desired media delivery format at an edge node. We study the bandwidth savings with our proposed approach using an analytical model and simulation, resulting in bandwidth savings of up to 20% with different media delivery format distributions. We analyze the runtime impact of the required operations on the segmented content performed in two scenarios: (i) the classic one, with four different media delivery formats, and (ii) the proposed scenario, using CMAF-only delivery through the network. We compare both scenarios with different edge compute power assumptions. Finally, we perform experiments in a real video streaming testbed delivering MPEG-DASH using CMAF content to serve a DASH and an HLS client, performing the media conversion for the latter one.","PeriodicalId":120972,"journal":{"name":"2020 IEEE International Symposium on Multimedia (ISM)","volume":"230 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133578238","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Extraction of Frame Sequences in the Manga Context","authors":"Christian Roggia, Fabio Persia","doi":"10.1109/ISM.2020.00023","DOIUrl":"https://doi.org/10.1109/ISM.2020.00023","url":null,"abstract":"Manga are one of the most popular forms of comics consumed on a global level. Unfortunately, this kind of media was not designed for digital consumption, and consequently its format does not fit well into small areas, such as smartphone screens. In order to cope with this issue, in this paper we propose a novel approach to comics segmentation and sequencing by taking advantage of existing machine learning concepts which are used to generate an artificial intelligence (AI) capable of correctly detecting panels within an image. The regions proposed by the AI are then used to generate a grid that acts as anchor points for a mobile application guiding the reader during navigation and enabling full Manga responsiveness. The developed approach achieves overall better performances in terms of precision and recall, as well as higher fault tolerance than state-of-the-art approaches. The reliability of this method is also considered largely satisfactory for real-world scenarios, so that we are about to finalize an app implementing the method to be spread soon; additionally, future work will be devoted to generalize our approach to all the comics formats.","PeriodicalId":120972,"journal":{"name":"2020 IEEE International Symposium on Multimedia (ISM)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128780370","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Audio Steganography Algorithm Based on Genetic Algorithm for MDCT Coefficient Adjustment for AAC","authors":"Chen Li, Xiaodong Zhang, Tao Luo, Lihua Tian","doi":"10.1109/ISM.2020.00026","DOIUrl":"https://doi.org/10.1109/ISM.2020.00026","url":null,"abstract":"An AAC steganography algorithm based on genetic algorithm and MDCT coefficient adjustment is proposed. Our algorithm selects the small value region of MDCT coefficient as the embedding bit and the coefficients in codebook 1/2 are designed to change. In order to be against steganalysis better, genetic algorithm is used to optimize the change of the coefficient. The experiment results show that the algorithm has good embedding capacity, high steganography and good imperceptibility.","PeriodicalId":120972,"journal":{"name":"2020 IEEE International Symposium on Multimedia (ISM)","volume":"90 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121108791","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Melody-Conditioned Lyrics Generation with SeqGANs","authors":"Yihao Chen, Alexander Lerch","doi":"10.1109/ISM.2020.00040","DOIUrl":"https://doi.org/10.1109/ISM.2020.00040","url":null,"abstract":"Automatic lyrics generation has received attention from both music and AI communities for years. Early rule-based approaches have -due to increases in computational power and evolution in data-driven modelsmostly been replaced with deep-learning-based systems. Many existing approaches, however, either rely heavily on prior knowledge in music and lyrics writing or oversimplify the task by largely discarding melodic information and its relationship with the text. We propose an end-to-end melody-conditioned lyrics generation system based on Sequence Generative Adversarial Networks (SeqGAN), which generates a line of lyrics given the corresponding melody as the input. Furthermore, we investigate the performance of the generator with an additional input condition: the theme or overarching topic of the lyrics to be generated. We show that the input conditions have no negative impact on the evaluation metrics while enabling the network to produce more meaningful results.","PeriodicalId":120972,"journal":{"name":"2020 IEEE International Symposium on Multimedia (ISM)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125201940","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}