Automatic Identification of Keywords in Lecture Video Segments
Raga Shalini Koka, Farah Naz Chowdhury, Mohammad Rajiur Rahman, T. Solorio, J. Subhlok
2020 IEEE International Symposium on Multimedia (ISM), December 2020. DOI: 10.1109/ISM.2020.00035

Abstract: Lecture video is an increasingly important learning resource. However, the challenge of quickly finding the content of interest in a long lecture video is a critical limitation of this format. This paper introduces automatic discovery of keywords (or tags) for lecture video segments to improve navigation. A lecture video is divided into topical segments based on the frame-to-frame similarity of content. A user navigates the lecture video assisted by visual summaries and keywords for the segments. Keywords provide an overview of the content discussed in the segment to improve navigation. The input to the keyword identification algorithm is the text from the video frames extracted by OCR. Automatically discovering keywords is challenging, as the suitability of an N-gram to be a keyword depends on a variety of factors, including frequency in a segment, relative frequency in reference to the full video, font size, time on screen, and presence in domain and language dictionaries. This paper explores how these factors are quantified and combined to identify good keywords. The key scientific contribution of this paper is the design, implementation, and evaluation of a keyword selection algorithm for lecture video segments. Evaluation is performed by comparing the keywords generated by the algorithm with the tags chosen by experts on 121 segments of 11 videos from STEM courses.
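The abstract names the scoring factors but not how they are combined. Below is a minimal, hypothetical sketch of such a factor-based N-gram scorer; the weights, normalizations, and helper names are illustrative assumptions, not the paper's algorithm.

```python
from dataclasses import dataclass

@dataclass
class NgramStats:
    seg_freq: int        # occurrences in the segment's OCR text
    video_freq: int      # occurrences in the full video's OCR text
    max_font_px: float   # largest observed font size (OCR box height)
    screen_secs: float   # total time the N-gram was visible on screen
    in_dictionary: bool  # appears in a domain or language dictionary

def keyword_score(s: NgramStats, weights=(1.0, 1.0, 0.5, 0.5, 1.0)) -> float:
    """Combine the factors named in the abstract into one score (assumed weights)."""
    w_rel, w_freq, w_font, w_time, w_dict = weights
    # Relative frequency: prominent in the segment vs. the whole video.
    rel = s.seg_freq / max(s.video_freq, 1)
    return (w_rel * rel
            + w_freq * s.seg_freq
            + w_font * s.max_font_px / 40.0   # assumed typical title height
            + w_time * s.screen_secs / 60.0   # assumed segment-scale norm
            + w_dict * (1.0 if s.in_dictionary else 0.0))

stats = {"dynamic programming": NgramStats(7, 9, 38.0, 95.0, True),
         "for example": NgramStats(5, 40, 18.0, 20.0, False)}
ranked = sorted(stats, key=lambda k: keyword_score(stats[k]), reverse=True)
print(ranked)  # the domain term should outrank the generic phrase
```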
Between the Frames - Evaluation of Various Motion Interpolation Algorithms to Improve 360° Video Quality
S. Fremerey, Frank Hofmeyer, Steve Göring, Dominik Keller, A. Raake
2020 IEEE International Symposium on Multimedia (ISM), December 2020. DOI: 10.1109/ISM.2020.00017

Abstract: With the increasing availability of 360° video content, it becomes important to provide smoothly playing videos of high quality for end users. For this reason, we compare the influence of different Motion Interpolation (MI) algorithms on 360° video quality. After conducting a pre-test with 12 video experts in [3], we found that MI is a useful tool to increase the QoE (Quality of Experience) of omnidirectional videos. As a result of the pre-test, we selected three suitable MI algorithms, namely ffmpeg Motion Compensated Interpolation (MCI), Butterflow, and Super-SloMo. Subsequently, we interpolated 15 entertaining and real-world omnidirectional videos with a duration of 20 seconds from 30 fps (original framerate) to 90 fps, which is the native refresh rate of the HMD used, the HTC Vive Pro. To assess QoE, we conducted two subjective tests with 24 and 27 participants. In the first test we used a Modified Paired Comparison (M-PC) method, and in the second test the Absolute Category Rating (ACR) approach. In the M-PC test 45 stimuli were used, and in the ACR test 60. Results show that for most of the 360° videos, the interpolated versions obtained significantly higher quality scores than the lower-framerate source videos, validating our hypothesis that motion interpolation can improve the overall video quality for 360° video. As expected, it was observed that the relative comparisons in the M-PC test result in larger differences in terms of quality. Generally, the ACR method led to similar results, while reflecting a more realistic viewing situation. In addition, we compared the different MI algorithms and can conclude that with sufficient available computing power, Super-SloMo should be preferred for interpolation of omnidirectional videos, while MCI also shows good performance.
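One of the three interpolators compared, ffmpeg's motion-compensated interpolation, is available off the shelf as the stock minterpolate filter. A minimal sketch of the 30 to 90 fps conversion described above; the filter settings and file names are assumptions, not the study's exact configuration.

```python
import subprocess

def interpolate_mci(src: str, dst: str, target_fps: int = 90) -> None:
    """Interpolate a video to target_fps with ffmpeg's MCI mode."""
    subprocess.run([
        "ffmpeg", "-y", "-i", src,
        "-vf", f"minterpolate=fps={target_fps}:mi_mode=mci",
        dst,
    ], check=True)

# Example (assumed file names):
# interpolate_mci("omni_30fps.mp4", "omni_90fps.mp4")
```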
On Subpicture-based Viewport-dependent 360-degree Video Streaming using VVC
Maryam Homayouni, A. Aminlou, M. Hannuksela
2020 IEEE International Symposium on Multimedia (ISM), December 2020. DOI: 10.1109/ISM.2020.00021

Abstract: Virtual reality applications create an immersive experience using 360° video with high resolution and frame rate. However, since the user only views a portion of the 360° video according to his/her current viewport, streaming the whole content at high resolution wastes bandwidth. To address this issue, viewport-dependent approaches have been proposed such that only the part of the video which falls within the user's current viewport is transmitted in high quality, while the rest of the content is transmitted in lower quality. The selection of high- and low-quality parts is constantly adapted according to the user's head motion, which requires frequent intra-coded frames at switching points, leading to an increase in the overall streaming bitrate. In this paper, a viewport-adaptive streaming scheme is introduced which avoids intra frames at switching points by introducing a long intra period for non-changing parts of the content during head motion. This scheme has been realized by taking advantage of the mixed Video Coding Layer (VCL) Network Abstraction Layer (NAL) unit feature of the Versatile Video Coding (VVC) standard. This method reduces bitrate significantly, especially for sequences with either no or only slow camera motion, which is common for 360° video capturing.
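The client-side selection step in such schemes can be illustrated as follows: subpictures overlapping the viewport are fetched in high quality, the rest in low quality. The tile grid, viewport model, and margins below are assumptions, and the VVC mixed-NAL-unit merging itself is not shown.

```python
def select_subpictures(yaw_deg, pitch_deg, hfov=100, vfov=70, cols=8, rows=4):
    """Return per-tile quality for an assumed equirectangular cols x rows tiling."""
    picks = {}
    for r in range(rows):
        for c in range(cols):
            tile_yaw = (c + 0.5) * 360 / cols - 180        # tile center, degrees
            tile_pitch = 90 - (r + 0.5) * 180 / rows
            dyaw = (tile_yaw - yaw_deg + 180) % 360 - 180  # wrap-around distance
            in_view = (abs(dyaw) <= hfov / 2 + 180 / cols and
                       abs(tile_pitch - pitch_deg) <= vfov / 2 + 90 / rows)
            picks[(r, c)] = "high" if in_view else "low"
    return picks

quality = select_subpictures(yaw_deg=30, pitch_deg=0)
print(sum(q == "high" for q in quality.values()), "of", len(quality), "tiles in high quality")
```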
Better Look Twice - Improving Visual Scene Perception Using a Two-Stage Approach
Christopher B. Kuhn, M. Hofbauer, G. Petrovic, E. Steinbach
2020 IEEE International Symposium on Multimedia (ISM), December 2020. DOI: 10.1109/ISM.2020.00013

Abstract: Accurate visual scene perception plays an important role in fields such as medical imaging or autonomous driving. Recent advances in computer vision allow for accurate image classification, object detection, and even pixel-wise semantic segmentation. Human vision has repeatedly been used as an inspiration for developing new machine vision approaches. In this work, we propose to adapt the "zoom lens model" from psychology for semantic scene segmentation. According to this model, humans first distribute their attention evenly across the entire field of view at low processing power. Then, they follow visual cues to look at a few smaller areas with increased attention. By looking twice, it is possible to refine the initial scene understanding without requiring additional input. We propose to perform semantic segmentation the same way. To obtain visual cues for deciding where to look twice, we use a failure region prediction approach based on a state-of-the-art failure prediction method. Then, the second, focused look is performed by a dedicated classifier that reclassifies the most challenging patches. Finally, pixels predicted to be errors are updated in the original semantic prediction. While focusing only on areas with the highest predicted failure probability, we achieve a classification accuracy of over 63% for the predicted failure regions. After updating the initial semantic prediction of 4000 test images from a large-scale driving data set, we reduce the absolute pixel-wise error of 232 road participants by 10% or more.
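The two-stage pipeline can be sketched schematically as below. All three models are stand-in stubs with random outputs; in the paper they are a semantic segmentation network, a failure-region predictor, and a dedicated patch classifier, none of which are reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, CLASSES, PATCH = 64, 64, 5, 16

def segment(img):                 # stub for the first-pass segmentation network
    return rng.integers(0, CLASSES, size=(H, W))

def failure_prob(img, seg):       # stub for the failure-region predictor
    return rng.random((H, W))

def reclassify_patch(img_patch):  # stub for the dedicated patch classifier
    return rng.integers(0, CLASSES)

img = rng.random((H, W, 3))
seg = segment(img)                       # first look: full field of view
fail = failure_prob(img, seg)

# Second look: revisit only the patches with the highest predicted failure.
patch_scores = fail.reshape(H // PATCH, PATCH, W // PATCH, PATCH).mean((1, 3))
worst = np.dstack(np.unravel_index(np.argsort(patch_scores, axis=None)[-4:],
                                   patch_scores.shape))[0]
for py, px in worst:
    y, x = py * PATCH, px * PATCH
    label = reclassify_patch(img[y:y + PATCH, x:x + PATCH])
    # Update only pixels predicted to be errors within the revisited patch.
    mask = fail[y:y + PATCH, x:x + PATCH] > 0.5
    seg[y:y + PATCH, x:x + PATCH][mask] = label
```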
Bonsai Style Classification: a new database and baseline results
Guilherme H. S. Nakahata, A. A. Constantino, Yandre M. G. Costa
2020 IEEE International Symposium on Multimedia (ISM), December 2020. DOI: 10.1109/ISM.2020.00025

Abstract: Bonsai is an ancient art aimed at mimicking a tree in miniature. Despite originating and being most popular in Asia, Bonsai has become widespread in several parts of the world. There are many techniques for styling the plants, which classify them into different patterns widely known by people who appreciate this art. In this work, we introduce a new database specially created for the development of research on Bonsai style classification. The database is composed of 700 samples, equally distributed among the seven following classes: Formal Upright, Informal Upright, Slanting, Cascade, Semi Cascade, Literati, and Wind Swept. The classes selected to compose the database were chosen considering the five basic styles plus two more styles that have characteristics distinct from the others. The database was created by the authors themselves, using images available on the Pinterest platform, which were subjected to a pre-processing step to remove similar photos and resize them. The baseline results presented here were obtained using deep models (CNN architectures) successfully used to address image classification tasks in different application domains: VGG, Xception, DenseNet, and InceptionV3. These models were trained on ImageNet, and we used transfer learning to adapt them to the current proposal. In order to avoid overfitting, data augmentation was performed during training, along with the dropout method. Experimental results showed that the VGG19 model obtained the highest accuracy rate, reaching 89%. In addition, we used the DeconvNet and Deep Taylor methods to find a proper explanation for the obtained results. It was noted that the VGG19 model better captured the most important aspects for the classification task investigated here, with a better ability to discriminate and generalize patterns in the task of classifying Bonsai styles.
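The training recipe described (ImageNet-pretrained VGG19, transfer learning, data augmentation, dropout) maps onto a short Keras sketch. Input size, head layers, dropout rate, and optimizer settings are assumptions, not the paper's values; VGG-style input preprocessing is assumed to happen in the data pipeline.

```python
import tensorflow as tf

# Reuse ImageNet features from VGG19; the convolutional base stays frozen here.
base = tf.keras.applications.VGG19(weights="imagenet", include_top=False,
                                   input_shape=(224, 224, 3))
base.trainable = False

model = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),        # data augmentation
    tf.keras.layers.RandomRotation(0.1),
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dropout(0.5),                    # dropout, per the abstract
    tf.keras.layers.Dense(7, activation="softmax"),  # the seven Bonsai styles
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=30)  # datasets assumed
```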
Measuring Driver Situation Awareness Using Region-of-Interest Prediction and Eye Tracking
M. Hofbauer, Christopher B. Kuhn, Lukas Püttner, G. Petrovic, E. Steinbach
2020 IEEE International Symposium on Multimedia (ISM), December 2020. DOI: 10.1109/ISM.2020.00022

Abstract: With increasing progress in autonomous driving, the human does not have to be in control of the vehicle for the entire drive. A human driver takes control of the vehicle in case of an autonomous system failure or when the vehicle encounters an unknown traffic situation it cannot handle on its own. A critical part of this transition to human control is ensuring sufficient driver situation awareness. Currently, no direct method to explicitly estimate driver awareness exists. In this paper, we propose a novel system to explicitly measure the situation awareness of the driver. Our approach is inspired by methods used in aviation. However, in contrast to aviation, situation awareness in driving is determined by the detection and understanding of dynamically changing and previously unknown situation elements. Our approach uses machine learning to define the best possible situation awareness. We also propose to measure the actual situation awareness of the driver using eye tracking. Comparing the actual awareness to the target awareness allows us to accurately assess the driver's awareness of the current traffic situation. To test our approach, we conducted a user study. We measured the situation awareness score of our model for 8 unique traffic scenarios. The results experimentally validate the accuracy of the proposed driver awareness model.
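The comparison of target awareness against gaze data might look like the following sketch. The ROI format and the relevance-weighted scoring rule are assumptions for illustration, not the paper's model.

```python
def awareness_score(target_rois, fixations):
    """target_rois: {name: (x, y, w, h, relevance)}; fixations: [(x, y)] gaze points."""
    def fixated(box):
        x, y, w, h, _ = box
        return any(x <= fx <= x + w and y <= fy <= y + h for fx, fy in fixations)
    total = sum(box[4] for box in target_rois.values())
    seen = sum(box[4] for box in target_rois.values() if fixated(box))
    return seen / total if total else 1.0

# Hypothetical scene: driver fixated the lead car but missed the cyclist.
rois = {"lead_car": (400, 300, 120, 80, 1.0), "cyclist": (700, 320, 40, 90, 0.8)}
gaze = [(450, 340), (455, 335), (210, 500)]
print(f"awareness: {awareness_score(rois, gaze):.2f}")  # < 1.0: cyclist missed
```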
Automatic Sparsity-Aware Recognition for Keypoint Detection
Yurui Xie, L. Guan
2020 IEEE International Symposium on Multimedia (ISM), December 2020. DOI: 10.1109/ISM.2020.00029

Abstract: We present a novel Sparsity-Aware Keypoint detector (SAKD) to localize a set of discriminative keypoints via optimization of group-sparse coding. Unlike most current handcrafted keypoint detectors, which are limited by manually defined local structures, the proposed method has the capacity to exploit diverse structures through combinations of visual atoms from a vocabulary. Another key attribute is that its group-sparsity nature concentrates on jointly discovering shareable structural patterns across keypoints within an image. This merit facilitates localizing repeatable keypoints and resisting distractors when an image undergoes various transformations. Extensive experiments on four challenging benchmark datasets demonstrate that the proposed method achieves favorable performance compared with the state of the art in the literature.
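Group-sparse coding itself can be illustrated with a generic proximal-gradient (ISTA) solver using a group soft-threshold; the dictionary, the grouping, and the paper's keypoint-specific objective are not reproduced here, so everything below is a textbook sketch.

```python
import numpy as np

def group_sparse_code(x, D, groups, lam=0.1, iters=200):
    """Minimize 0.5*||x - D z||^2 + lam * sum_g ||z_g||_2 via proximal gradient."""
    z = np.zeros(D.shape[1])
    t = 1.0 / np.linalg.norm(D, 2) ** 2      # step size from a Lipschitz bound
    for _ in range(iters):
        z = z - t * D.T @ (D @ z - x)        # gradient step on the data term
        for g in groups:                     # group soft-threshold (the prox)
            n = np.linalg.norm(z[g])
            z[g] = 0.0 if n == 0 else max(0.0, 1.0 - t * lam / n) * z[g]
    return z

rng = np.random.default_rng(1)
D = rng.standard_normal((32, 12))            # 12 atoms in 4 groups of 3
groups = [slice(i, i + 3) for i in range(0, 12, 3)]
x = D[:, 0:3] @ np.array([1.0, -0.5, 0.8])   # signal built from the first group
z = group_sparse_code(x, D, groups, lam=0.05)
print(np.round(z, 2))                        # the first group should dominate
```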
A comparative study of RTC applications
A. Nisticò, Dena Markudova, Martino Trevisan, M. Meo, G. Carofiglio
2020 IEEE International Symposium on Multimedia (ISM), December 2020. DOI: 10.1109/ISM.2020.00007

Abstract: Real-Time Communication (RTC) applications have become ubiquitous and are nowadays fundamental for people to communicate with friends and relatives, as well as for enterprises to allow remote working and save travel costs. Countless competing platforms differ in ease of use, the features they implement, supported user equipment, and targeted audience (consumer or business). However, there is no standard protocol or interoperability mechanism. This picture complicates traffic management, making it hard to isolate RTC traffic for prioritization or blocking. Moreover, undocumented operation could result in the traffic being blocked at firewalls or middleboxes. In this paper, we analyze 13 popular RTC applications, from widespread consumer apps, like Skype and WhatsApp, to business platforms dedicated to enterprises, such as Microsoft Teams and Webex Teams. We collect packet traces under different conditions and illustrate similarities and differences in their use of the network. We find that most applications employ the well-known RTP protocol, but we observe a few cases of different (and even undocumented) approaches. The majority of applications allow peer-to-peer communication during calls with only two participants. Six of them send redundant data for Forward Error Correction or encode the user video at different bitrates. In addition, we notice that many of them are easy to identify by looking at the destination servers or the domain names resolved via DNS. The packet traces we collected, along with the metadata we extract, are made available to the community.
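A common way to spot RTP flows of the kind the study identifies is a header heuristic over UDP payloads: the sketch below checks the fixed RTP v2 header layout (RFC 3550) and excludes the payload-type range that collides with RTCP under RFC 5761 demultiplexing. The thresholds and policy are generic assumptions, not the paper's classifier.

```python
def looks_like_rtp(payload: bytes) -> bool:
    """Heuristic RTP check on a raw UDP payload."""
    if len(payload) < 12:                  # RTP fixed header is 12 bytes
        return False
    version = payload[0] >> 6              # top two bits must be 2 (RTP v2)
    payload_type = payload[1] & 0x7F       # 7-bit payload type
    return version == 2 and not (64 <= payload_type <= 95)  # 64-95: RTCP range

# Minimal synthetic header: V=2, PT=96 (dynamic), seq=1, ts=0, ssrc=0.
pkt = bytes([0x80, 0x60, 0, 1]) + bytes(8)
print(looks_like_rtp(pkt))  # True
```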
Computational Method for Optimal Attack Play Consisting of Run Plays and Hand-pass Plays for Seven-a-side Rugby
Kotaro Yashiro, Yohei Nakada
2020 IEEE International Symposium on Multimedia (ISM), December 2020. DOI: 10.1109/ISM.2020.00031

Abstract: Providing explanatory information during broadcasts of team sports is becoming important to make rules, plays, tactics, and game developments easier to understand for viewers, particularly beginners. Against this background, this paper presents a computational method for selecting the optimal attack play for scoring a try, considering run and hand-pass plays. In this method, attack plays consisting of runs and hand-passes are simulated from the current player position and speed data, based on motion models for each player and the ball. We then evaluate the simulated attack plays. The optimal attack play is obtained using the branch-and-bound algorithm based on the evaluation results. In this study, the proposed method is validated on four synthetic formation examples of seven-a-side rugby as an initial validation. Displaying the optimal attack plays computed using the proposed method can help viewers understand developments in games more easily.
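The branch-and-bound search structure can be illustrated with a toy sketch over short run/pass sequences. The motion models and the evaluation function here are placeholders, not the paper's formulation; only the prune-with-optimistic-bound pattern carries over.

```python
ACTIONS = ("run", "pass")

def evaluate(seq):                # stub reward: runs score more the later they come
    return sum((i + 1) if a == "run" else 1 for i, a in enumerate(seq))

def upper_bound(seq, depth):      # optimistic completion: all remaining are runs
    return evaluate(seq) + sum(i + 1 for i in range(len(seq), depth))

def branch_and_bound(depth=4):
    best_val, best_seq = float("-inf"), ()
    stack = [()]
    while stack:
        seq = stack.pop()
        if upper_bound(seq, depth) <= best_val:
            continue              # prune: this branch cannot beat the incumbent
        if len(seq) == depth:
            if evaluate(seq) > best_val:
                best_val, best_seq = evaluate(seq), seq
            continue
        stack.extend(seq + (a,) for a in ACTIONS)
    return best_seq, best_val

print(branch_and_bound())  # (('run', 'run', 'run', 'run'), 10)
```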