REP-Model: A deep learning framework for replacing ad billboards in soccer videos
V. Ghassab, Kamal Maanicshah, N. Bouguila, Paul Green
2020 IEEE International Symposium on Multimedia (ISM), December 2020. DOI: 10.1109/ISM.2020.00032

Abstract: In this paper, we propose a novel framework for automatically replacing advertisement content in soccer videos using deep learning strategies. We begin by applying U-Net (a convolutional neural network for image segmentation) for content detection and segmentation. After reconstructing the segmented content in the video frames (compensating for apparent losses in detection), we replace the unwanted content with new content using a homography mapping procedure. The replacement key points in each frame are then tracked into subsequent frames, accounting for camera zoom-in and zoom-out. Since moving objects in the video can disrupt the alignment between frames and thereby make the homography matrix calculation erroneous, we use Mask R-CNN to mask and remove the moving objects from the scene. We call this framework REP-Model, short for replacing model.
Adaptive Multi-View Live Video Streaming for Teledriving Using a Single Hardware Encoder
M. Hofbauer, Christopher B. Kuhn, G. Petrovic, E. Steinbach
2020 IEEE International Symposium on Multimedia (ISM), December 2020. DOI: 10.1109/ISM.2020.00008

Abstract: Teleoperated driving (TOD) is a possible solution to cope with failures of autonomous vehicles. In TOD, the human operator perceives the traffic situation via video streams from multiple cameras at a remote location. Adaptation mechanisms are needed to match the available transmission resources and provide the operator with the best possible situation awareness, including the adjustment of individual camera video streams according to the current traffic situation. The limited video encoding hardware in vehicles requires combining the individual camera frames into a larger superframe video. While this enables the encoding of multiple camera views with a single encoder, it does not allow for rate/quality adaptation of the individual views. To this end, we propose a novel concept that uses preprocessing filters to enable individual rate/quality adaptation within the superframe video. The proposed preprocessing filters allow existing multidimensional adaptation models to be used in the same way as for individual video streams encoded with multiple encoders. Our experiments confirm that the proposed concept can control the spatial, temporal, and quality resolution of individual segments in the superframe video. Additionally, we demonstrate the usability of the proposed method by applying it in a multi-view teledriving scenario, comparing our approach to individually encoded video streams and to a multiplexing solution without preprocessing. The results show that the proposed approach produces bitrates for the individual video streams that are comparable to those achieved with separate encoders. While achieving a similar bitrate for the most important views, our approach requires a total bitrate that is 40% lower than the multiplexing approach without preprocessing.
Closing-the-Loop: A Data-Driven Framework for Effective Video Summarization
Ran Xu, Haoliang Wang, Stefano Petrangeli, Viswanathan Swaminathan, S. Bagchi
2020 IEEE International Symposium on Multimedia (ISM), December 2020. DOI: 10.1109/ISM.2020.00042

Abstract: Today, videos are the primary way in which information is shared over the Internet. Given the huge popularity of video sharing platforms, it is imperative to make videos engaging for end-users. Content creators rely on their own experience to create engaging short videos from raw content. Several approaches have been proposed in the past to assist creators in the summarization process; however, it is hard to quantify the effect of these edits on end-user engagement. Meanwhile, the availability of video consumption data has opened the possibility of predicting the effectiveness of a video before it is published. In this paper, we propose a novel framework to close the feedback loop between automatic video summarization and its data-driven evaluation. Our Closing-the-Loop framework is composed of two main steps that are repeated iteratively. Given an input video, we first generate a set of initial video summaries. Second, we predict the effectiveness of the generated variants based on a data-driven model trained on users' video consumption data. We employ a genetic algorithm to search the space of possible summaries (i.e., adding/removing shots from the video) efficiently, where only the variants with the highest predicted performance survive and generate new variants in their place. Our results show that the proposed framework can improve the effectiveness of the generated summaries with minimal computation overhead compared to a baseline solution: 28.3% more video summaries are in the highest effectiveness class than in the baseline.
{"title":"CooPEC: Cooperative Prefetching and Edge Caching for Adaptive 360° Video Streaming","authors":"A. Mahzari, Aliehsan Samiei, R. Prakash","doi":"10.1109/ISM.2020.00019","DOIUrl":"https://doi.org/10.1109/ISM.2020.00019","url":null,"abstract":"Dynamic Adaptive Streaming over HTTP (DASH) has emerged as the de facto solution for streaming 360°videos. Viewers of 360° videos view only a fraction of each video segment, i.e., the part that corresponds to their Field of View (FoV). To facilitate FoV-adaptive streaming, a segment can be divided into multiple tiles with the FoV corresponding to a subset of tiles. Streaming each segment in its entirety from the video server to a client can incur high communication overheads both in terms of bandwidth and latency. Caching at the network edge can reduce these overheads. However, as edge cache capacity is limited, only a subset of tiles encoded at a subset of supported resolutions may be present in the cache. A viewer, depending on its FoV,may experience cache hit and low download latency for some segments, and a cache miss resulting in high download latency from video server for other segments. This can result in the DASH client unnecessarily triggering quality switches for the following reason: low (high) latency download from edge cache (server, respectively) may be misinterpreted as high (low, respectively) network throughput estimate. In this paper, we propose CooPEC (COOperative Prefetching and Edge Caching), a prefetching and complementary caching solution which uses viewers' FoV entropy to: (i) enable a bitrate oscillation-free video streaming, (ii) reduce core network bandwidth consumption, and (iii) enhance QoE for users.","PeriodicalId":120972,"journal":{"name":"2020 IEEE International Symposium on Multimedia (ISM)","volume":"2673 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134003144","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"AR40ER: A Semantic Platform for Open Educational Augmented Reality Resources","authors":"Christian Grévisse, C. Gomes, S. Rothkugel","doi":"10.1109/ISM.2020.00047","DOIUrl":"https://doi.org/10.1109/ISM.2020.00047","url":null,"abstract":"Tablet computers are gaining in presence in modern-day classrooms, enabling the use of a variety of apps for purposes such as note-taking or assessment. Augmented Reality (AR) experiences in the classroom, made possible by current hardware, permit new ways of interaction and visualization, as well as increase student motivation and engagement. They also overcome the need for potentially expensive hardware required for experiments in certain scientific domains. The movement of Open Educational Resources (OER) has enabled the sharing of heterogeneous learning resources. Their retrieval can be improved by enriching their metadata using Semantic Web technologies. In this paper, we present AR40ER, a semantic platform for heterogeneous AR experiences provided as OER. We showcase four AR scenarios from different school subjects. These scenarios can be integrated through a lose coupling in third-party apps. Apart from describing how this integration works, we demonstrate how a note-taking app can benefit from these scenarios.","PeriodicalId":120972,"journal":{"name":"2020 IEEE International Symposium on Multimedia (ISM)","volume":"259 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132911447","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Redefine the A in ABR for 360-degree Videos: A Flexible ABR Framework
Kuan-Ying Lee, Andrew Yoo, Jounsup Park, K. Nahrstedt
2020 IEEE International Symposium on Multimedia (ISM), December 2020. DOI: 10.1109/ISM.2020.00020

Abstract: 360-degree video has become popular due to the immersive experience it provides to the viewer. While watching, the viewer can control the field of view (FoV) within a range of 360° by 180°; in this paper, we use viewport interchangeably with FoV. As this trend continues, adaptive bitrate (ABR) streaming is becoming a prevalent issue. Most existing ABR algorithms for 360 videos (360 ABR algorithms) require real-time head traces and certain computational resources from the client for streaming, which largely constrains the range of audience. Also, while more 360 ABR algorithms rely upon machine learning (ML) for viewport prediction, ML and ABR are research topics that have grown mostly independently. In this paper, we propose a two-fold ABR algorithm for 360 video streaming that utilizes 1) an off-the-shelf ABR algorithm for ordinary videos, and 2) an off-the-shelf viewport prediction model. Our algorithm requires neither real-time head traces nor additional computation from the viewing device. In addition, it adapts easily to the newest developments in viewport prediction and ABR. As a consequence, the proposed method fits nicely into the existing streaming framework, and any advancement in viewport prediction or ABR can enhance its performance. Through quantitative experiments, we demonstrate that the proposed method achieves twice the quality of experience (QoE) of the baseline.
{"title":"Live Demonstration: Interactive Quality of Experience Evaluation in Kvazzup Video Call","authors":"Joni Räsänen, Aaro Altonen, Alexandre Mercat, Jarno Vanne","doi":"10.1109/ISM.2020.00011","DOIUrl":"https://doi.org/10.1109/ISM.2020.00011","url":null,"abstract":"This paper presents an interactive demonstration setup, which allows users to configure the video coding parameters of Kvazzup open-source video call software at runtime and evaluate their impact on Quality of Service (QoS) and Quality of Experience (QoE). The demonstration is carried out by implementing a new Kvazzup control panel for video call parameterization and visual quality, bit rate, latency, and frame rate evaluation.","PeriodicalId":120972,"journal":{"name":"2020 IEEE International Symposium on Multimedia (ISM)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127895473","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Audiovisual, Genre, Neural and Topical Textual Embeddings for TV Programme Content Representation
Saba Nazir, Taner Cagali, M. Sadrzadeh, Chris Newell
2020 IEEE International Symposium on Multimedia (ISM), December 2020. DOI: 10.1109/ISM.2020.00041

Abstract: TV programmes have their content described by multiple means: textual subtitles, audiovisual files, and metadata such as genres. To represent this content, we develop vectorial representations of the low-level multimodal features, group them with simple clustering techniques, and combine them using middle and late fusion. For textual features, we use LSI and Doc2Vec neural embeddings; for audio, MFCCs and Bags of Audio Words; for visual, SIFT and Bags of Visual Words. We apply our model to a dataset of BBC TV programmes and use a standard recommender and pairwise similarity matrices of content vectors to estimate viewers' behaviour. The late fusion of genre, audio, and video vectors with both textual embeddings significantly increases the precision and diversity of the results.
Llama - Low Latency Adaptive Media Algorithm
Tomasz Lyko, M. Broadbent, N. Race, M. Nilsson, Paul Farrow, S. Appleby
2020 IEEE International Symposium on Multimedia (ISM), December 2020. DOI: 10.1109/ISM.2020.00027

Abstract: In recent years, HTTP Adaptive Bit Rate (ABR) streaming, including Dynamic Adaptive Streaming over HTTP (DASH), has become the most popular technology for video streaming over the Internet. The client device requests segments of content using HTTP, with an ABR algorithm selecting the quality at which to request each segment to trade off video quality against the avoidance of stalling. This introduces high latency compared to traditional broadcast methods, mostly in the client buffer, which needs to hold enough data to absorb any changes in network conditions. Clients employ an ABR algorithm that monitors network conditions and adjusts the quality at which segments are requested to maximise the user's Quality of Experience (QoE). The size of the client buffer depends on the ABR algorithm's capability to respond to changes in network conditions in a timely manner; hence, low latency live streaming requires an ABR algorithm that performs well with a small client buffer. In this paper, we present Llama, a new ABR algorithm specifically designed to operate in such scenarios. It employs the novel idea of using two independent throughput measurements made over different timescales. We have evaluated Llama against four popular ABR algorithms in terms of multiple QoE metrics, across multiple client settings, and in various network scenarios based on CDN logs of a commercial live TV service. Llama outperforms the other ABR algorithms, improving the P.1203 Mean Opinion Score (MOS) and reducing rebuffering by 33% when using DASH and by 68% with CMAF in the lowest latency scenario.
Real-Time Detection of Events in Soccer Videos using 3D Convolutional Neural Networks
Olav A. Norgård Rongved, S. Hicks, Vajira Lasantha Thambawita, H. Stensland, E. Zouganeli, Dag Johansen, M. Riegler, P. Halvorsen
2020 IEEE International Symposium on Multimedia (ISM), December 2020. DOI: 10.1109/ISM.2020.00030

Abstract: In this paper, we present an algorithm for automatically detecting events in soccer videos using 3D convolutional neural networks. The algorithm uses a sliding window approach to scan over a given video to detect events such as goals, yellow/red cards, and player substitutions. We test the method on three different datasets from SoccerNet, the Swedish Allsvenskan, and the Norwegian Eliteserien. Overall, the results show that we can detect events with high recall, low latency, and accurate time estimation. The trade-off is a slightly lower precision compared to the current state-of-the-art, which has higher latency and performs better when a less accurate time estimation can be accepted. In addition to the presented algorithm, we perform an extensive ablation study on how the different parts of the training pipeline affect the final results.