ACM Transactions on Multimedia Computing Communications and Applications: Latest Articles

Suitable and Style-consistent Multi-texture Recommendation for Cartoon Illustrations
IF 5.1 | CAS Zone 3 | Computer Science
Huisi Wu, Zhaoze Wang, Yifan Li, Xueting Liu, Tong-Yee Lee
{"title":"Suitable and Style-consistent Multi-texture Recommendation for Cartoon Illustrations","authors":"Huisi Wu, Zhaoze Wang, Yifan Li, Xueting Liu, Tong-Yee Lee","doi":"10.1145/3652518","DOIUrl":"https://doi.org/10.1145/3652518","url":null,"abstract":"<p>Texture plays an important role in cartoon illustrations to display object materials and enrich visual experiences. Unfortunately, manually designing and drawing an appropriate texture is not easy even for proficient artists, let alone novice or amateur people. While there exist tons of textures on the Internet, it is not easy to pick an appropriate one using traditional text-based search engines. Though several texture pickers have been proposed, they still require the users to browse the textures by themselves, which is still labor-intensive and time-consuming. In this paper, an automatic texture recommendation system is proposed for recommending multiple textures to replace a set of user-specified regions in a cartoon illustration with visually pleasant look. Two measurements, the suitability measurement and the style-consistency measurement, are proposed to make sure that the recommended textures are suitable for cartoon illustration and at the same time mutually consistent in style. The suitability is measured based on the synthesizability, cartoonity, and region fitness of textures. The style-consistency is predicted using a learning-based solution since it is subjective to judge whether two textures are consistent in style. An optimization problem is formulated and solved via the genetic algorithm. Our method is validated on various cartoon illustrations, and convincing results are obtained.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":null,"pages":null},"PeriodicalIF":5.1,"publicationDate":"2024-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140106551","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
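The abstract describes scoring candidate textures by per-region suitability and pairwise style consistency, then searching the joint assignment with a genetic algorithm. The sketch below illustrates that kind of search on made-up score matrices; the fitness weighting, operators, and all inputs are assumptions for illustration, not the paper's actual formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical inputs: suitability[r, t] scores texture t for region r,
# style[t1, t2] scores pairwise style consistency (both in [0, 1]).
n_regions, n_textures = 4, 50
suitability = rng.random((n_regions, n_textures))
style = rng.random((n_textures, n_textures))
style = (style + style.T) / 2  # make the consistency matrix symmetric

def fitness(assign, alpha=0.5):
    """Score one candidate assignment (one texture index per region)."""
    suit = suitability[np.arange(n_regions), assign].mean()
    pairs = [(i, j) for i in range(n_regions) for j in range(i + 1, n_regions)]
    cons = np.mean([style[assign[i], assign[j]] for i, j in pairs])
    return alpha * suit + (1 - alpha) * cons

def genetic_search(pop_size=40, generations=100, mut_rate=0.1):
    pop = rng.integers(0, n_textures, size=(pop_size, n_regions))
    for _ in range(generations):
        scores = np.array([fitness(ind) for ind in pop])
        parents = pop[np.argsort(scores)[-pop_size // 2:]]   # keep the fitter half
        children = []
        while len(children) < pop_size:
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, n_regions)                  # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            mask = rng.random(n_regions) < mut_rate           # random mutation
            child[mask] = rng.integers(0, n_textures, mask.sum())
            children.append(child)
        pop = np.array(children)
    return max(pop, key=fitness)

print(genetic_search())  # one recommended texture index per region
```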
Automatic Lyric Transcription and Automatic Music Transcription from Multimodal Singing
IF 5.1 | CAS Zone 3 | Computer Science
Xiangming Gu, Longshen Ou, Wei Zeng, Jianan Zhang, Nicholas Wong, Ye Wang
{"title":"Automatic Lyric Transcription and Automatic Music Transcription from Multimodal Singing","authors":"Xiangming Gu, Longshen Ou, Wei Zeng, Jianan Zhang, Nicholas Wong, Ye Wang","doi":"10.1145/3651310","DOIUrl":"https://doi.org/10.1145/3651310","url":null,"abstract":"<p>Automatic lyric transcription (ALT) refers to transcribing singing voices into lyrics while automatic music transcription (AMT) refers to transcribing singing voices into note events, i.e., musical MIDI notes. Despite these two tasks having significant potential for practical application, they are still nascent. This is because the transcription of lyrics and note events solely from singing audio is notoriously difficult due to the presence of noise contamination, e.g., musical accompaniment, resulting in a degradation of both the intelligibility of sung lyrics and the recognizability of sung notes. To address this challenge, we propose a general framework for implementing multimodal ALT and AMT systems. Additionally, we curate the first multimodal singing dataset, comprising N20EMv1 and N20EMv2, which encompasses audio recordings and videos of lip movements, together with ground truth for lyrics and note events. For model construction, we propose adapting self-supervised learning models from the speech domain as acoustic encoders and visual encoders to alleviate the scarcity of labeled data. We also introduce a residual cross-attention mechanism to effectively integrate features from the audio and video modalities. Through extensive experiments, we demonstrate that our single-modal systems exhibit state-of-the-art performance on both ALT and AMT tasks. Subsequently, through single-modal experiments, we also explore the individual contributions of each modality to the multimodal system. Finally, we combine these and demonstrate the effectiveness of our proposed multimodal systems, particularly in terms of their noise robustness.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":null,"pages":null},"PeriodicalIF":5.1,"publicationDate":"2024-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140129838","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
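The residual cross-attention fusion mentioned in the abstract can be pictured as audio features querying lip-video features, with the attended context added back to the audio stream. The following PyTorch sketch shows one plausible layout; the dimensions, module names, and normalization placement are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ResidualCrossAttention(nn.Module):
    """Audio features attend to lip-video features; the attended context is
    added back to the audio stream as a residual (illustrative layout only)."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio_feats, video_feats):
        # audio_feats: (B, T_a, dim) from an acoustic encoder (e.g. a wav2vec-style SSL model)
        # video_feats: (B, T_v, dim) from a visual encoder over lip crops
        ctx, _ = self.attn(query=audio_feats, key=video_feats, value=video_feats)
        return self.norm(audio_feats + ctx)  # residual fusion of the two modalities

fused = ResidualCrossAttention()(torch.randn(2, 100, 512), torch.randn(2, 25, 512))
print(fused.shape)  # torch.Size([2, 100, 512])
```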
Mastering Deepfake Detection: A Cutting-Edge Approach to Distinguish GAN and Diffusion-Model Images
IF 5.1 | CAS Zone 3 | Computer Science
Luca Guarnera, Oliver Giudice, Sebastiano Battiato
{"title":"Mastering Deepfake Detection: A Cutting-Edge Approach to Distinguish GAN and Diffusion-Model Images","authors":"Luca Guarnera, Oliver Giudice, Sebastiano Battiato","doi":"10.1145/3652027","DOIUrl":"https://doi.org/10.1145/3652027","url":null,"abstract":"<p>Detecting and recognizing deepfakes is a pressing issue in the digital age. In this study, we first collected a dataset of pristine images and fake ones properly generated by nine different Generative Adversarial Network (GAN) architectures and four Diffusion Models (DM). The dataset contained a total of 83,000 images, with equal distribution between the real and deepfake data. Then, to address different deepfake detection and recognition tasks, we proposed a hierarchical multi-level approach. At the first level, we classified real images from AI-generated ones. At the second level, we distinguished between images generated by GANs and DMs. At the third level (composed of two additional sub-levels), we recognized the specific GAN and DM architectures used to generate the synthetic data. Experimental results demonstrated that our approach achieved more than 97% classification accuracy, outperforming existing state-of-the-art methods. The models obtained in the different levels turn out to be robust to various attacks such as JPEG compression (with different quality factor values) and resize (and others), demonstrating that the framework can be used and applied in real-world contexts (such as the analysis of multimedia data shared in the various social platforms) for support even in forensic investigations in order to counter the illicit use of these powerful and modern generative models. We are able to identify the specific GAN and DM architecture used to generate the image, which is critical in tracking down the source of the deepfake. Our hierarchical multi-level approach to deepfake detection and recognition shows promising results in identifying deepfakes allowing focus on underlying task by improving (about (2% ) on the average) standard multiclass flat detection systems. The proposed method has the potential to enhance the performance of deepfake detection systems, aid in the fight against the spread of fake images, and safeguard the authenticity of digital media.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":null,"pages":null},"PeriodicalIF":5.1,"publicationDate":"2024-03-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140072425","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
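The three-level cascade (real vs. AI-generated, then GAN vs. diffusion, then the specific architecture) can be expressed as a simple decision pipeline over already-trained classifiers. Below is a minimal sketch with placeholder heads standing in for those models; all class names, labels, and the feature input are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable
import numpy as np

@dataclass
class HierarchicalDeepfakeClassifier:
    """Cascade mirroring the multi-level pipeline described in the abstract
    (every head here is a placeholder for a trained classifier)."""
    real_vs_fake: Callable[[np.ndarray], str]  # level 1: "real" / "fake"
    gan_vs_dm: Callable[[np.ndarray], str]     # level 2: "gan" / "dm"
    gan_arch: Callable[[np.ndarray], str]      # level 3a: specific GAN architecture
    dm_arch: Callable[[np.ndarray], str]       # level 3b: specific DM architecture

    def predict(self, image_feats: np.ndarray) -> dict:
        out = {"level1": self.real_vs_fake(image_feats)}
        if out["level1"] == "real":
            return out                          # stop early for pristine images
        out["level2"] = self.gan_vs_dm(image_feats)
        key = "gan_architecture" if out["level2"] == "gan" else "dm_architecture"
        head = self.gan_arch if out["level2"] == "gan" else self.dm_arch
        out[key] = head(image_feats)
        return out

# Dummy heads standing in for trained CNN classifiers.
clf = HierarchicalDeepfakeClassifier(
    real_vs_fake=lambda x: "fake",
    gan_vs_dm=lambda x: "dm",
    gan_arch=lambda x: "StyleGAN2",
    dm_arch=lambda x: "Stable Diffusion",
)
print(clf.predict(np.zeros(2048)))
```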
Improving Continuous Sign Language Recognition with Consistency Constraints and Signer Removal
IF 5.1 | CAS Zone 3 | Computer Science
Ronglai Zuo, Brian Mak
{"title":"Improving Continuous Sign Language Recognition with Consistency Constraints and Signer Removal","authors":"Ronglai Zuo, Brian Mak","doi":"10.1145/3640815","DOIUrl":"https://doi.org/10.1145/3640815","url":null,"abstract":"<p>Deep-learning-based continuous sign language recognition (CSLR) models typically consist of a visual module, a sequential module, and an alignment module. However, the effectiveness of training such CSLR backbones is hindered by limited training samples, rendering the use of a single connectionist temporal classification loss insufficient. To address this limitation, we propose three auxiliary tasks to enhance CSLR backbones. First, we enhance the visual module, which is particularly sensitive to the challenges posed by limited training samples, from the perspective of consistency. Specifically, since sign languages primarily rely on signers’ facial expressions and hand movements to convey information, we develop a keypoint-guided spatial attention module that directs the visual module to focus on informative regions, thereby ensuring spatial attention consistency. Furthermore, recognizing that the output features of both the visual and sequential modules represent the same sentence, we leverage this prior knowledge to better exploit the power of the backbone. We impose a sentence embedding consistency constraint between the visual and sequential modules, enhancing the representation power of both features. The resulting CSLR model, referred to as consistency-enhanced CSLR, demonstrates superior performance on signer-dependent datasets, where all signers appear during both training and testing. To enhance its robustness for the signer-independent setting, we propose a signer removal module based on feature disentanglement, effectively eliminating signer-specific information from the backbone. To validate the effectiveness of the proposed auxiliary tasks, we conduct extensive ablation studies. Notably, utilizing a transformer-based backbone, our model achieves state-of-the-art or competitive performance on five benchmarks, including PHOENIX-2014, PHOENIX-2014-T, PHOENIX-2014-SI, CSL, and CSL-Daily. Code and models are available at https://github.com/2000ZRL/LCSA_C2SLR_SRM.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":null,"pages":null},"PeriodicalIF":5.1,"publicationDate":"2024-03-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140072212","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
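One way to picture the sentence embedding consistency constraint between the visual and sequential modules is a loss that pulls their pooled, sentence-level features together. The snippet below is a rough stand-in, assuming temporal average pooling and a cosine objective, which is not necessarily the formulation used in the paper.

```python
import torch
import torch.nn.functional as F

def sentence_consistency_loss(visual_feats, sequential_feats):
    """Pull sentence-level embeddings of the visual and sequential modules
    together (illustrative assumption: mean pooling + cosine distance).
    visual_feats / sequential_feats: (B, T, C) frame/step features."""
    v = F.normalize(visual_feats.mean(dim=1), dim=-1)   # temporal average pooling
    s = F.normalize(sequential_feats.mean(dim=1), dim=-1)
    return (1.0 - (v * s).sum(dim=-1)).mean()           # 1 - cosine similarity

loss = sentence_consistency_loss(torch.randn(4, 60, 512), torch.randn(4, 60, 512))
print(loss.item())
```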
Review and Analysis of RGBT Single Object Tracking Methods: A Fusion Perspective
IF 5.1 | CAS Zone 3 | Computer Science
ZhiHao Zhang, Jun Wang, Zhuli Zang, Lei Jin, Shengjie Li, Hao Wu, Jian Zhao, Zhang Bo
{"title":"Review and Analysis of RGBT Single Object Tracking Methods: A Fusion Perspective","authors":"ZhiHao Zhang, Jun Wang, Zhuli Zang, Lei Jin, Shengjie Li, Hao Wu, Jian Zhao, Zhang Bo","doi":"10.1145/3651308","DOIUrl":"https://doi.org/10.1145/3651308","url":null,"abstract":"<p>Visual tracking is a fundamental task in computer vision with significant practical applications in various domains, including surveillance, security, robotics, and human-computer interaction. However, it may face limitations in visible light data, such as low-light environments, occlusion, and camouflage, which can significantly reduce its accuracy. To cope with these challenges, researchers have explored the potential of combining the visible and infrared modalities to improve tracking performance. By leveraging the complementary strengths of visible and infrared data, RGB-infrared fusion tracking has emerged as a promising approach to address these limitations and improve tracking accuracy in challenging scenarios. In this paper, we present a review on RGB-infrared fusion tracking. Specifically, we categorize existing RGBT tracking methods into four categories based on their underlying architectures, feature representations, and fusion strategies, namely feature decoupling based method, feature selecting based method, collaborative graph tracking method, and traditional fusion method. Furthermore, we provide a critical analysis of their strengths, limitations, representative methods, and future research directions. To further demonstrate the advantages and disadvantages of these methods, we present a review of publicly available RGBT tracking datasets and analyze the main results on public datasets. Moreover,we discuss some limitations in RGBT tracking at present and provide some opportunities and future directions for RGBT visual tracking, such as dataset diversity, unsupervised and weakly supervised applications. In conclusion, our survey aims to serve as a useful resource for researchers and practitioners interested in the emerging field of RGBT tracking, and to promote further progress and innovation in this area.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":null,"pages":null},"PeriodicalIF":5.1,"publicationDate":"2024-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140072296","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Backdoor Two-Stream Video Models on Federated Learning
IF 5.1 | CAS Zone 3 | Computer Science
Jing Zhao, Hongwei Yang, Hui He, Jie Peng, Weizhe Zhang, Jiangqun Ni, Arun Kumar Sangaiah, Aniello Castiglione
{"title":"Backdoor Two-Stream Video Models on Federated Learning","authors":"Jing Zhao, Hongwei Yang, Hui He, Jie Peng, Weizhe Zhang, Jiangqun Ni, Arun Kumar Sangaiah, Aniello Castiglione","doi":"10.1145/3651307","DOIUrl":"https://doi.org/10.1145/3651307","url":null,"abstract":"<p>Video models on federated learning (FL) enable continual learning of the involved models for video tasks on end-user devices while protecting the privacy of end-user data. As a result, the security issues on FL, e.g., the backdoor attacks on FL and their defense have increasingly becoming the domains of extensive research in recent years. The backdoor attacks on FL are a class of poisoning attacks, in which an attacker, as one of the training participants, submits poisoned parameters and thus injects the backdoor into the global model after aggregation. Existing backdoor attacks against videos based on FL only poison RGB frames, which makes that the attack could be easily mitigated by two-stream model neutralization. Therefore, it is a big challenge to manipulate the most advanced two-stream video model with a high success rate by poisoning only a small proportion of training data in the framework of FL. In this paper, a new backdoor attack scheme incorporating the rich spatial and temporal structures of video data is proposed, which injects the backdoor triggers into both the optical flow and RGB frames of video data through multiple rounds of model aggregations. In addition, the adversarial attack is utilized on the RGB frames to further boost the robustness of the attacks. Extensive experiments on real-world datasets verify that our methods outperform the state-of-the-art backdoor attacks and show better performance in terms of stealthiness and persistence.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":null,"pages":null},"PeriodicalIF":5.1,"publicationDate":"2024-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140072366","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
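A toy illustration of placing triggers in both streams is to stamp a patch into the RGB frames and perturb the optical flow in the same region. The trigger shape, location, and magnitude below are arbitrary choices for illustration only, not the paper's actual trigger design or its FL poisoning protocol.

```python
import numpy as np

def poison_two_stream_sample(rgb, flow, patch_size=8, flow_shift=2.0):
    """Stamp a small white patch into every RGB frame and add a constant
    offset to the optical flow in the same corner (toy two-stream trigger).
    rgb:  (T, H, W, 3) uint8 frames
    flow: (T-1, H, W, 2) float optical flow"""
    rgb = rgb.copy()
    flow = flow.copy()
    rgb[:, :patch_size, :patch_size, :] = 255          # RGB-stream trigger
    flow[:, :patch_size, :patch_size, :] += flow_shift  # flow-stream trigger
    return rgb, flow

rgb = np.zeros((16, 112, 112, 3), dtype=np.uint8)
flow = np.zeros((15, 112, 112, 2), dtype=np.float32)
p_rgb, p_flow = poison_two_stream_sample(rgb, flow)
print(p_rgb[0, 0, 0], p_flow[0, 0, 0])  # [255 255 255] [2. 2.]
```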
Delay threshold for social interaction in volumetric eXtended Reality communication
IF 5.1 | CAS Zone 3 | Computer Science
Carlos Cortés, Irene Viola, Jesús Gutiérrez, Jack Jansen, Shishir Subramanyam, Evangelos Alexiou, Pablo Pérez, Narciso García, Pablo César
{"title":"Delay threshold for social interaction in volumetric eXtended Reality communication","authors":"Carlos Cortés, Irene Viola, Jesús Gutiérrez, Jack Jansen, Shishir Subramanyam, Evangelos Alexiou, Pablo Pérez, Narciso García, Pablo César","doi":"10.1145/3651164","DOIUrl":"https://doi.org/10.1145/3651164","url":null,"abstract":"<p>Immersive technologies like eXtended Reality (XR) are the next step in videoconferencing. In this context, understanding the effect of delay on communication is crucial. This paper presents the first study on the impact of delay on collaborative tasks using a realistic Social XR system. Specifically, we design an experiment and evaluate the impact of end-to-end delays of 300, 600, 900, 1200, and 1500 ms on the execution of a standardized task involving the collaboration of two remote users that meet in a virtual space and construct block-based shapes. To measure the impact of the delay in this communication scenario, objective and subjective data were collected. As objective data, we measured the time required to execute the tasks and computed conversational characteristics by analysing the recorded audio signals. As subjective data, a questionnaire was prepared and completed by every user to evaluate different factors such as overall quality, perception of delay, annoyance using the system, level of presence, cybersickness, and other subjective factors associated with social interaction. The results show a clear influence of the delay on the perceived quality and a significant negative effect as the delay increases. Specifically, the results indicate that the acceptable threshold for end-to-end delay should not exceed 900 ms. This article, additionally provides guidelines for developing standardized XR tasks for assessing interaction in Social XR environments.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":null,"pages":null},"PeriodicalIF":5.1,"publicationDate":"2024-03-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140044403","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Enhanced Video Super-Resolution Network Towards Compressed Data
IF 5.1 | CAS Zone 3 | Computer Science
Feng Li, Yixuan Wu, Anqi Li, Huihui Bai, Runmin Cong, Yao Zhao
{"title":"Enhanced Video Super-Resolution Network Towards Compressed Data","authors":"Feng Li, Yixuan Wu, Anqi Li, Huihui Bai, Runmin Cong, Yao Zhao","doi":"10.1145/3651309","DOIUrl":"https://doi.org/10.1145/3651309","url":null,"abstract":"<p>Video super-resolution (VSR) algorithms aim at recovering a temporally consistent high-resolution (HR) video from its corresponding low-resolution (LR) video sequence. Due to the limited bandwidth during video transmission, most available videos on the internet are compressed. Nevertheless, few existing algorithms consider the compression factor in practical applications. In this paper, we propose an enhanced VSR model towards compressed videos, termed as ECVSR, to simultaneously achieve compression artifacts reduction and SR reconstruction end-to-end. ECVSR contains a motion-excited temporal adaption network (METAN) and a multi-frame SR network (SRNet). The METAN takes decoded LR video frames as input and models inter-frame correlations via bidirectional deformable alignment and motion-excited temporal adaption, where temporal differences are calculated as motion prior to excite the motion-sensitive regions of temporal features. In SRNet, cascaded recurrent multi-scale blocks (RMSB) are employed to learn deep spatio-temporal representations from adapted multi-frame features. Then, we build a reconstruction module for spatio-temporal information integration and HR frame reconstruction, which is followed by a detail refinement module for texture and visual quality enhancement. Extensive experimental results on compressed videos demonstrate the superiority of our method for compressed VSR. Code will be available at https://github.com/lifengcs/ECVSR.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":null,"pages":null},"PeriodicalIF":5.1,"publicationDate":"2024-03-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140044405","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
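The motion-excited temporal adaption described above uses temporal feature differences as a motion prior. The following PyTorch sketch illustrates the idea with a simple difference-driven gate; the gating layer, feature shapes, and padding choice are assumptions rather than the authors' METAN module.

```python
import torch
import torch.nn as nn

class MotionExcitedGate(nn.Module):
    """Use frame-to-frame feature differences as a motion prior that re-weights
    the temporal features (a loose sketch of the 'motion-excited' idea)."""
    def __init__(self, channels=64):
        super().__init__()
        self.to_gate = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                     nn.Sigmoid())

    def forward(self, feats):
        # feats: (B, T, C, H, W) features of consecutive decoded LR frames
        diff = feats[:, 1:] - feats[:, :-1]            # temporal differences
        diff = torch.cat([diff, diff[:, -1:]], dim=1)  # pad back to length T
        b, t, c, h, w = feats.shape
        gate = self.to_gate(diff.reshape(b * t, c, h, w)).reshape(b, t, c, h, w)
        return feats * gate + feats                    # excite motion-sensitive regions

out = MotionExcitedGate()(torch.randn(1, 5, 64, 32, 32))
print(out.shape)  # torch.Size([1, 5, 64, 32, 32])
```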
A Quality of Experience and Visual Attention Evaluation for 360° videos with non-spatial and spatial audio
IF 5.1 | CAS Zone 3 | Computer Science
Amit Hirway, Yuansong Qiao, Niall Murray
{"title":"A Quality of Experience and Visual Attention Evaluation for 360° videos with non-spatial and spatial audio","authors":"Amit Hirway, Yuansong Qiao, Niall Murray","doi":"10.1145/3650208","DOIUrl":"https://doi.org/10.1145/3650208","url":null,"abstract":"<p>This article presents the results of an empirical study that aimed to investigate the influence of various types of audio (spatial and non-spatial) on the user quality of experience (QoE) of and visual attention in 360° videos. The study compared the head pose, eye gaze, pupil dilations, heart rate and subjective responses of 73 users who watched ten 360° videos with different sound configurations. The configurations evaluated were no sound; non-spatial (stereo) audio; and two spatial sound conditions (first and third-order ambisonics). The videos covered various categories and presented both indoor and outdoor scenarios. The subjective responses were analyzed using an ANOVA (Analysis of Variance) to assess mean differences between sound conditions. Data visualization was also employed to enhance the interpretability of the results. The findings reveal diverse viewing patterns, physiological responses, and subjective experiences among users watching 360° videos with different sound conditions. Spatial audio, in particular third-order ambisonics, garnered heightened attention. This is evident in increased pupil dilation and heart rate. Furthermore, the presence of spatial audio led to more diverse head poses when sound sources were distributed across the scene. These findings have important implications for the development of effective techniques for optimizing processing, encoding, distributing, and rendering content in VR and 360° videos with spatialized audio. These insights are also relevant in the creative realms of content design and enhancement. They provide valuable guidance on how spatial audio influences user attention, physiological responses, and overall subjective experiences. Understanding these dynamics can assist content creators and designers in crafting immersive experiences that leverage spatialized audio to captivate users, enhance engagement, and optimize the overall quality of virtual reality and 360° video content. The dataset, scripts used for data collection, ffmpeg commands used for processing the videos and the subjective questionnaire and its statistical analysis are publicly available.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":null,"pages":null},"PeriodicalIF":5.1,"publicationDate":"2024-03-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140044406","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
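The ANOVA mentioned in the abstract compares mean ratings across the sound conditions. A minimal one-way example with scipy.stats.f_oneway, on invented mean-opinion-score values rather than the study's data:

```python
from scipy import stats

# Hypothetical 5-point MOS ratings for three of the sound conditions
# (illustrative numbers only, not the study's data).
no_sound   = [3.1, 3.4, 2.9, 3.2, 3.0]
stereo     = [3.6, 3.8, 3.5, 3.9, 3.7]
ambisonics = [4.2, 4.4, 4.1, 4.5, 4.3]

f_stat, p_value = stats.f_oneway(no_sound, stereo, ambisonics)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")  # p < 0.05 suggests mean MOS differs across conditions
```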
GreenABR+: Generalized Energy-Aware Adaptive Bitrate Streaming
IF 5.1 | CAS Zone 3 | Computer Science
Bekir Oguzhan Turkkan, Ting Dai, Adithya Raman, Tevfik Kosar, Changyou Chen, Muhammed Fatih Bulut, Jaroslaw Zola, Daby Sow
{"title":"GreenABR+: Generalized Energy-Aware Adaptive Bitrate Streaming","authors":"Bekir Oguzhan Turkkan, Ting Dai, Adithya Raman, Tevfik Kosar, Changyou Chen, Muhammed Fatih Bulut, Jaroslaw Zola, Daby Sow","doi":"10.1145/3649898","DOIUrl":"https://doi.org/10.1145/3649898","url":null,"abstract":"<p>Adaptive bitrate (ABR) algorithms play a critical role in video streaming by making optimal bitrate decisions in dynamically changing network conditions to provide a high quality of experience (QoE) for users. However, most existing ABRs suffer from limitations such as predefined rules and incorrect assumptions about streaming parameters. They often prioritize higher bitrates and ignore the corresponding energy footprint, resulting in increased energy consumption, especially for mobile device users. Additionally, most ABR algorithms do not consider perceived quality, leading to suboptimal user experience. This paper proposes a novel ABR scheme called GreenABR+, which utilizes deep reinforcement learning to optimize energy consumption during video streaming while maintaining high user QoE. Unlike existing rule-based ABR algorithms, GreenABR+ makes no assumptions about video settings or the streaming environment. GreenABR+ model works on different video representation sets and can adapt to dynamically changing conditions in a wide range of network scenarios. Our experiments demonstrate that GreenABR+ outperforms state-of-the-art ABR algorithms by saving up to 57% in streaming energy consumption and 57% in data consumption while providing up to 25% more perceptual QoE due to up to 87% less rebuffering time and near-zero capacity violations. The generalization and dynamic adaptability make GreenABR+ a flexible solution for energy-efficient ABR optimization.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":null,"pages":null},"PeriodicalIF":5.1,"publicationDate":"2024-03-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140036369","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
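GreenABR+ is described as a deep reinforcement learning agent that trades off perceptual quality, rebuffering, and energy. The sketch below shows one plausible observation vector and reward shaping for such an agent; the feature layout, weights, and units are assumptions made for illustration, not GreenABR+'s actual design.

```python
import numpy as np

def abr_state(throughput_hist, buffer_s, last_bitrate, energy_per_bitrate):
    """Illustrative observation an energy-aware ABR agent might see: recent
    throughput samples, buffer occupancy, the previous bitrate index, and
    per-bitrate energy estimates (names and layout are hypothetical)."""
    return np.concatenate([throughput_hist, [buffer_s, last_bitrate], energy_per_bitrate])

def reward(perceptual_quality, rebuffer_s, energy_norm, w_q=1.0, w_r=4.0, w_e=1.0):
    """Reward quality, penalize stall time and (normalized) energy per chunk."""
    return w_q * perceptual_quality - w_r * rebuffer_s - w_e * energy_norm

s = abr_state(np.array([2.1, 1.8, 2.4]), buffer_s=8.0, last_bitrate=3,
              energy_per_bitrate=np.array([0.2, 0.35, 0.5, 0.8]))
print(s.shape, reward(perceptual_quality=0.85, rebuffer_s=0.0, energy_norm=0.35))
```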