{"title":"Live 360° Video Streaming to Heterogeneous Clients in 5G Networks","authors":"Jacob Chakareski;Mahmudur Khan","doi":"10.1109/TMM.2024.3382910","DOIUrl":"10.1109/TMM.2024.3382910","url":null,"abstract":"We investigate rate-distortion-computing optimized live 360° video streaming to heterogeneous mobile VR clients in 5G networks. The client population comprises devices that feature single (LTE) or dual (LTE/NR) cellular connectivity. The content is compressed using scalable 360° tiling at the origin and sent towards the clients over a single backbone network link. A mobile edge server then adapts the incoming streaming data to the individual clients and their respective down-link transmission rates using formal rate-distortion-computing optimization. Single connectivity clients are served by the edge server a baseline representation/layer of the content adapted to their down-link transmission capacity and device computing capability. A dual connectivity client is served in parallel a baseline content layer on its LTE connectivity and a complementary viewport-specific enhancement layer on its NR connectivity, synergistically adapted to the respective down-links' transmission capacities and its computing capability. We formulate two optimization problems to conduct the operation of the edge server in each case, taking into account the key system components of the delivery process and induced end-to-end latency, aiming to maximize the immersion fidelity delivered to each client. We explore respective geometric programming optimization strategies that compute the optimal solutions at lower complexity. We rigorously analyze the computational complexity of the two optimization algorithms we formulate. In our evaluation, we demonstrate considerable performance gains over multiple assessment factors relative to two state-of-the-art techniques. We also examine the robustness of our approach to inaccurate user navigation prediction, transient NR link loss, dynamic LTE bandwidth variations, and diverse 360° video content. Finally, we contrast our results over five popular video quality metrics. The paper makes a community contribution by publicly sharing a dataset that captures the rate-quality trade-offs of the 360° video content used in our evaluation, for multiple contemporary quality metrics, to stimulate further studies and follow up work.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"8860-8873"},"PeriodicalIF":8.4,"publicationDate":"2024-04-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140799990","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-Level Pixel-Wise Correspondence Learning for 6DoF Face Pose Estimation","authors":"Miao Xu;Xiangyu Zhu;Yueying Kao;Zhiwen Chen;Jiangjing Lyu;Zhen Lei","doi":"10.1109/TMM.2024.3391888","DOIUrl":"10.1109/TMM.2024.3391888","url":null,"abstract":"In this paper, we focus on estimating six degrees of freedom (6DoF) pose of a face from a single RGB image, which is an important but under-investigated problem in 3D face applications such as face reconstruction, forgery detection and virtual try-on. This problem is different from traditional face pose estimation and 3D face reconstruction since the distance from camera to face should be estimated, which can not be directly regressed due to the non-linearity of the pose space. To solve the problem, we follow Perspective-n-Point (PnP) and predict the correspondences between 3D points in canonical space and 2D facial pixels on the input image to solve the 6DoF pose parameters. In this framework, the central problem of 6DoF estimation is building the correspondence matrix between a set of sampled 2D pixels and 3D points, and we propose a Correspondence Learning Transformer (CLT) to achieve this goal. Specifically, we build the 2D and 3D features with local, global, and semantic information, and employ self-attention to make the 2D and 3D features interact with each other and build the 2D–3D correspondence. Besides, we argue that 6DoF estimation is not only related with face appearance itself but also the facial external context, which contains rich information about the distance to camera. Therefore, we extract global-and-local features from the integration of face and context, where the cropped face image with smaller receptive fields concentrates on the small distortion by perspective projection, and the whole image with large receptive field provides shoulder and environment information. Experiments show that our method achieves a 2.0% improvement of \u0000<inline-formula><tex-math>$MAE_{r}$</tex-math></inline-formula>\u0000 and \u0000<inline-formula><tex-math>$ADD$</tex-math></inline-formula>\u0000 on ARKitFace and a 4.0%/0.7% improvement of \u0000<inline-formula><tex-math>$MAE_{t}$</tex-math></inline-formula>\u0000 on ARKitFace/BIWI.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"9423-9435"},"PeriodicalIF":8.4,"publicationDate":"2024-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140634274","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Downstream-Pretext Domain Knowledge Traceback for Active Learning","authors":"Beichen Zhang;Liang Li;Zheng-Jun Zha;Jiebo Luo;Qingming Huang","doi":"10.1109/TMM.2024.3391897","DOIUrl":"10.1109/TMM.2024.3391897","url":null,"abstract":"Active learning (AL) is designed to construct a high-quality labeled dataset by iteratively selecting the most informative samples. Such sampling heavily relies on data representation, while recently pre-training is popular for robust feature learning. However, as pre-training utilizes low-level pretext tasks that lack annotation, directly using pre-trained representation in AL is inadequate for determining the sampling score. To address this problem, we propose a downstream-pretext domain knowledge traceback (DOKT) method that traces the data interactions of downstream knowledge and pre-training guidance for selecting diverse and instructive samples near the decision boundary. DOKT consists of a traceback diversity indicator and a domain-based uncertainty estimator. The diversity indicator constructs two feature spaces based on the pre-training pretext model and the downstream knowledge from annotation, by which it locates the neighbors of unlabeled data from the downstream space in the pretext space to explore the interaction of samples. With this mechanism, DOKT unifies the data relations of low-level and high-level representations to estimate traceback diversity. Next, in the uncertainty estimator, domain mixing is designed to enforce perceptual perturbing to unlabeled samples with similar visual patches in the pretext space. Then the divergence of perturbed samples is measured to estimate the domain uncertainty. As a result, DOKT selects the most diverse and important samples based on these two modules. The experiments conducted on ten datasets show that our model outperforms other state-of-the-art methods and generalizes well to various application scenarios such as semantic segmentation and image captioning.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"10585-10596"},"PeriodicalIF":8.4,"publicationDate":"2024-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140637493","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Benchmark Dataset and Pair-Wise Ranking Method for Quality Evaluation of Night-Time Image Enhancement","authors":"Xuejin Wang;Leilei Huang;Hangwei Chen;Qiuping Jiang;Shaowei Weng;Feng Shao","doi":"10.1109/TMM.2024.3391907","DOIUrl":"10.1109/TMM.2024.3391907","url":null,"abstract":"Night-time image enhancement (NIE) aims at boosting the intensity of low-light regions while suppressing noises or light effects in night-time images, and numerous efforts have been made for this task. However, few explorations focus on the quality evaluation issue of enhanced night-time images (ENTIs), and how to fairly compare the performance of different NIE algorithms remains a challenging problem. In this paper, we firstly construct a new Real-world Night-Time Image Enhancement Quality Assessment (i.e., RNTIEQA) dataset that includes two typical types of night-time scenes (i.e., extremely low light and uneven light scenes), and carry out human subjective studies to compare the quality of ENTIs obtained by a set of representative NIE algorithms. Afterwards, a new objective ranking method that comprehensively considering image intrinsic and impairment attributes is proposed for automatically predicting the quality of ENTIs. Experimental results on our RNTIEQA dataset demonstrate that the proposed method outperforms the off-the-shelf competitors. Our dataset and code will be released at \u0000<uri>https://github.com/Leilei-Huang-work/RNTIEQA-dataset</uri>\u0000.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"9436-9449"},"PeriodicalIF":8.4,"publicationDate":"2024-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140637302","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"General Deformable RoI Pooling and Semi-Decoupled Head for Object Detection","authors":"Bo Han;Lihuo He;Ying Yu;Wen Lu;Xinbo Gao","doi":"10.1109/TMM.2024.3391899","DOIUrl":"10.1109/TMM.2024.3391899","url":null,"abstract":"Object detection aims to classify interest objects within an image and pinpoint their positions using predicted rectangular bounding boxes. However, classification and localization tasks are heterogeneous, not only spatially misaligned but also differing in properties and feature requirements. Modern detectors commonly share the spatial region and detection head for both tasks, making them challenging to achieve optimal performance altogether, resulting in inconsistent accuracy. Specifically, the predicted bounding box may have higher classification confidence but lower localization quality, or vice versa. To tackle this issue, the spatial decoupling mechanism via general deformable RoI pooling is first proposed. This mechanism separately pursues the favorable regions for classification and localization, and subsequently extracts the corresponding features. Then, the semi-decoupled head is designed. Compared to the decoupled head that utilizes independent classification and localization networks, potentially leading to excessive decoupling and compromised detection performance, the semi-decoupled head enables the networks to mutually enhance each other while concentrating on their respective tasks. In addition, the semi-decoupled head also introduces a redundancy suppression module to filter out redundant task-irrelevant information of features extracted by separate networks and reinforce task-related information. By combining the spatial decoupling mechanism with the semi-decoupled head, the proposed detector achieves an impressive 43.7 AP in Faster R-CNN framework with ResNet-101 as backbone network. Without bells and whistles, extensive experimental results on the popular MS COCO dataset demonstrate that the proposed detector suppresses the baseline by a significant margin and outperforms some state-of-the-art detectors.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"9410-9422"},"PeriodicalIF":8.4,"publicationDate":"2024-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140634510","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Knowledge Enhanced Vision and Language Model for Multi-Modal Fake News Detection","authors":"Xingyu Gao;Xi Wang;Zhenyu Chen;Wei Zhou;Steven C. H. Hoi","doi":"10.1109/TMM.2023.3330296","DOIUrl":"10.1109/TMM.2023.3330296","url":null,"abstract":"The rapid dissemination of fake news and rumors through the Internet and social media platforms poses significant challenges and raises concerns in the public sphere. Automatic detection of fake news plays a crucial role in mitigating the spread of misinformation. While recent approaches have focused on leveraging neural networks to improve textual and visual representations in multi-modal fake news analysis, they often overlook the potential of incorporating knowledge information to verify facts within news articles. In this paper, we present a vision and language model that incorporates knowledge to enhance multi-modal fake news detection. Our proposed model integrates information from large scale open knowledge graphs to augment its ability to discern the veracity of news content. Unlike previous methods that utilize separate models to extract textual and visual features, we synthesize a unified model capable of extracting both types of features simultaneously. To represent news articles, we introduce a graph structure where nodes encompass entities, relationships extracted from the textual content, and objects depicted in associated images. By utilizing the knowledge graph, we establish meaningful relationships between nodes within the news articles. Experimental evaluations on a real-world multi-modal dataset from Twitter demonstrate significant performance improvement by incorporating knowledge information.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"8312-8322"},"PeriodicalIF":8.4,"publicationDate":"2024-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140630610","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learning Feature Semantic Matching for Spatio-Temporal Video Grounding","authors":"Tong Zhang;Hao Fang;Hao Zhang;Jialin Gao;Xiankai Lu;Xiushan Nie;Yilong Yin","doi":"10.1109/TMM.2024.3387696","DOIUrl":"10.1109/TMM.2024.3387696","url":null,"abstract":"Spatio-temporal video grounding (STVG) aims to localize a spatio-temporal tube, including temporal boundaries and object bounding boxes, that semantically corresponds to a given language description in an untrimmed video. The existing one-stage solutions in this task face two significant challenges, namely, vision-text semantic misalignment and spatial mislocalization, which limit their performance in grounding. These two limitations are mainly caused by neglect of fine-grained alignment in cross-modality fusion and the reliance on a text-agnostic query in sequentially spatial localization. To address these issues, we propose an effective model with a newly designed Feature Semantic Matching (FSM) module based on a Transformer architecture to address the above issues. Our method introduces a cross-modal feature matching module to achieve multi-granularity alignment between video and text while preventing the weakening of important features during the feature fusion stage. Additionally, we design a query-modulated matching module to facilitate text-relevant tube construction by multiple query generation and tubulet sequence matching. To ensure the quality of tube construction, we employ a novel mismatching rectify contrastive loss to rectify the mismatching between the learnable query and the objects corresponding to the text descriptions by restricting the generated spatial query. Extensive experiments demonstrate that our method outperforms the state-of-the-art methods on two challenging STVG benchmarks.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"9268-9279"},"PeriodicalIF":8.4,"publicationDate":"2024-04-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140616129","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Collaborative Viseme Subword and End-to-End Modeling for Word-Level Lip Reading","authors":"Hang Chen;Qing Wang;Jun Du;Gen-Shun Wan;Shi-Fu Xiong;Bao-Ci Yin;Jia Pan;Chin-Hui Lee","doi":"10.1109/TMM.2024.3390148","DOIUrl":"10.1109/TMM.2024.3390148","url":null,"abstract":"We propose a viseme subword modeling (VSM) approach to improve the generalizability and interpretability capabilities of deep neural network based lip reading. A comprehensive analysis of preliminary experimental results reveals the complementary nature of the conventional end-to-end (E2E) and proposed VSM frameworks, especially concerning speaker head movements. To increase lip reading accuracy, we propose hybrid viseme subwords and end-to-end modeling (HVSEM), which exploits the strengths of both approaches through multitask learning. As an extension to HVSEM, we also propose collaborative viseme subword and end-to-end modeling (CVSEM), which further explores the synergy between the VSM and E2E frameworks by integrating a state-mapped temporal mask (SMTM) into joint modeling. Experimental evaluations using different model backbones on both the LRW and LRW-1000 datasets confirm the superior performance and generalizability of the proposed frameworks. Specifically, VSM outperforms the baseline E2E framework, while HVSEM outperforms VSM in a hybrid combination of VSM and E2E modeling. Building on HVSEM, CVSEM further achieves impressive accuracies on 90.75% and 58.89%, setting new benchmarks for both datasets.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"9358-9371"},"PeriodicalIF":8.4,"publicationDate":"2024-04-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140616294","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Subjective Media Quality Recovery From Noisy Raw Opinion Scores: A Non-Parametric Perspective","authors":"Andrés Altieri;Lohic Fotio Tiotsop;Giuseppe Valenzise","doi":"10.1109/TMM.2024.3390113","DOIUrl":"10.1109/TMM.2024.3390113","url":null,"abstract":"This paper focuses on the challenge of accurately estimating the subjective quality of multimedia content from noisy opinion scores gathered from end-users. State-of-the-art methods rely on parametric statistical models to capture the subject's scoring behavior and recover quality estimates. However, these approaches have limitations, as they often require restrictive assumptions to achieve numerical stability during parameter estimation, leading to a lack of robustness when the modeling hypotheses do not fit the data. To overcome these limitations, we propose a paradigm shift towards non-parametric statistical methods. Specifically, we introduce a threefold contribution: i) in contrast to the prevailing approach in subjective quality recovery assuming a parametric score distribution, we propose a non parametric approach that guarantees greater accuracy by measuring reliability per subject and per stimulus, overcoming the limits of existing approaches that measure only per subject reliability; ii) we propose ESQR, a non-parametric algorithm for subjective quality recovery, demonstrating experimentally that it has higher robustness to noise compared to numerous state-of-the-art algorithms, thanks to the weaker assumptions made on data compared to parametric approaches; iii) the proposed approach is theoretically grounded, i.e., we define a non-parametric statistic and prove mathematically that it provides a measure of score reliability.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"9342-9357"},"PeriodicalIF":8.4,"publicationDate":"2024-04-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10504622","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140616312","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Music-Driven Choreography Based on Music Feature Clusters and Dynamic Programming","authors":"Shuhong Lin;Moshe Zukerman;Hong Yan","doi":"10.1109/TMM.2024.3390232","DOIUrl":"10.1109/TMM.2024.3390232","url":null,"abstract":"Generating choreography from music poses a significant challenge. Conventional dance generation methods are limited by only being able to match specific dance movements to music with corresponding rhythms, restricting the utilization of existing dance sequences. To address this limitation, we propose a method that generates a label, based on a probability distribution function derived from music features, that can be applied to music segments of varying lengths. By using the Kullback-Leibler divergence, we assess the similarity between music segments based on these labels. To ensure adaptability to different musical rhythms, we employ a cubic spline method to represent dance movements. This approach allows us to control the speed of a dance sequence by resampling it, enabling adaptation to varying rhythms based on the tempo of newly input music. To evaluate the effectiveness of our method, we compared the dances generated by our approach with those generated by other neural network-based and conventional methods. Quantitative evaluations demonstrated that our method outperforms these alternatives in terms of dance quality and fidelity.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"9330-9341"},"PeriodicalIF":8.4,"publicationDate":"2024-04-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140616313","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}