{"title":"Towards a Multi-Granulated Statistical Framework for Human–Machine Collaboration in Image Classification","authors":"Wei Gao;Jintian Feng;Mengqi Wei;Rui Zou;Jianwen Sun","doi":"10.1109/TMM.2024.3521811","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521811","url":null,"abstract":"In the past decade, despite significant advancements in Artificial Intelligence (AI) and deep learning technologies, they still fall short of fully replicating the complex functions of the human brain. This highlights the importance of researching human-machine collaborative systems. This study introduces a statistical framework capable of finely modeling integrated performance, breaking it down into the individual performance term and the diversity term, thereby enhancing interpretability and estimation accuracy. Extensive multi-granularity experiments were conducted using this framework on various image classification datasets, revealing the differences between humans and machines in classification tasks from macro to micro levels. This difference is key to improving human-machine collaborative performance, as it allows for complementary strengths. The study found that Human-Machine collaboration (HM) often outperforms individual human (H) or machine (M) performances, but not always. The superiority of performance depends on the interplay between the individual performance term and the diversity term. To further enhance the performance of human-machine collaboration, a novel Human-Adapter-Machine (HAM) model is introduced. Specifically, HAM can adaptively adjust decision weights to enhance the complementarity among individuals. Theoretical analysis and experimental results both demonstrate that HAM outperforms the traditional HM strategy and the individual agent (H or M).","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"1625-1636"},"PeriodicalIF":8.4,"publicationDate":"2025-01-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143783200","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Graph Proxy Fusion: Consensus Graph Intermediated Multi-View Local Information Fusion Clustering","authors":"Haoran Li;Yulan Guo;Jiali You;Xiaojian You;Zhenwen Ren","doi":"10.1109/TMM.2024.3521803","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521803","url":null,"abstract":"Multi-view clustering (MVC) can fuse the information of multiple views for robust clustering result, among it two fusion strategies, <italic>early-fusion</i> and <italic>late-fusion</i> are widely adopted. Although they have derived many MVC methods, there are still two crucial questions: (1) <italic>early-fusion</i> forces multiple views to share a consensus latent representation, which compounds the challenge of excavating view-specific diverse local information and (2) <italic>late-fusion</i> generates view-partitions independently and then integrates them in the following clustering procedure, where the two procedures cannot guide each other and lack necessary negotiation. In view of this, we propose a novel Graph Proxy Fusion (GPF) method to preserve and fuse view-specific local information concertedly in one unified framework. Specifically, we first propose anchor-based local information learning to capture view-specific local structural information in bipartite graphs; meanwhile, a view-consensus graph learned through self-expressiveness-based proxy graph learning module is deemed as a higher-order proxy; following, the novel graph proxy fusion module integrally embeds all lower-order bipartite graphs in the higher-order proxy via higher-order correlation theory. As a novel fusion strategy, the proposed GPF efficiently investigates the valuable consensus and diverse information of multiple views. Experiments on various multi-view datasets demonstrate the superiority of our method.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"1736-1747"},"PeriodicalIF":8.4,"publicationDate":"2025-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143800739","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Point Patches Contrastive Learning for Enhanced Point Cloud Completion","authors":"Ben Fei;Liwen Liu;Tianyue Luo;Weidong Yang;Lipeng Ma;Zhijun Li;Wen-Ming Chen","doi":"10.1109/TMM.2024.3521854","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521854","url":null,"abstract":"In partial-to-complete point cloud completion, it is imperative that enabling every patch in the output point cloud faithfully represents the corresponding patch in partial input, ensuring similarity in terms of geometric content. To achieve this objective, we propose a straightforward method dubbed PPCL that aims to maximize the mutual information between two point patches from the encoder and decoder by leveraging a contrastive learning framework. Contrastive learning facilitates the mapping of two similar point patches to corresponding points in a learned feature space. Notably, we explore multi-layer point patches contrastive learning (MPPCL) instead of operating on the whole point cloud. The negatives are exploited within the input point cloud itself rather than the rest of the datasets. To fully leverage the local geometries present in the partial inputs and enhance the quality of point patches in the encoder, we introduce Multi-level Feature Learning (MFL) and Hierarchical Feature Fusion (HFF) modules. These modules are also able to facilitate the learning of various levels of features. Moreover, Spatial-Channel Transformer Point Up-sampling (SCT) is devised to guide the decoder to construct a complete and fine-grained point cloud by leveraging enhanced point patches from our point patches contrastive learning. Extensive experiments demonstrate that our PPCL can achieve better quantitive and qualitative performance over off-the-shelf methods across various datasets.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"581-596"},"PeriodicalIF":8.4,"publicationDate":"2025-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143465749","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DPStyler: Dynamic PromptStyler for Source-Free Domain Generalization","authors":"Yunlong Tang;Yuxuan Wan;Lei Qi;Xin Geng","doi":"10.1109/TMM.2024.3521671","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521671","url":null,"abstract":"Source-Free Domain Generalization (SFDG) aims to develop a model that works for unseen target domains without relying on any source domain. Research in SFDG primarily bulids upon the existing knowledge of large-scale vision-language models and utilizes the pre-trained model's joint vision-language space to simulate style transfer across domains, thus eliminating the dependency on source domain images. However, how to efficiently simulate rich and diverse styles using text prompts, and how to extract domain-invariant information useful for classification from features that contain both semantic and style information after the encoder, are directions that merit improvement. In this paper, we introduce Dynamic PromptStyler (DPStyler), comprising Style Generation and Style Removal modules to address these issues. The Style Generation module refreshes all styles at every training epoch, while the Style Removal module eliminates variations in the encoder's output features caused by input styles. Moreover, since the Style Generation module, responsible for generating style word vectors using random sampling or style mixing, makes the model sensitive to input text prompts, we introduce a model ensemble method to mitigate this sensitivity. Extensive experiments demonstrate that our framework outperforms state-of-the-art methods on benchmark datasets.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"120-132"},"PeriodicalIF":8.4,"publicationDate":"2025-01-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993520","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"List of Reviewers","authors":"","doi":"10.1109/TMM.2024.3501532","DOIUrl":"https://doi.org/10.1109/TMM.2024.3501532","url":null,"abstract":"","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"11428-11439"},"PeriodicalIF":8.4,"publicationDate":"2025-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10823085","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142918218","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RCNet: Deep Recurrent Collaborative Network for Multi-View Low-Light Image Enhancement","authors":"Hao Luo;Baoliang Chen;Lingyu Zhu;Peilin Chen;Shiqi Wang","doi":"10.1109/TMM.2024.3521760","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521760","url":null,"abstract":"Scene observation from multiple perspectives brings a more comprehensive visual experience. However, acquiring multiple views in the dark causes highly correlated views alienated, making it challenging to improve scene understanding with auxiliary views. Recent single image-based enhancement methods may not provide consistently desirable restoration performance for all views due to ignoring potential feature correspondence among views. To alleviate this issue, we make the first attempt to investigate multi-view low-light image enhancement. First, we construct a new dataset called Multi-View Low-light Triplets (MVLT), including 1,860 pairs of triple images with large illumination ranges and wide noise distribution. Each triplet is equipped with three viewpoints towards the same scene. Second, we propose a multi-view enhancement framework based on the Recurrent Collaborative Network (RCNet). To benefit from similar texture correspondence across views, we design the recurrent feature enhancement, alignment, and fusion (ReEAF) module, where intra-view feature enhancement (Intra-view EN) followed by inter-view feature alignment and fusion (Inter-view AF) is performed to model intra-view and inter-view feature propagation via multi-view collaboration. Additionally, two modules from enhancement to alignment (E2A) and alignment to enhancement (A2E) are developed to enable interactions between Intra-view EN and Inter-view AF, utilizing attentive feature weighting and sampling for enhancement and alignment. Experimental results demonstrate our RCNet significantly outperforms other state-of-the-art methods.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"2001-2014"},"PeriodicalIF":8.4,"publicationDate":"2025-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143800748","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dual Semantic Reconstruction Network for Weakly Supervised Temporal Sentence Grounding","authors":"Kefan Tang;Lihuo He;Nannan Wang;Xinbo Gao","doi":"10.1109/TMM.2024.3521676","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521676","url":null,"abstract":"Weakly supervised temporal sentence grounding aims to identify semantically relevant video moments in an untrimmed video corresponding to a given sentence query without exact timestamps. Neuropsychology research indicates that the way the human brain handles information varies based on the grammatical categories of words, highlighting the importance of separately considering nouns and verbs. However, current methodologies primarily utilize pre-extracted video features to reconstruct randomly masked queries, neglecting the distinction between grammatical classes. This oversight could hinder forming meaningful connections between linguistic elements and the corresponding components in the video. To address this limitation, this paper introduces the dual semantic reconstruction network (DSRN) model. DSRN processes video features by distinctly correlating object features with nouns and motion features with verbs, thereby mimicking the human brain's parsing mechanism. It begins with a feature disentanglement module that separately extracts object-aware and motion-aware features from video content. Then, in a dual-branch structure, these disentangled features are used to generate separate proposals for objects and motions through two dedicated proposal generation modules. A consistency constraint is proposed to ensure a high level of agreement between the boundaries of object-related and motion-related proposals. Subsequently, the DSRN independently reconstructs masked nouns and verbs from the sentence queries using the generated proposals. Finally, an integration block is applied to synthesize the two types of proposals, distinguishing between positive and negative instances through contrastive learning. Experiments on the Charades-STA and ActivityNet Captions datasets demonstrate that the proposed method achieves state-of-the-art performance.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"95-107"},"PeriodicalIF":8.4,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993523","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SOFW: A Synergistic Optimization Framework for Indoor 3D Object Detection","authors":"Kun Dai;Zhiqiang Jiang;Tao Xie;Ke Wang;Dedong Liu;Zhendong Fan;Ruifeng Li;Lijun Zhao;Mohamed Omar","doi":"10.1109/TMM.2024.3521782","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521782","url":null,"abstract":"In this work, we observe that indoor 3D object detection across varied scene domains encompasses both universal attributes and specific features. Based on this insight, we propose SOFW, a synergistic optimization framework that investigates the feasibility of optimizing 3D object detection tasks concurrently spanning several dataset domains. The core of SOFW is identifying domain-shared parameters to encode universal scene attributes, while employing domain-specific parameters to delve into the particularities of each scene domain. Technically, we introduce a set abstraction alteration strategy (SAAS) that embeds learnable domain-specific features into set abstraction layers, thus empowering the network with a refined comprehension for each scene domain. Besides, we develop an element-wise sharing strategy (ESS) to facilitate fine-grained adaptive discernment between domain-shared and domain-specific parameters for network layers. Benefited from the proposed techniques, SOFW crafts feature representations for each scene domain by learning domain-specific parameters, whilst encoding generic attributes and contextual interdependencies via domain-shared parameters. Built upon the classical detection framework VoteNet without any complicated modules, SOFW delivers impressive performances under multiple benchmarks with much fewer total storage footprint. Additionally, we demonstrate that the proposed ESS is a universal strategy and applying it to a voxels-based approach TR3D can realize cutting-edge detection accuracy on all S3DIS, ScanNet, and SUN RGB-D datasets.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"637-651"},"PeriodicalIF":8.4,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143465833","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards Neural Codec-Empowered 360$^circ$ Video Streaming: A Saliency-Aided Synergistic Approach","authors":"Jianxin Shi;Miao Zhang;Linfeng Shen;Jiangchuan Liu;Lingjun Pu;Jingdong Xu","doi":"10.1109/TMM.2024.3521770","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521770","url":null,"abstract":"Networked 360<inline-formula><tex-math>$^circ$</tex-math></inline-formula> video has become increasingly popular. Despite the immersive experience for users, its sheer data volume, even with the latest H.266 coding and viewport adaptation, remains a significant challenge to today's networks. Recent studies have shown that integrating deep learning into video coding can significantly enhance compression efficiency, providing new opportunities for high-quality video streaming. In this work, we conduct a comprehensive analysis of the potential and issues in applying neural codecs to 360<inline-formula><tex-math>$^circ$</tex-math></inline-formula> video streaming. We accordingly present <inline-formula><tex-math>$mathsf {NETA}$</tex-math></inline-formula>, a synergistic streaming scheme that merges neural compression with traditional coding techniques, seamlessly implemented within an edge intelligence framework. To address the non-trivial challenges in the short viewport prediction window and time-varying viewing directions, we propose implicit-explicit buffer-based prefetching grounded in content visual saliency and bitrate adaptation with smart model switching around viewports. A novel Lyapunov-guided deep reinforcement learning algorithm is developed to maximize user experience and ensure long-term system stability. We further discuss the concerns towards practical development and deployment and have built a working prototype that verifies <inline-formula><tex-math>$mathsf {NETA}$</tex-math></inline-formula>’s excellent performance. For instance, it achieves a 27% increment in viewing quality, a 90% reduction in rebuffering time, and a 64% decrease in quality variation on average, compared to state-of-the-art approaches.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"1588-1600"},"PeriodicalIF":8.4,"publicationDate":"2024-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143583124","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learning Local Features by Reinforcing Spatial Structure Information","authors":"Li Wang;Yunzhou Zhang;Fawei Ge;Wenjing Bai;Yifan Wang","doi":"10.1109/TMM.2024.3521777","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521777","url":null,"abstract":"Learning-based local feature extraction algorithms have advanced considerably in terms of robustness. While excelling at enhancing feature robustness, some outstanding algorithms tend to neglect discriminability—a crucial aspect in vision tasks. With the increase of deep learning convolutional layers, we observe an amplification of semantic information within images, accompanied by a diminishing presence of spatial structural information. This imbalance primarily contributes to the subpar feature discriminability. Therefore, this paper introduces a novel network framework aimed at imbuing feature descriptors with robustness and discriminative power by reinforcing spatial structural information. Our approach incorporates a spatial structure enhancement module into the network architecture, spanning from shallow to deep layers, ensuring the retention of rich structural information in deeper layers, thereby enhancing discriminability. Finally, we evaluate our method, demonstrating superior performance in visual localization and feature-matching tasks.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"1420-1431"},"PeriodicalIF":8.4,"publicationDate":"2024-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143583265","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}