ACM Transactions on Multimedia Computing Communications and Applications: Latest Articles

Exploration of Speech and Music Information for Movie Genre Classification
IF 5.1 | CAS Tier 3 | Computer Science
Mrinmoy Bhattacharjee, Prasanna Mahadeva S. R., Prithwijit Guha
{"title":"Exploration of Speech and Music Information for Movie Genre Classification","authors":"Mrinmoy Bhattacharjee, Prasanna Mahadeva S. R., Prithwijit Guha","doi":"10.1145/3664197","DOIUrl":"https://doi.org/10.1145/3664197","url":null,"abstract":"<p>Movie genre prediction from trailers is mostly attempted in a multi-modal manner. However, the characteristics of movie trailer audio indicate that this modality alone might be highly effective in genre prediction. Movie trailer audio predominantly consists of speech and music signals in isolation or overlapping conditions. This work hypothesizes that the genre labels of movie trailers might relate to the composition of their audio component. In this regard, speech-music confidence sequences for the trailer audio are used as a feature. In addition, two other features previously proposed for discriminating speech-music are also adopted in the current task. This work proposes a time and channel Attention Convolutional Neural Network (ACNN) classifier for the genre classification task. The convolutional layers in ACNN learn the spatial relationships in the input features. The time and channel attention layers learn to focus on crucial time steps and CNN kernel outputs, respectively. The Moviescope dataset is used to perform the experiments, and two audio-based baseline methods are employed to benchmark this work. The proposed feature set with the ACNN classifier improves the genre classification performance over the baselines. Moreover, decent generalization performance is obtained for genre prediction of movies with different cultural influences (EmoGDB).</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":null,"pages":null},"PeriodicalIF":5.1,"publicationDate":"2024-05-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140884352","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
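The time-and-channel attention idea above can be illustrated with a short, self-contained sketch. This is not the authors' ACNN: the three input confidence channels, the 13-genre output, the layer sizes, and the squeeze-style attention formulation are assumptions made for illustration; only the overall pattern (convolutions, then re-weighting of channels and time steps, then a genre head) follows the abstract.

```python
# Hedged sketch of a time-and-channel attention CNN; shapes and sizes are assumed.
import torch
import torch.nn as nn

class TimeChannelAttention(nn.Module):
    """Re-weights a (batch, channels, time) feature map along its channel and time axes."""
    def __init__(self, channels: int):
        super().__init__()
        # Channel attention: pool over time, then score each CNN kernel output.
        self.channel_fc = nn.Sequential(
            nn.Linear(channels, channels // 4), nn.ReLU(),
            nn.Linear(channels // 4, channels), nn.Sigmoid())
        # Time attention: score each time step from its channel vector.
        self.time_fc = nn.Sequential(nn.Linear(channels, 1), nn.Sigmoid())

    def forward(self, x):                        # x: (B, C, T)
        c_w = self.channel_fc(x.mean(dim=2))     # (B, C) channel weights
        x = x * c_w.unsqueeze(-1)                # emphasize useful kernel outputs
        t_w = self.time_fc(x.transpose(1, 2))    # (B, T, 1) time-step weights
        return x * t_w.transpose(1, 2)           # emphasize crucial time steps

class ACNNSketch(nn.Module):
    def __init__(self, in_channels=3, n_genres=13):   # channel/genre counts assumed
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_channels, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=5, padding=2), nn.ReLU())
        self.attn = TimeChannelAttention(128)
        self.head = nn.Linear(128, n_genres)

    def forward(self, x):                        # x: (B, in_channels, T) confidence sequences
        h = self.attn(self.conv(x))
        return self.head(h.mean(dim=2))          # pooled genre logits

logits = ACNNSketch()(torch.randn(2, 3, 500))    # e.g. 500-step speech-music confidences
print(logits.shape)                              # torch.Size([2, 13])
```

In this reading, the channel weights select which learned kernels matter for a given trailer while the time weights select which moments of the audio matter, matching the roles the abstract assigns to the two attention layers.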
EOGT: Video Anomaly Detection with Enhanced Object Information and Global Temporal Dependency
IF 5.1 | CAS Tier 3 | Computer Science
Ruoyan Pi, Peng Wu, Xiangteng He, Yuxin Peng
{"title":"EOGT: Video Anomaly Detection with Enhanced Object Information and Global Temporal Dependency","authors":"Ruoyan Pi, Peng Wu, Xiangteng He, Yuxin Peng","doi":"10.1145/3662185","DOIUrl":"https://doi.org/10.1145/3662185","url":null,"abstract":"<p>Video anomaly detection (VAD) aims to identify events or scenes in videos that deviate from typical patterns. Existing approaches primarily focus on reconstructing or predicting frames to detect anomalies and have shown improved performance in recent years. However, they often depend highly on local spatio-temporal information and face the challenge of insufficient object feature modeling. To address the above issues, this paper proposes a video anomaly detection framework with <b>E</b>nhanced <b>O</b>bject Information and <b>G</b>lobal <b>T</b>emporal Dependencies <b>(EOGT)</b> and the main novelties are: (1) A <b>L</b>ocal <b>O</b>bject <b>A</b>nomaly <b>S</b>tream <b>(LOAS)</b> is proposed to extract local multimodal spatio-temporal anomaly features at the object level. LOAS integrates two modules: a <b>D</b>iffusion-based <b>O</b>bject <b>R</b>econstruction <b>N</b>etwork <b>(DORN)</b> with multimodal conditions detects anomalies with object RGB information, and an <b>O</b>bject <b>P</b>ose <b>A</b>nomaly Refiner <b>(OPA)</b> discovers anomalies with human pose information. (2) A <b>G</b>lobal <b>T</b>emporal <b>S</b>trengthening <b>S</b>tream <b>(GTSS)</b> with video-level temporal dependencies is proposed, which leverages video-level temporal dependencies to identify long-term and video-specific anomalies effectively. Both streams are jointly employed in EOGT to learn multimodal and multi-scale spatio-temporal anomaly features for VAD, and we finally fuse the anomaly features and scores to detect anomalies at the frame level. Extensive experiments are conducted to verify the performance of EOGT on three public datasets: ShanghaiTech Campus, CUHK Avenue, and UCSD Ped2.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":null,"pages":null},"PeriodicalIF":5.1,"publicationDate":"2024-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140884353","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
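As a rough intuition for the last step the abstract mentions, fusing the outputs of the two streams into frame-level anomaly scores, here is a hypothetical sketch. The max-over-objects pooling, the min-max normalization, and the mixing weight alpha are all assumptions; the paper's actual fusion of anomaly features and scores is not specified in the abstract.

```python
# Hypothetical frame-level fusion of an object-level stream and a global temporal stream.
import numpy as np

def fuse_anomaly_scores(local_obj_scores, global_scores, alpha=0.6):
    """local_obj_scores: list over frames, each an array of per-object anomaly scores.
       global_scores: (T,) video-level temporal anomaly scores.
       Returns (T,) frame-level scores."""
    # A frame is as anomalous as its most anomalous object (a common VAD convention).
    local = np.array([s.max() if len(s) else 0.0 for s in local_obj_scores])

    def norm(x):                                   # min-max normalize each stream
        return (x - x.min()) / (x.max() - x.min() + 1e-8)

    return alpha * norm(local) + (1 - alpha) * norm(global_scores)

T = 8
local = [np.random.rand(np.random.randint(0, 4)) for _ in range(T)]   # variable object counts
global_s = np.random.rand(T)
print(fuse_anomaly_scores(local, global_s).round(3))
```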
The Price of Unlearning: Identifying Unlearning Risk in Edge Computing
IF 5.1 | CAS Tier 3 | Computer Science
Lefeng Zhang, Tianqing Zhu, Ping Xiong, Wanlei Zhou
{"title":"The Price of Unlearning: Identifying Unlearning Risk in Edge Computing","authors":"Lefeng Zhang, Tianqing Zhu, Ping Xiong, Wanlei Zhou","doi":"10.1145/3662184","DOIUrl":"https://doi.org/10.1145/3662184","url":null,"abstract":"<p>Machine unlearning is an emerging paradigm that aims to make machine learning models “forget” what they have learned about particular data. It fulfills the requirements of privacy legislation (e.g., GDPR), which stipulates that individuals have the autonomy to determine the usage of their personal data. However, alongside all the achievements, there are still loopholes in machine unlearning that may cause significant losses for the system, especially in edge computing. Edge computing is a distributed computing paradigm with the purpose of migrating data processing tasks closer to terminal devices. While various machine unlearning approaches have been proposed to erase the influence of data sample(s), we claim that it might be dangerous to directly apply them in the realm of edge computing. A malicious edge node may broadcast (possibly fake) unlearning requests to a target data sample (s) and then analyze the behavior of edge devices to infer useful information. In this paper, we exploited the vulnerabilities of current machine unlearning strategies in edge computing and proposed a new inference attack to highlight the potential privacy risk. Furthermore, we developed a defense method against this particular type of attack and proposed <i>the price of unlearning</i> (<i>PoU</i>) as a means to evaluate the inefficiency it brings to an edge computing system. We provide theoretical analyses to show the upper bound of the <i>PoU</i> using tools borrowed from game theory. The experimental results on real-world datasets demonstrate that the proposed defense strategy is effective and capable of preventing an adversary from deducing useful information.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":null,"pages":null},"PeriodicalIF":5.1,"publicationDate":"2024-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140884346","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
InteractNet: Social Interaction Recognition for Semantic-rich Videos
IF 5.1 | CAS Tier 3 | Computer Science
Yuanjie Lyu, Penggang Qin, Tong Xu, Chen Zhu, Enhong Chen
{"title":"InteractNet: Social Interaction Recognition for Semantic-rich Videos","authors":"Yuanjie Lyu, Penggang Qin, Tong Xu, Chen Zhu, Enhong Chen","doi":"10.1145/3663668","DOIUrl":"https://doi.org/10.1145/3663668","url":null,"abstract":"<p>The overwhelming surge of online video platforms has raised an urgent need for social interaction recognition techniques. Compared with simple short-term actions, long-term social interactions in semantic-rich videos could reflect more complicated semantics like character relationships or emotions, which will better support various downstream applications, e.g., story summarization and fine-grained clip retrieval. However, considering the longer duration of social interactions with severe mutual overlap, involving multiple characters, dynamic scenes and multi-modal cues, among other factors, traditional solutions for short-term action recognition may probably fail in this task. To address these challenges, in this paper, we propose a hierarchical graph-based system, named InteractNet, to recognize social interactions in a multi-modal perspective. Specifically, our approach first generates a semantic graph for each sampled frame with integrating multi-modal cues, and then learns the node representations as short-term interaction patterns via an adapted GCN module. Along this line, global interaction representations are accumulated through a sub-clip identification module, effectively filtering out irrelevant information and resolving temporal overlaps between interactions. In the end, the association among simultaneous interactions will be captured and modelled by constructing a global-level character-pair graph to predict the final social interactions. Comprehensive experiments on publicly available datasets demonstrate the effectiveness of our approach compared with state-of-the-art baseline methods.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":null,"pages":null},"PeriodicalIF":5.1,"publicationDate":"2024-05-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140830535","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
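The short-term step above, learning node representations on a per-frame semantic graph with an adapted GCN, can be pictured as one round of message passing. The node layout (characters plus cue nodes), the feature sizes, and the mean aggregation below are assumptions; this is a generic GCN layer, not the InteractNet module.

```python
# Generic GCN-style message passing over a per-frame semantic graph (illustrative only).
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        # x: (N, in_dim) node features; adj: (N, N) adjacency with self-loops.
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
        h = adj @ x / deg                  # mean-aggregate neighbor features
        return torch.relu(self.proj(h))    # short-term interaction pattern per node

nodes = torch.randn(5, 256)                # e.g. 3 character nodes + 2 scene/audio cue nodes
adj = torch.eye(5)
adj[0, 1] = adj[1, 0] = 1.0                # two characters interacting in this frame
frame_repr = GCNLayer(256, 128)(nodes, adj)
print(frame_repr.shape)                    # torch.Size([5, 128])
```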
Towards Retrieval-Augmented Architectures for Image Captioning
IF 5.1 | CAS Tier 3 | Computer Science
Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Alessandro Nicolosi, Rita Cucchiara
{"title":"Towards Retrieval-Augmented Architectures for Image Captioning","authors":"Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Alessandro Nicolosi, Rita Cucchiara","doi":"10.1145/3663667","DOIUrl":"https://doi.org/10.1145/3663667","url":null,"abstract":"<p>The objective of image captioning models is to bridge the gap between the visual and linguistic modalities by generating natural language descriptions that accurately reflect the content of input images. In recent years, researchers have leveraged deep learning-based models and made advances in the extraction of visual features and the design of multimodal connections to tackle this task. This work presents a novel approach towards developing image captioning models that utilize an external <i>k</i>NN memory to improve the generation process. Specifically, we propose two model variants that incorporate a knowledge retriever component that is based on visual similarities, a differentiable encoder to represent input images, and a <i>k</i>NN-augmented language model to predict tokens based on contextual cues and text retrieved from the external memory. We experimentally validate our approach on COCO and nocaps datasets and demonstrate that incorporating an explicit external memory can significantly enhance the quality of captions, especially with a larger retrieval corpus. This work provides valuable insights into retrieval-augmented captioning models and opens up new avenues for improving image captioning at a larger scale.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":null,"pages":null},"PeriodicalIF":5.1,"publicationDate":"2024-05-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140830966","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
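The kNN-augmented language model can be pictured as interpolating the captioner's next-token distribution with a distribution built from tokens retrieved from the external memory. The sketch below follows the generic kNN-LM recipe and only approximates the idea; the interpolation weight lam, the distance kernel, and the toy vocabulary are assumptions, not the authors' design.

```python
# Generic kNN-augmented next-token prediction (kNN-LM-style interpolation).
import numpy as np

def knn_augmented_distribution(lm_probs, retrieved_tokens, distances,
                               vocab_size, lam=0.3, temperature=1.0):
    """lm_probs: (V,) softmax from the captioner; retrieved_tokens: token ids from the
       k nearest memory entries; distances: their retrieval distances."""
    weights = np.exp(-np.asarray(distances) / temperature)
    weights /= weights.sum()
    knn_probs = np.zeros(vocab_size)
    for tok, w in zip(retrieved_tokens, weights):
        knn_probs[tok] += w                        # closer neighbors count more
    return (1 - lam) * lm_probs + lam * knn_probs  # interpolated next-token distribution

V = 10
lm = np.full(V, 1.0 / V)                           # uniform captioner just for the demo
mixed = knn_augmented_distribution(lm, retrieved_tokens=[2, 2, 7],
                                   distances=[0.1, 0.4, 0.9], vocab_size=V)
print(mixed.argmax(), round(float(mixed.sum()), 6))  # token 2 is boosted; still sums to 1
```

A larger, better-matched retrieval corpus concentrates this retrieved distribution on relevant tokens, which is one intuition for the abstract's observation that caption quality improves with a bigger memory.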
Efficient Decoding of Affective States from Video-elicited EEG Signals: An Empirical Investigation
IF 5.1 | CAS Tier 3 | Computer Science
Kayhan Latifzadeh, Nima Gozalpour, V. Javier Traver, Tuukka Ruotsalo, Aleksandra Kawala-Sterniuk, Luis A Leiva
{"title":"Efficient Decoding of Affective States from Video-elicited EEG Signals: An Empirical Investigation","authors":"Kayhan Latifzadeh, Nima Gozalpour, V. Javier Traver, Tuukka Ruotsalo, Aleksandra Kawala-Sterniuk, Luis A Leiva","doi":"10.1145/3663669","DOIUrl":"https://doi.org/10.1145/3663669","url":null,"abstract":"<p>Affect decoding through brain-computer interfacing (BCI) holds great potential to capture users’ feelings and emotional responses via non-invasive electroencephalogram (EEG) sensing. Yet, little research has been conducted to understand <i>efficient</i> decoding when users are exposed to <i>dynamic</i> audiovisual contents. In this regard, we study EEG-based affect decoding from videos in arousal and valence classification tasks, considering the impact of signal length, window size for feature extraction, and frequency bands. We train both classic Machine Learning models (SVMs and <i>k</i>-NNs) and modern Deep Learning models (FCNNs and GTNs). Our results show that: (1) affect can be effectively decoded using less than 1 minute of EEG signal; (2) temporal windows of 6 and 10 seconds provide the best classification performance for classic Machine Learning models but Deep Learning models benefit from much shorter windows of 2 seconds; and (3) any model trained on the Beta band alone achieves similar (sometimes better) performance than when trained on all frequency bands. Taken together, our results indicate that affect decoding can work in more realistic conditions than currently assumed, thus becoming a viable technology for creating better interfaces and user models.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":null,"pages":null},"PeriodicalIF":5.1,"publicationDate":"2024-05-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140830722","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
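The classic-ML setting described above (band-limited features over short temporal windows, fed to an SVM) fits in a few lines. The sketch below uses synthetic signals, an assumed 128 Hz sampling rate, and plain FFT band power; it illustrates the Beta-band, 2-second-window configuration rather than reproducing the paper's exact pipeline.

```python
# Beta-band power over short EEG windows, classified with an SVM (synthetic data).
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

FS = 128                                   # sampling rate in Hz (assumed)

def beta_band_power(window):
    """window: (n_channels, n_samples) -> per-channel Beta-band (13-30 Hz) power."""
    freqs = np.fft.rfftfreq(window.shape[1], d=1.0 / FS)
    psd = np.abs(np.fft.rfft(window, axis=1)) ** 2
    beta = (freqs >= 13) & (freqs <= 30)
    return psd[:, beta].mean(axis=1)

rng = np.random.default_rng(0)
n_trials, n_channels, win_sec = 120, 32, 2            # 2-second windows
X = np.stack([beta_band_power(rng.standard_normal((n_channels, FS * win_sec)))
              for _ in range(n_trials)])
y = rng.integers(0, 2, n_trials)                       # binary arousal labels (synthetic)
print(cross_val_score(SVC(kernel="rbf"), X, y, cv=5).mean())  # chance level on pure noise
```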
Gloss-driven Conditional Diffusion Models for Sign Language Production
IF 5.1 | CAS Tier 3 | Computer Science
Shengeng Tang, Feng Xue, Jingjing Wu, Shuo Wang, Richang Hong
{"title":"Gloss-driven Conditional Diffusion Models for Sign Language Production","authors":"Shengeng Tang, Feng Xue, Jingjing Wu, Shuo Wang, Richang Hong","doi":"10.1145/3663572","DOIUrl":"https://doi.org/10.1145/3663572","url":null,"abstract":"<p>Sign Language Production (SLP) aims to convert text or audio sentences into sign language videos corresponding to their semantics, which is challenging due to the diversity and complexity of sign languages, and cross-modal semantic mapping issues. In this work, we propose a Gloss-driven Conditional Diffusion Model (GCDM) for SLP. The core of the GCDM is a diffusion model architecture, in which the sign gloss sequence is encoded by a Transformer-based encoder and input into the diffusion model as a semantic prior condition. In the process of sign pose generation, the textual semantic priors carried in the encoded gloss features are integrated into the embedded Gaussian noise via cross-attention. Subsequently, the model converts the fused features into sign language pose sequences through T-round denoising steps. During the training process, the model uses the ground-truth labels of sign poses as the starting point, generates Gaussian noise through T rounds of noise, and then performs T rounds of denoising to approximate the real sign language gestures. The entire process is constrained by the MAE loss function to ensure that the generated sign language gestures are as close as possible to the real labels. In the inference phase, the model directly randomly samples a set of Gaussian noise, generates multiple sign language gesture sequence hypotheses under the guidance of the gloss sequence, and outputs a high-confidence sign language gesture video by averaging multiple hypotheses. Experimental results on the Phoenix2014T dataset show that the proposed GCDM method achieves competitiveness in both quantitative performance and qualitative visualization.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":null,"pages":null},"PeriodicalIF":5.1,"publicationDate":"2024-05-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140830931","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
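A single training step of a gloss-conditioned denoiser can be sketched as follows. This is a toy illustration, not the GCDM implementation: the pose dimensionality, the simplified noise schedule, and the plain nn.MultiheadAttention cross-attention are assumptions; only the overall shape follows the abstract: noise the ground-truth poses, denoise under gloss conditioning, and constrain with an L1/MAE loss.

```python
# Toy training step of a gloss-conditioned diffusion denoiser for sign-pose sequences.
import torch
import torch.nn as nn

class GlossConditionedDenoiser(nn.Module):
    def __init__(self, pose_dim=150, d_model=256):       # e.g. 50 keypoints x 3 (assumed)
        super().__init__()
        self.pose_in = nn.Linear(pose_dim, d_model)
        self.t_embed = nn.Embedding(1000, d_model)         # diffusion-step embedding
        self.cross_attn = nn.MultiheadAttention(d_model, 4, batch_first=True)
        self.out = nn.Linear(d_model, pose_dim)

    def forward(self, noisy_pose, t, gloss_feats):
        # noisy_pose: (B, T, pose_dim); gloss_feats: (B, L, d_model) from a gloss encoder.
        h = self.pose_in(noisy_pose) + self.t_embed(t)[:, None, :]
        h, _ = self.cross_attn(h, gloss_feats, gloss_feats)  # inject gloss semantics
        return self.out(h)                                   # predicted clean pose sequence

B, T, L = 2, 40, 8
pose = torch.randn(B, T, 150)                 # ground-truth keypoint sequence
gloss = torch.randn(B, L, 256)                # encoded gloss tokens (assumed precomputed)
t = torch.randint(0, 1000, (B,))
alpha_bar = torch.rand(B, 1, 1)               # stand-in noise-schedule value per sample
noisy = alpha_bar.sqrt() * pose + (1 - alpha_bar).sqrt() * torch.randn_like(pose)
pred = GlossConditionedDenoiser()(noisy, t, gloss)
loss = nn.functional.l1_loss(pred, pose)      # MAE constraint toward the real sign poses
loss.backward()
print(float(loss))
```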
Exploiting Instance-level Relationships in Weakly Supervised Text-to-Video Retrieval
IF 5.1 | CAS Tier 3 | Computer Science
Shukang Yin, Sirui Zhao, Hao Wang, Tong Xu, Enhong Chen
{"title":"Exploiting Instance-level Relationships in Weakly Supervised Text-to-Video Retrieval","authors":"Shukang Yin, Sirui Zhao, Hao Wang, Tong Xu, Enhong Chen","doi":"10.1145/3663571","DOIUrl":"https://doi.org/10.1145/3663571","url":null,"abstract":"<p>Text-to-Video Retrieval is a typical cross-modal retrieval task that has been studied extensively under a conventional supervised setting. Recently, some works have sought to extend the problem to a weakly supervised formulation, which can be more consistent with real-life scenarios and more efficient in annotation cost. In this context, a new task called Partially Relevant Video Retrieval (PRVR) is proposed, which aims to retrieve videos that are partially relevant to a given textual query, i.e., the videos containing at least one semantically relevant moment. Formulating the task as a Multiple Instance Learning (MIL) ranking problem, prior arts rely on heuristics algorithms such as a simple greedy search strategy and deal with each query independently. Although these early explorations have achieved decent performance, they may not fully utilize the bag-level label and only consider the local optimum, which could result in suboptimal solutions and inferior final retrieval performance. To address this problem, in this paper, we propose to exploit the relationships between instances to boost retrieval performance. Based on this idea, we creatively put forward: 1) a new matching scheme for pairing queries and their related moments in the video; 2) a new loss function to facilitate cross-modal alignment between two views of an instance. Extensive validations on three publicly available datasets have demonstrated the effectiveness of our solution and verified our hypothesis that modeling instance-level relationships is beneficial in the MIL ranking setting. Our code will be publicly available at https://github.com/xjtupanda/BGM-Net.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":null,"pages":null},"PeriodicalIF":5.1,"publicationDate":"2024-05-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140842306","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
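The MIL ranking formulation mentioned above has a compact core: a video is a bag of clip (instance) embeddings, a query is scored against the bag through its best-matching instance, and a margin loss ranks a partially relevant video above an irrelevant one. The sketch below shows only this baseline view, with assumed dimensions and margin; the paper's instance-relationship modeling (BGM-Net) is not reproduced.

```python
# Baseline MIL ranking view of partially relevant video retrieval (illustrative only).
import torch
import torch.nn.functional as F

def bag_similarity(query, clip_feats):
    # query: (D,); clip_feats: (num_clips, D). Partial relevance = best-matching clip.
    sims = F.cosine_similarity(query[None, :], clip_feats, dim=1)
    return sims.max()

D = 128
query = torch.randn(D)
pos_video = torch.randn(12, D)                # contains at least one relevant moment
neg_video = torch.randn(12, D)
s_pos, s_neg = bag_similarity(query, pos_video), bag_similarity(query, neg_video)
loss = F.relu(0.2 + s_neg - s_pos)            # margin ranking objective (margin assumed)
print(float(s_pos), float(s_neg), float(loss))
```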
Learning Commonsense-aware Moment-Text Alignment for Fast Video Temporal Grounding
IF 5.1 | CAS Tier 3 | Computer Science
Ziyue Wu, Junyu Gao, Shucheng Huang, Changsheng Xu
{"title":"Learning Commonsense-aware Moment-Text Alignment for Fast Video Temporal Grounding","authors":"Ziyue Wu, Junyu Gao, Shucheng Huang, Changsheng Xu","doi":"10.1145/3663368","DOIUrl":"https://doi.org/10.1145/3663368","url":null,"abstract":"<p>Grounding temporal video segments described in natural language queries effectively and efficiently is a crucial capability needed in vision-and-language fields. In this paper, we deal with the fast video temporal grounding (FVTG) task, aiming at localizing the target segment with high speed and favorable accuracy. Most existing approaches adopt elaborately designed cross-modal interaction modules to improve the grounding performance, which suffer from the test-time bottleneck. Although several common space-based methods enjoy the high-speed merit during inference, they can hardly capture the comprehensive and explicit relations between visual and textual modalities. In this paper, to tackle the dilemma of speed-accuracy tradeoff, we propose a commonsense-aware cross-modal alignment network (C<sub>2</sub>AN), which incorporates commonsense-guided visual and text representations into a complementary common space for fast video temporal grounding. Specifically, the commonsense concepts are explored and exploited by extracting the structural semantic information from a language corpus. Then, a commonsense-aware interaction module is designed to obtain bridged visual and text features by utilizing the learned commonsense concepts. Finally, to maintain the original semantic information of textual queries, a cross-modal complementary common space is optimized to obtain matching scores for performing FVTG. Extensive results on two challenging benchmarks show that our C<sub>2</sub>AN method performs favorably against state-of-the-arts while running at high speed. Our code is available at https://github.com/ZiyueWu59/CCA.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":null,"pages":null},"PeriodicalIF":5.1,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140830533","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
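The speed argument for common-space methods is that moment embeddings can be computed offline, so answering a new query reduces to a single matrix product. The sketch below shows only that retrieval pattern; the bank size, embedding dimension, and cosine scoring are assumptions, and none of the commonsense-aware components of C2AN are modeled.

```python
# Fast grounding in a precomputed common space: one matrix product per query.
import numpy as np

rng = np.random.default_rng(0)
moment_bank = rng.standard_normal((10_000, 256))             # precomputed moment embeddings
moment_bank /= np.linalg.norm(moment_bank, axis=1, keepdims=True)

def ground_query(query_vec, top_k=5):
    q = query_vec / np.linalg.norm(query_vec)
    scores = moment_bank @ q                                  # cosine scores, no per-pair network
    return np.argsort(-scores)[:top_k]

print(ground_query(rng.standard_normal(256)))                 # indices of best-matching moments
```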
AGAR - Attention Graph-RNN for Adaptative Motion Prediction of Point Clouds of Deformable Objects
IF 5.1 | CAS Tier 3 | Computer Science
Pedro de Medeiros Gomes, Silvia Rossi, Laura Toni
{"title":"AGAR - Attention Graph-RNN for Adaptative Motion Prediction of Point Clouds of Deformable Objects","authors":"Pedro de Medeiros Gomes, Silvia Rossi, Laura Toni","doi":"10.1145/3662183","DOIUrl":"https://doi.org/10.1145/3662183","url":null,"abstract":"<p>This paper focuses on motion prediction for point cloud sequences in the challenging case of deformable 3D objects, such as human body motion. First, we investigate the challenges caused by deformable shapes and complex motions present in this type of representation, with the ultimate goal of understanding the technical limitations of state-of-the-art models. From this understanding, we propose an improved architecture for point cloud prediction of deformable 3D objects. Specifically, to handle deformable shapes, we propose a graph-based approach that learns and exploits the spatial structure of point clouds to extract more representative features. Then, we propose a module able to combine the learned features in a <i>adaptative</i> manner according to the point cloud movements. The proposed adaptative module controls the composition of local and global motions for each point, enabling the network to model complex motions in deformable 3D objects more effectively. We tested the proposed method on the following datasets: MNIST moving digits, the <i>Mixamo</i> human bodies motions [15], JPEG [5] and CWIPC-SXR [32] real-world dynamic bodies. Simulation results demonstrate that our method outperforms the current baseline methods given its improved ability to model complex movements as well as preserve point cloud shape. Furthermore, we demonstrate the generalizability of the proposed framework for dynamic feature learning by testing the framework for action recognition on the MSRAction3D dataset [19] and achieving results on par with state-of-the-art methods.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":null,"pages":null},"PeriodicalIF":5.1,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140830793","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
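The adaptative module above, which controls the per-point composition of local and global motion, can be pictured as a learned gate. The following is a hypothetical sketch with assumed feature sizes and a sigmoid gate; it is not the AGAR architecture, and how the local and global motion features are produced (the attention Graph-RNN itself) is not modeled here.

```python
# Hypothetical per-point gate mixing local and global motion features.
import torch
import torch.nn as nn

class AdaptiveMotionMixer(nn.Module):
    def __init__(self, feat_dim=64):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * feat_dim, feat_dim), nn.Sigmoid())
        self.to_offset = nn.Linear(feat_dim, 3)         # predicted per-point displacement

    def forward(self, local_feat, global_feat):
        # local_feat: (B, N, F) per-point neighborhood motion features.
        # global_feat: (B, F) object-level motion feature shared by all points.
        g = global_feat[:, None, :].expand_as(local_feat)
        gate = self.gate(torch.cat([local_feat, g], dim=-1))  # per-point mixing weights
        mixed = gate * local_feat + (1 - gate) * g             # adaptive composition
        return self.to_offset(mixed)                           # (B, N, 3) motion prediction

B, N, Fdim = 2, 1024, 64
offsets = AdaptiveMotionMixer(Fdim)(torch.randn(B, N, Fdim), torch.randn(B, Fdim))
print(offsets.shape)                                           # torch.Size([2, 1024, 3])
```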