{"title":"Merging Multiple Template Matching Predictions in Intra Coding with Attentive Convolutional Neural Network","authors":"Qijun Wang, Guodong Zheng","doi":"10.1145/3474085.3475359","DOIUrl":"https://doi.org/10.1145/3474085.3475359","url":null,"abstract":"In intra coding, template matching prediction is an effective method to reduce the non-local redundancy inside image content. However, the prediction indicated by the best template matching is not always the actually best prediction. To solve this problem, we propose a method, which merges multiple template matching predictions through a convolutional neural network with attention module. The convolutional neural network aims at exploring different combinations of the candidate template matching predictions, and the attention module focuses on determining the most significant prediction candidate. Besides, the spatial module in attention mechanism can be utilized to model the relationship between the original pixels in current block and the reconstructed pixels in adjacent regions (template). Compared to the directional intra prediction and traditional template matching prediction, our method can provide a unified framework to generate prediction with high accuracy. The experimental results show that, compared the averaging strategy, the BD-rate reductions can reach up to 4.7%, 5.5% and 18.3% on the classic standard sequences (classB-classF), SIQAD dataset (screen content), and Urban100 dataset (natural scenes) respectively, while the average bit rate saving are 0.5%, 2.7% and 1.8%, respectively.","PeriodicalId":357468,"journal":{"name":"Proceedings of the 29th ACM International Conference on Multimedia","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114986823","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Information-Growth Attention Network for Image Super-Resolution","authors":"Zhuangzi Li, Ge Li, Thomas H. Li, Shan Liu, Wei Gao","doi":"10.1145/3474085.3475207","DOIUrl":"https://doi.org/10.1145/3474085.3475207","url":null,"abstract":"It is generally known that a high-resolution (HR) image contains more productive information compared with its low-resolution (LR) versions, so image super-resolution (SR) satisfies an information-growth process. Considering the property, we attempt to exploit the growing information via a particular attention mechanism. In this paper, we propose a concise but effective Information-Growth Attention Network (IGAN) that shows the incremental information is beneficial for SR. Specifically, a novel information-growth attention is proposed. It aims to pay attention to features involving large information-growth capacity by assimilating the difference from current features to the former features within a network. We also illustrate its effectiveness contrasted by widely-used self-attention using entropy and generalization analysis. Furthermore, existing channel-wise attention generation modules (CAGMs) have large informational attenuation due to directly calculating global mean for feature maps. Therefore, we present an innovative CAGM that progressively decreases feature maps' sizes, leading to more adequate feature exploitation. Extensive experiments also demonstrate IGAN outperforms state-of-the-art attention-aware SR approaches.","PeriodicalId":357468,"journal":{"name":"Proceedings of the 29th ACM International Conference on Multimedia","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115148779","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Heraclitus's Forest: An Interactive Artwork for Oral History","authors":"Lin Wang, Zhonghao Lin, Wei Cai","doi":"10.1145/3474085.3478544","DOIUrl":"https://doi.org/10.1145/3474085.3478544","url":null,"abstract":"Heraclitus's Forest is an interactive artwork that utilizes birch trees as a metaphor for the life stories recorded in an oral history database. We design a day/night cycle system to present the forest experience along the time elapse, multiple interaction modes to engage audiences' participation in history exploration, and evolving forest to arouse people's reflection on the feature of history, which is constantly being constructed but can never be returned to.","PeriodicalId":357468,"journal":{"name":"Proceedings of the 29th ACM International Conference on Multimedia","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115231307","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Transformer-based Feature Reconstruction Network for Robust Multimodal Sentiment Analysis","authors":"Ziqi Yuan, Wei Li, Hua Xu, Wenmeng Yu","doi":"10.1145/3474085.3475585","DOIUrl":"https://doi.org/10.1145/3474085.3475585","url":null,"abstract":"Improving robustness against data missing has become one of the core challenges in Multimodal Sentiment Analysis (MSA), which aims to judge speaker sentiments from the language, visual, and acoustic signals. In the current research, translation-based methods and tensor regularization methods are proposed for MSA with incomplete modality features. However, both of them fail to cope with random modality feature missing in non-aligned sequences. In this paper, a transformer-based feature reconstruction network (TFR-Net) is proposed to improve the robustness of models for the random missing in non-aligned modality sequences. First, intra-modal and inter-modal attention-based extractors are adopted to learn robust representations for each element in modality sequences. Then, a reconstruction module is proposed to generate the missing modality features. With the supervision of SmoothL1Loss between generated and complete sequences, TFR-Net is expected to learn semantic-level features corresponding to missing features. Extensive experiments on two public benchmark datasets show that our model achieves good results against data missing across various missing modality combinations and various missing degrees.","PeriodicalId":357468,"journal":{"name":"Proceedings of the 29th ACM International Conference on Multimedia","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115481106","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Joint Learning for Relationship and Interaction Analysis in Video with Multimodal Feature Fusion","authors":"Beibei Zhang, Fan Yu, Yanxin Gao, Tongwei Ren, Gangshan Wu","doi":"10.1145/3474085.3479214","DOIUrl":"https://doi.org/10.1145/3474085.3479214","url":null,"abstract":"To comprehend long duration videos, the deep video understanding (DVU) task is proposed to recognize interactions on scene level and relationships on movie level and answer questions on these two levels. In this paper, we propose a solution to the DVU task which applies joint learning of interaction and relationship prediction and multimodal feature fusion. Our solution handles the DVU task with three joint learning sub-tasks: scene sentiment classification, scene interaction recognition and super-scene video relationship recognition, all of which utilize text features, visual features and audio features, and predict representations in semantic space. Since sentiment, interaction and relationship are related to each other, we train a unified framework with joint learning. Then, we answer questions for video analysis in DVU according to the results of the three sub-tasks. We conduct experiments on the HLVU dataset to evaluate the effectiveness of our method.","PeriodicalId":357468,"journal":{"name":"Proceedings of the 29th ACM International Conference on Multimedia","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116912328","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ZoomSense: A Scalable Infrastructure for Augmenting Zoom","authors":"Tom Bartindale, Peter Chen, Harrison Marshall, Stanislav Pozdniakov, D. Richardson","doi":"10.1145/3474085.3478332","DOIUrl":"https://doi.org/10.1145/3474085.3478332","url":null,"abstract":"We have seen a dramatic increase in the adoption of teleconferencing systems such as Zoom for remote teaching and working. Although designed primarily for traditional video conferencing scenarios, these platforms are actually being deployed in many diverse contexts. As such, Zoom offers little to aid hosts' understanding of attendee participation and often hinders participant agency. We introduce ZoomSense : an open-source, scalable infrastructure built upon 'virtual meeting participants', which exposes real-time meta-data, meeting content and host controls through an easy to use abstraction - so that developers can rapidly and sustainably augment Zoom.","PeriodicalId":357468,"journal":{"name":"Proceedings of the 29th ACM International Conference on Multimedia","volume":"4 4","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120915028","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SVHAN: Sequential View Based Hierarchical Attention Network for 3D Shape Recognition","authors":"Yue Zhao, Weizhi Nie, Anan Liu, Zan Gao, Yuting Su","doi":"10.1145/3474085.3475371","DOIUrl":"https://doi.org/10.1145/3474085.3475371","url":null,"abstract":"As an important field of multimedia, 3D shape recognition has attracted much research attention in recent years. A lot of deep learning models have been proposed for effective 3D shape representation. The view-based methods show the superiority due to the comprehensive exploration of the visual characteristics with the help of established 2D CNN architectures. Generally, the current approaches contain the following disadvantages: First, the most majority of methods lack the consideration for sequential information among the multiple views, which can provide descriptive characteristics for shape representation. Second, the incomprehensive exploration for the multi-view correlations directly affects the discrimination of shape descriptors. Finally, roughly aggregating multi-view features leads to the loss of descriptive information, which limits the shape representation effectiveness. To handle these issues, we propose a novel sequential view based hierarchical attention network (SVHAN) for 3D shape recognition. Specifically, we first divide the view sequence into several view blocks. Then, we introduce a novel hierarchical feature aggregation module (HFAM), which hierarchically exploits the view-level, block-level, and shape-level features, the intra- and inter- view-block correlations are also captured to improve the discrimination of learned features. Subsequently, a novel selective fusion module (SFM) is designed for feature aggregation, considering the correlations between different levels and preserving effective information. Finally, discriminative and informative shape descriptors are generated for the recognition task. We validate the effectiveness of our proposed method on two public databases. The experimental results show the superiority of SVHAN against the current state-of-the-art approaches.","PeriodicalId":357468,"journal":{"name":"Proceedings of the 29th ACM International Conference on Multimedia","volume":"122 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127356571","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"M3TR: Multi-modal Multi-label Recognition with Transformer","authors":"Jiawei Zhao, Yifan Zhao, Jia Li","doi":"10.1145/3474085.3475191","DOIUrl":"https://doi.org/10.1145/3474085.3475191","url":null,"abstract":"Multi-label image recognition aims to recognize multiple objects simultaneously in one image. Recent ideas to solve this problem have focused on learning dependencies of label co-occurrences to enhance the high-level semantic representations. However, these methods usually neglect the important relations of intrinsic visual structures and face difficulties in understanding contextual relationships. To build the global scope of visual context as well as interactions between visual modality and linguistic modality, we propose the Multi-Modal Multi-label recognition TRansformers (M3TR) with the ternary relationship learning for inter-and intra-modalities. For the intra-modal relationship, we make insightful conjunction of CNNs and Transformers, which embeds visual structures into the high-level features by learning the semantic cross-attention. For constructing the interactions between the visual and linguistic modalities, we propose a linguistic cross-attention to embed the class-wise linguistic information into the visual structure learning, and finally present a linguistic guided enhancement module to enhance the representation of high-level semantics. Experimental evidence reveals that with the collaborative learning of ternary relationship, our proposed M3TR achieves new state-of-the-art on two public multi-label recognition benchmarks.","PeriodicalId":357468,"journal":{"name":"Proceedings of the 29th ACM International Conference on Multimedia","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125911448","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multimodal Relation Extraction with Efficient Graph Alignment","authors":"Changmeng Zheng, Junhao Feng, Ze Fu, Yiru Cai, Qing Li, Tao Wang","doi":"10.1145/3474085.3476968","DOIUrl":"https://doi.org/10.1145/3474085.3476968","url":null,"abstract":"Relation extraction (RE) is a fundamental process in constructing knowledge graphs. However, previous methods on relation extraction suffer sharp performance decline in short and noisy social media texts due to a lack of contexts. Fortunately, the related visual contents (objects and their relations) in social media posts can supplement the missing semantics and help to extract relations precisely. We introduce the multimodal relation extraction (MRE), a task that identifies textual relations with visual clues. To tackle this problem, we present a large-scale dataset which contains 15000+ sentences with 23 pre-defined relation categories. Considering that the visual relations among objects are corresponding to textual relations, we develop a dual graph alignment method to capture this correlation for better performance. Experimental results demonstrate that visual contents help to identify relations more precisely against the text-only baselines. Besides, our alignment method can find the correlations between vision and language, resulting in better performance. Our dataset and code are available at https://github.com/thecharm/Mega.","PeriodicalId":357468,"journal":{"name":"Proceedings of the 29th ACM International Conference on Multimedia","volume":"76 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123223809","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Text2Video: Automatic Video Generation Based on Text Scripts","authors":"Yipeng Yu, Zirui Tu, Longyu Lu, Xiao Chen, Hui Zhan, Zixun Sun","doi":"10.1145/3474085.3478548","DOIUrl":"https://doi.org/10.1145/3474085.3478548","url":null,"abstract":"To make video creation simpler, in this paper we present Text2Video, a novel system to automatically produce videos using only text-editing for novice users. Given an input text script, the director-like system can generate game-related engaging videos which illustrate the given narrative, provide diverse multi-modal content, and follow video editing guidelines. The system involves five modules: (1) A material manager extracts highlights from raw live game videos, and tags each video highlight, image and audio with labels. (2) A natural language processor extracts entities and semantics from the input text scripts. (3) A refined cross-modal retrieval searches for matching candidate shots from the material manager. (4) A text to speech speaker reads the processed text scripts with synthesized human voice. (5) The selected material shots and synthesized speech are assembled artistically through appropriate video editing techniques.","PeriodicalId":357468,"journal":{"name":"Proceedings of the 29th ACM International Conference on Multimedia","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114935070","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}