Proceedings of the 29th ACM International Conference on Multimedia: Latest Publications

Merging Multiple Template Matching Predictions in Intra Coding with Attentive Convolutional Neural Network
Proceedings of the 29th ACM International Conference on Multimedia · Pub Date: 2021-10-17 · DOI: 10.1145/3474085.3475359
Qijun Wang, Guodong Zheng
In intra coding, template matching prediction is an effective method to reduce the non-local redundancy inside image content. However, the prediction indicated by the best template match is not always the actually best prediction. To solve this problem, we propose a method that merges multiple template matching predictions through a convolutional neural network with an attention module. The convolutional neural network explores different combinations of the candidate template matching predictions, and the attention module focuses on determining the most significant prediction candidate. In addition, the spatial module in the attention mechanism can be utilized to model the relationship between the original pixels in the current block and the reconstructed pixels in adjacent regions (the template). Compared to directional intra prediction and traditional template matching prediction, our method provides a unified framework for generating predictions with high accuracy. The experimental results show that, compared with the averaging strategy, the BD-rate reductions reach up to 4.7%, 5.5% and 18.3% on the classic standard sequences (classB-classF), the SIQAD dataset (screen content), and the Urban100 dataset (natural scenes), respectively, while the average bit-rate savings are 0.5%, 2.7% and 1.8%, respectively.
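The merging step described in the abstract can be illustrated with a minimal sketch: candidate predictions are combined with softmax-normalized attention weights. The function names, shapes, and scores below are hypothetical, and the paper's learned CNN and attention modules are replaced by a plain weighted sum.

```python
import numpy as np

def softmax(x, axis=0):
    """Numerically stable softmax."""
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def merge_predictions(candidates, scores):
    """Merge K candidate template-matching predictions of a block.

    candidates: (K, H, W) array of candidate predictions.
    scores:     (K,) relevance scores (standing in for an attention module).
    Returns the attention-weighted merged prediction of shape (H, W).
    """
    weights = softmax(scores)                         # (K,)
    return np.tensordot(weights, candidates, axes=1)  # (H, W)

# Three hypothetical 4x4 candidate predictions, the second scored highest.
cands = np.stack([np.full((4, 4), v) for v in (10.0, 20.0, 30.0)])
merged = merge_predictions(cands, scores=np.array([0.1, 2.0, 0.1]))
```

Because the outer candidates receive equal weight here, the merged block is dominated by the second candidate while still blending in the others.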
Citations: 1
Information-Growth Attention Network for Image Super-Resolution
Pub Date: 2021-10-17 · DOI: 10.1145/3474085.3475207
Zhuangzi Li, Ge Li, Thomas H. Li, Shan Liu, Wei Gao
It is generally known that a high-resolution (HR) image contains more information than its low-resolution (LR) versions, so image super-resolution (SR) is an information-growth process. Considering this property, we attempt to exploit the growing information via a particular attention mechanism. In this paper, we propose a concise but effective Information-Growth Attention Network (IGAN), which shows that incremental information is beneficial for SR. Specifically, a novel information-growth attention is proposed. It attends to features with large information-growth capacity by assimilating the difference between the current features and the former features within a network. We also illustrate its effectiveness, in contrast to widely used self-attention, using entropy and generalization analysis. Furthermore, existing channel-wise attention generation modules (CAGMs) suffer large information attenuation because they directly compute the global mean of feature maps. We therefore present an innovative CAGM that progressively decreases feature-map sizes, leading to more adequate feature exploitation. Extensive experiments also demonstrate that IGAN outperforms state-of-the-art attention-aware SR approaches.
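The core idea of gating current features by their growth over earlier features can be sketched as follows; this is an illustrative simplification with hypothetical shapes, replacing the paper's learned attention layers with an element-wise sigmoid gate on the feature difference.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def information_growth_attention(former, current):
    """Gate current features by their growth over former features.

    The difference between the current and former feature maps produces an
    attention map that emphasises positions with large information growth.
    Shapes: (C, H, W).
    """
    growth = current - former   # incremental information
    attn = sigmoid(growth)      # per-position gate in (0, 1)
    return current * attn

former = np.zeros((2, 3, 3))
current = np.ones((2, 3, 3))
out = information_growth_attention(former, current)
```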
Citations: 8
Heraclitus's Forest: An Interactive Artwork for Oral History
Pub Date: 2021-10-17 · DOI: 10.1145/3474085.3478544
Lin Wang, Zhonghao Lin, Wei Cai
Heraclitus's Forest is an interactive artwork that uses birch trees as a metaphor for the life stories recorded in an oral history database. We design a day/night cycle system to present the forest experience as time elapses, multiple interaction modes to engage audiences in history exploration, and an evolving forest to prompt reflection on the nature of history, which is constantly being constructed but can never be returned to.
Citations: 1
Transformer-based Feature Reconstruction Network for Robust Multimodal Sentiment Analysis
Pub Date: 2021-10-17 · DOI: 10.1145/3474085.3475585
Ziqi Yuan, Wei Li, Hua Xu, Wenmeng Yu
Improving robustness against missing data has become one of the core challenges in Multimodal Sentiment Analysis (MSA), which aims to judge speaker sentiment from language, visual, and acoustic signals. In current research, translation-based methods and tensor regularization methods have been proposed for MSA with incomplete modality features. However, both fail to cope with random missing of modality features in non-aligned sequences. In this paper, a transformer-based feature reconstruction network (TFR-Net) is proposed to improve model robustness against random missing features in non-aligned modality sequences. First, intra-modal and inter-modal attention-based extractors are adopted to learn robust representations for each element in the modality sequences. Then, a reconstruction module is proposed to generate the missing modality features. With the supervision of a SmoothL1 loss between the generated and complete sequences, TFR-Net is expected to learn semantic-level features corresponding to the missing features. Extensive experiments on two public benchmark datasets show that our model performs well against missing data across various missing-modality combinations and missing degrees.
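The SmoothL1 supervision mentioned above has a standard closed form: quadratic for small errors and linear for large ones. A minimal NumPy version is sketched below; the `beta` threshold parameter is the conventional one and is an assumption here, not taken from the paper.

```python
import numpy as np

def smooth_l1_loss(pred, target, beta=1.0):
    """Element-wise Smooth L1 (Huber-style) loss, averaged over elements.

    Quadratic for |pred - target| < beta, linear otherwise, matching the
    usual formulation of the SmoothL1 loss used to supervise reconstructed
    modality features.
    """
    d = np.abs(pred - target)
    loss = np.where(d < beta, 0.5 * d ** 2 / beta, d - 0.5 * beta)
    return loss.mean()

# Small error -> quadratic branch; large error -> linear branch.
small = smooth_l1_loss(np.array([0.5]), np.array([0.0]))  # 0.5 * 0.25 = 0.125
large = smooth_l1_loss(np.array([3.0]), np.array([0.0]))  # 3.0 - 0.5 = 2.5
```

The quadratic region makes the gradient small near zero error, which stabilises training compared with a plain L1 loss.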
Citations: 36
Joint Learning for Relationship and Interaction Analysis in Video with Multimodal Feature Fusion
Pub Date: 2021-10-17 · DOI: 10.1145/3474085.3479214
Beibei Zhang, Fan Yu, Yanxin Gao, Tongwei Ren, Gangshan Wu
To comprehend long-duration videos, the deep video understanding (DVU) task is proposed to recognize interactions at the scene level and relationships at the movie level, and to answer questions at these two levels. In this paper, we propose a solution to the DVU task that applies joint learning of interaction and relationship prediction together with multimodal feature fusion. Our solution handles the DVU task with three jointly learned sub-tasks: scene sentiment classification, scene interaction recognition, and super-scene video relationship recognition, all of which utilize text, visual, and audio features and predict representations in a semantic space. Since sentiment, interaction, and relationship are related to each other, we train a unified framework with joint learning. We then answer the video-analysis questions in DVU according to the results of the three sub-tasks. We conduct experiments on the HLVU dataset to evaluate the effectiveness of our method.
Citations: 10
ZoomSense: A Scalable Infrastructure for Augmenting Zoom
Pub Date: 2021-10-17 · DOI: 10.1145/3474085.3478332
Tom Bartindale, Peter Chen, Harrison Marshall, Stanislav Pozdniakov, D. Richardson
We have seen a dramatic increase in the adoption of teleconferencing systems such as Zoom for remote teaching and working. Although designed primarily for traditional video-conferencing scenarios, these platforms are being deployed in many diverse contexts. As such, Zoom offers little to aid hosts' understanding of attendee participation and often hinders participant agency. We introduce ZoomSense: an open-source, scalable infrastructure built upon 'virtual meeting participants', which exposes real-time metadata, meeting content, and host controls through an easy-to-use abstraction, so that developers can rapidly and sustainably augment Zoom.
Citations: 4
SVHAN: Sequential View Based Hierarchical Attention Network for 3D Shape Recognition
Pub Date: 2021-10-17 · DOI: 10.1145/3474085.3475371
Yue Zhao, Weizhi Nie, Anan Liu, Zan Gao, Yuting Su
As an important field of multimedia, 3D shape recognition has attracted much research attention in recent years. Many deep learning models have been proposed for effective 3D shape representation. View-based methods show superiority due to their comprehensive exploration of visual characteristics with the help of established 2D CNN architectures. Current approaches, however, have the following disadvantages. First, the majority of methods do not consider the sequential information among the multiple views, which can provide descriptive characteristics for shape representation. Second, incomplete exploration of multi-view correlations directly affects the discriminative power of shape descriptors. Finally, roughly aggregating multi-view features loses descriptive information, which limits the effectiveness of the shape representation. To handle these issues, we propose a novel sequential-view-based hierarchical attention network (SVHAN) for 3D shape recognition. Specifically, we first divide the view sequence into several view blocks. Then, we introduce a novel hierarchical feature aggregation module (HFAM), which hierarchically exploits view-level, block-level, and shape-level features; intra- and inter-view-block correlations are also captured to improve the discriminative power of the learned features. Subsequently, a novel selective fusion module (SFM) is designed for feature aggregation, considering the correlations between different levels and preserving effective information. Finally, discriminative and informative shape descriptors are generated for the recognition task. We validate the effectiveness of our proposed method on two public databases. The experimental results show the superiority of SVHAN over current state-of-the-art approaches.
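The hierarchy described above (views grouped into view blocks, then pooled to block-level and shape-level features) can be sketched with plain mean pooling standing in for the paper's attention and selective-fusion modules; all names and shapes below are illustrative.

```python
import numpy as np

def hierarchical_view_aggregation(view_feats, block_size):
    """Hierarchically pool per-view features into a shape descriptor.

    The ordered view sequence is split into view blocks, block-level
    features are pooled from views, and a shape-level descriptor is
    pooled from blocks. Learned attention is replaced here by means.
    view_feats: (V, d) features for V ordered views.
    """
    V, d = view_feats.shape
    assert V % block_size == 0, "views must divide evenly into blocks"
    blocks = view_feats.reshape(V // block_size, block_size, d)
    block_feats = blocks.mean(axis=1)   # view level  -> block level
    return block_feats.mean(axis=0)     # block level -> shape level

feats = np.arange(12, dtype=float).reshape(6, 2)  # 6 views, 2-d features
desc = hierarchical_view_aggregation(feats, block_size=3)
```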
Citations: 6
M3TR: Multi-modal Multi-label Recognition with Transformer
Pub Date: 2021-10-17 · DOI: 10.1145/3474085.3475191
Jiawei Zhao, Yifan Zhao, Jia Li
Multi-label image recognition aims to recognize multiple objects simultaneously in one image. Recent approaches to this problem have focused on learning dependencies among label co-occurrences to enhance high-level semantic representations. However, these methods usually neglect the important relations of intrinsic visual structures and face difficulties in understanding contextual relationships. To build a global scope of visual context as well as interactions between the visual and linguistic modalities, we propose the Multi-Modal Multi-label recognition TRansformer (M3TR) with ternary relationship learning for inter- and intra-modalities. For the intra-modal relationship, we combine CNNs and Transformers, embedding visual structures into the high-level features by learning semantic cross-attention. To construct interactions between the visual and linguistic modalities, we propose a linguistic cross-attention that embeds class-wise linguistic information into the visual structure learning, and finally present a linguistic-guided enhancement module to enhance the representation of high-level semantics. Experimental evidence shows that, with the collaborative learning of the ternary relationship, our proposed M3TR achieves new state-of-the-art results on two public multi-label recognition benchmarks.
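The cross-attention primitive named in the abstract (one modality's features attending to another's) has a standard scaled-dot-product form. The sketch below omits the learned projection matrices of a full implementation and uses hypothetical token counts and dimensions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Single-head cross-attention: one modality attends to another.

    queries: (Tq, d) tokens from one modality (e.g. visual features);
    keys/values: (Tk, d) tokens from the other (e.g. label embeddings).
    Returns (Tq, d): each query becomes a convex combination of values.
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)    # (Tq, Tk) similarity
    return softmax(scores, axis=-1) @ values  # (Tq, d)

vis = np.random.default_rng(0).normal(size=(4, 8))  # visual tokens
txt = np.random.default_rng(1).normal(size=(6, 8))  # linguistic tokens
out = cross_attention(vis, txt, txt)
```

Because each output row is a convex combination of the value rows, the result stays inside the range of the linguistic features it attends to.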
Citations: 16
Multimodal Relation Extraction with Efficient Graph Alignment
Pub Date: 2021-10-17 · DOI: 10.1145/3474085.3476968
Changmeng Zheng, Junhao Feng, Ze Fu, Yiru Cai, Qing Li, Tao Wang
Relation extraction (RE) is a fundamental process in constructing knowledge graphs. However, previous relation extraction methods suffer a sharp performance decline on short and noisy social media texts due to a lack of context. Fortunately, the related visual content (objects and their relations) in social media posts can supplement the missing semantics and help to extract relations precisely. We introduce multimodal relation extraction (MRE), a task that identifies textual relations with visual clues. To tackle this problem, we present a large-scale dataset that contains 15000+ sentences with 23 pre-defined relation categories. Considering that the visual relations among objects correspond to textual relations, we develop a dual graph alignment method to capture this correlation for better performance. Experimental results demonstrate that visual content helps to identify relations more precisely than text-only baselines. Moreover, our alignment method can find the correlations between vision and language, resulting in better performance. Our dataset and code are available at https://github.com/thecharm/Mega.
Citations: 44
Text2Video: Automatic Video Generation Based on Text Scripts
Pub Date: 2021-10-17 · DOI: 10.1145/3474085.3478548
Yipeng Yu, Zirui Tu, Longyu Lu, Xiao Chen, Hui Zhan, Zixun Sun
To make video creation simpler, in this paper we present Text2Video, a novel system that automatically produces videos from text editing alone, for novice users. Given an input text script, the director-like system generates engaging game-related videos that illustrate the given narrative, provide diverse multimodal content, and follow video-editing guidelines. The system involves five modules: (1) a material manager extracts highlights from raw live game videos and tags each video highlight, image, and audio clip with labels; (2) a natural language processor extracts entities and semantics from the input text scripts; (3) a refined cross-modal retrieval searches for matching candidate shots from the material manager; (4) a text-to-speech speaker reads the processed text scripts with a synthesized human voice; (5) the selected material shots and synthesized speech are assembled artistically through appropriate video-editing techniques.
Citations: 2