{"title":"Merging Multiple Template Matching Predictions in Intra Coding with Attentive Convolutional Neural Network","authors":"Qijun Wang, Guodong Zheng","doi":"10.1145/3474085.3475359","DOIUrl":"https://doi.org/10.1145/3474085.3475359","url":null,"abstract":"In intra coding, template matching prediction is an effective method to reduce the non-local redundancy inside image content. However, the prediction indicated by the best template matching is not always the actually best prediction. To solve this problem, we propose a method, which merges multiple template matching predictions through a convolutional neural network with attention module. The convolutional neural network aims at exploring different combinations of the candidate template matching predictions, and the attention module focuses on determining the most significant prediction candidate. Besides, the spatial module in attention mechanism can be utilized to model the relationship between the original pixels in current block and the reconstructed pixels in adjacent regions (template). Compared to the directional intra prediction and traditional template matching prediction, our method can provide a unified framework to generate prediction with high accuracy. The experimental results show that, compared the averaging strategy, the BD-rate reductions can reach up to 4.7%, 5.5% and 18.3% on the classic standard sequences (classB-classF), SIQAD dataset (screen content), and Urban100 dataset (natural scenes) respectively, while the average bit rate saving are 0.5%, 2.7% and 1.8%, respectively.","PeriodicalId":357468,"journal":{"name":"Proceedings of the 29th ACM International Conference on Multimedia","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114986823","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Information-Growth Attention Network for Image Super-Resolution","authors":"Zhuangzi Li, Ge Li, Thomas H. Li, Shan Liu, Wei Gao","doi":"10.1145/3474085.3475207","DOIUrl":"https://doi.org/10.1145/3474085.3475207","url":null,"abstract":"It is generally known that a high-resolution (HR) image contains more productive information compared with its low-resolution (LR) versions, so image super-resolution (SR) satisfies an information-growth process. Considering the property, we attempt to exploit the growing information via a particular attention mechanism. In this paper, we propose a concise but effective Information-Growth Attention Network (IGAN) that shows the incremental information is beneficial for SR. Specifically, a novel information-growth attention is proposed. It aims to pay attention to features involving large information-growth capacity by assimilating the difference from current features to the former features within a network. We also illustrate its effectiveness contrasted by widely-used self-attention using entropy and generalization analysis. Furthermore, existing channel-wise attention generation modules (CAGMs) have large informational attenuation due to directly calculating global mean for feature maps. Therefore, we present an innovative CAGM that progressively decreases feature maps' sizes, leading to more adequate feature exploitation. Extensive experiments also demonstrate IGAN outperforms state-of-the-art attention-aware SR approaches.","PeriodicalId":357468,"journal":{"name":"Proceedings of the 29th ACM International Conference on Multimedia","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115148779","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Heraclitus's Forest: An Interactive Artwork for Oral History","authors":"Lin Wang, Zhonghao Lin, Wei Cai","doi":"10.1145/3474085.3478544","DOIUrl":"https://doi.org/10.1145/3474085.3478544","url":null,"abstract":"Heraclitus's Forest is an interactive artwork that utilizes birch trees as a metaphor for the life stories recorded in an oral history database. We design a day/night cycle system to present the forest experience along the time elapse, multiple interaction modes to engage audiences' participation in history exploration, and evolving forest to arouse people's reflection on the feature of history, which is constantly being constructed but can never be returned to.","PeriodicalId":357468,"journal":{"name":"Proceedings of the 29th ACM International Conference on Multimedia","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115231307","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Transformer-based Feature Reconstruction Network for Robust Multimodal Sentiment Analysis","authors":"Ziqi Yuan, Wei Li, Hua Xu, Wenmeng Yu","doi":"10.1145/3474085.3475585","DOIUrl":"https://doi.org/10.1145/3474085.3475585","url":null,"abstract":"Improving robustness against data missing has become one of the core challenges in Multimodal Sentiment Analysis (MSA), which aims to judge speaker sentiments from the language, visual, and acoustic signals. In the current research, translation-based methods and tensor regularization methods are proposed for MSA with incomplete modality features. However, both of them fail to cope with random modality feature missing in non-aligned sequences. In this paper, a transformer-based feature reconstruction network (TFR-Net) is proposed to improve the robustness of models for the random missing in non-aligned modality sequences. First, intra-modal and inter-modal attention-based extractors are adopted to learn robust representations for each element in modality sequences. Then, a reconstruction module is proposed to generate the missing modality features. With the supervision of SmoothL1Loss between generated and complete sequences, TFR-Net is expected to learn semantic-level features corresponding to missing features. Extensive experiments on two public benchmark datasets show that our model achieves good results against data missing across various missing modality combinations and various missing degrees.","PeriodicalId":357468,"journal":{"name":"Proceedings of the 29th ACM International Conference on Multimedia","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115481106","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Joint Learning for Relationship and Interaction Analysis in Video with Multimodal Feature Fusion","authors":"Beibei Zhang, Fan Yu, Yanxin Gao, Tongwei Ren, Gangshan Wu","doi":"10.1145/3474085.3479214","DOIUrl":"https://doi.org/10.1145/3474085.3479214","url":null,"abstract":"To comprehend long duration videos, the deep video understanding (DVU) task is proposed to recognize interactions on scene level and relationships on movie level and answer questions on these two levels. In this paper, we propose a solution to the DVU task which applies joint learning of interaction and relationship prediction and multimodal feature fusion. Our solution handles the DVU task with three joint learning sub-tasks: scene sentiment classification, scene interaction recognition and super-scene video relationship recognition, all of which utilize text features, visual features and audio features, and predict representations in semantic space. Since sentiment, interaction and relationship are related to each other, we train a unified framework with joint learning. Then, we answer questions for video analysis in DVU according to the results of the three sub-tasks. We conduct experiments on the HLVU dataset to evaluate the effectiveness of our method.","PeriodicalId":357468,"journal":{"name":"Proceedings of the 29th ACM International Conference on Multimedia","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116912328","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ZoomSense: A Scalable Infrastructure for Augmenting Zoom","authors":"Tom Bartindale, Peter Chen, Harrison Marshall, Stanislav Pozdniakov, D. Richardson","doi":"10.1145/3474085.3478332","DOIUrl":"https://doi.org/10.1145/3474085.3478332","url":null,"abstract":"We have seen a dramatic increase in the adoption of teleconferencing systems such as Zoom for remote teaching and working. Although designed primarily for traditional video conferencing scenarios, these platforms are actually being deployed in many diverse contexts. As such, Zoom offers little to aid hosts' understanding of attendee participation and often hinders participant agency. We introduce ZoomSense : an open-source, scalable infrastructure built upon 'virtual meeting participants', which exposes real-time meta-data, meeting content and host controls through an easy to use abstraction - so that developers can rapidly and sustainably augment Zoom.","PeriodicalId":357468,"journal":{"name":"Proceedings of the 29th ACM International Conference on Multimedia","volume":"4 4","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120915028","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SVHAN: Sequential View Based Hierarchical Attention Network for 3D Shape Recognition","authors":"Yue Zhao, Weizhi Nie, Anan Liu, Zan Gao, Yuting Su","doi":"10.1145/3474085.3475371","DOIUrl":"https://doi.org/10.1145/3474085.3475371","url":null,"abstract":"As an important field of multimedia, 3D shape recognition has attracted much research attention in recent years. A lot of deep learning models have been proposed for effective 3D shape representation. The view-based methods show the superiority due to the comprehensive exploration of the visual characteristics with the help of established 2D CNN architectures. Generally, the current approaches contain the following disadvantages: First, the most majority of methods lack the consideration for sequential information among the multiple views, which can provide descriptive characteristics for shape representation. Second, the incomprehensive exploration for the multi-view correlations directly affects the discrimination of shape descriptors. Finally, roughly aggregating multi-view features leads to the loss of descriptive information, which limits the shape representation effectiveness. To handle these issues, we propose a novel sequential view based hierarchical attention network (SVHAN) for 3D shape recognition. Specifically, we first divide the view sequence into several view blocks. Then, we introduce a novel hierarchical feature aggregation module (HFAM), which hierarchically exploits the view-level, block-level, and shape-level features, the intra- and inter- view-block correlations are also captured to improve the discrimination of learned features. Subsequently, a novel selective fusion module (SFM) is designed for feature aggregation, considering the correlations between different levels and preserving effective information. Finally, discriminative and informative shape descriptors are generated for the recognition task. We validate the effectiveness of our proposed method on two public databases. The experimental results show the superiority of SVHAN against the current state-of-the-art approaches.","PeriodicalId":357468,"journal":{"name":"Proceedings of the 29th ACM International Conference on Multimedia","volume":"122 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127356571","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"M3TR: Multi-modal Multi-label Recognition with Transformer","authors":"Jiawei Zhao, Yifan Zhao, Jia Li","doi":"10.1145/3474085.3475191","DOIUrl":"https://doi.org/10.1145/3474085.3475191","url":null,"abstract":"Multi-label image recognition aims to recognize multiple objects simultaneously in one image. Recent ideas to solve this problem have focused on learning dependencies of label co-occurrences to enhance the high-level semantic representations. However, these methods usually neglect the important relations of intrinsic visual structures and face difficulties in understanding contextual relationships. To build the global scope of visual context as well as interactions between visual modality and linguistic modality, we propose the Multi-Modal Multi-label recognition TRansformers (M3TR) with the ternary relationship learning for inter-and intra-modalities. For the intra-modal relationship, we make insightful conjunction of CNNs and Transformers, which embeds visual structures into the high-level features by learning the semantic cross-attention. For constructing the interactions between the visual and linguistic modalities, we propose a linguistic cross-attention to embed the class-wise linguistic information into the visual structure learning, and finally present a linguistic guided enhancement module to enhance the representation of high-level semantics. Experimental evidence reveals that with the collaborative learning of ternary relationship, our proposed M3TR achieves new state-of-the-art on two public multi-label recognition benchmarks.","PeriodicalId":357468,"journal":{"name":"Proceedings of the 29th ACM International Conference on Multimedia","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125911448","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multimodal Relation Extraction with Efficient Graph Alignment","authors":"Changmeng Zheng, Junhao Feng, Ze Fu, Yiru Cai, Qing Li, Tao Wang","doi":"10.1145/3474085.3476968","DOIUrl":"https://doi.org/10.1145/3474085.3476968","url":null,"abstract":"Relation extraction (RE) is a fundamental process in constructing knowledge graphs. However, previous methods on relation extraction suffer sharp performance decline in short and noisy social media texts due to a lack of contexts. Fortunately, the related visual contents (objects and their relations) in social media posts can supplement the missing semantics and help to extract relations precisely. We introduce the multimodal relation extraction (MRE), a task that identifies textual relations with visual clues. To tackle this problem, we present a large-scale dataset which contains 15000+ sentences with 23 pre-defined relation categories. Considering that the visual relations among objects are corresponding to textual relations, we develop a dual graph alignment method to capture this correlation for better performance. Experimental results demonstrate that visual contents help to identify relations more precisely against the text-only baselines. Besides, our alignment method can find the correlations between vision and language, resulting in better performance. Our dataset and code are available at https://github.com/thecharm/Mega.","PeriodicalId":357468,"journal":{"name":"Proceedings of the 29th ACM International Conference on Multimedia","volume":"76 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123223809","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Text2Video: Automatic Video Generation Based on Text Scripts","authors":"Yipeng Yu, Zirui Tu, Longyu Lu, Xiao Chen, Hui Zhan, Zixun Sun","doi":"10.1145/3474085.3478548","DOIUrl":"https://doi.org/10.1145/3474085.3478548","url":null,"abstract":"To make video creation simpler, in this paper we present Text2Video, a novel system to automatically produce videos using only text-editing for novice users. Given an input text script, the director-like system can generate game-related engaging videos which illustrate the given narrative, provide diverse multi-modal content, and follow video editing guidelines. The system involves five modules: (1) A material manager extracts highlights from raw live game videos, and tags each video highlight, image and audio with labels. (2) A natural language processor extracts entities and semantics from the input text scripts. (3) A refined cross-modal retrieval searches for matching candidate shots from the material manager. (4) A text to speech speaker reads the processed text scripts with synthesized human voice. (5) The selected material shots and synthesized speech are assembled artistically through appropriate video editing techniques.","PeriodicalId":357468,"journal":{"name":"Proceedings of the 29th ACM International Conference on Multimedia","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114935070","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}