{"title":"Similar scene retrieval in soccer videos with weak annotations by multimodal use of bidirectional LSTM","authors":"T. Haruyama, Sho Takahashi, Takahiro Ogawa, M. Haseyama","doi":"10.1145/3444685.3446280","DOIUrl":"https://doi.org/10.1145/3444685.3446280","url":null,"abstract":"This paper presents a novel method to retrieve similar scenes in soccer videos with weak annotations via multimodal use of bidirectional long short-term memory (BiLSTM). The significant increase in the number of different types of soccer videos with the development of technology brings valid assets for effective coaching, but it also increases the work of players and training staff. We tackle this problem with a nontraditional combination of pre-trained models for feature extraction and BiLSTMs for feature transformation. By using the pre-trained models, no training data is required for feature extraction. Then effective feature transformation for similarity calculation is performed by applying BiLSTM trained with weak annotations. This transformation allows for highly accurate capture of soccer video context from less annotation work. In this paper, we achieve an accurate retrieval of similar scenes by multimodal use of this BiLSTM-based transformer trainable with less human effort. The effectiveness of our method was verified by comparative experiments with state-of-the-art using actual soccer video dataset.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124449118","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Low-quality watermarked face inpainting with discriminative residual learning","authors":"Zheng He, Xueli Wei, Kangli Zeng, Zhen Han, Qin Zou, Zhongyuan Wang","doi":"10.1145/3444685.3446261","DOIUrl":"https://doi.org/10.1145/3444685.3446261","url":null,"abstract":"Most existing image inpainting methods assume that the location of the repair area (watermark) is known, but this assumption does not always hold. In addition, the actual watermarked face is in a compressed low-quality form, which is very disadvantageous to the repair due to compression distortion effects. To address these issues, this paper proposes a low-quality watermarked face inpainting method based on joint residual learning with cooperative discriminant network. We first employ residual learning based global inpainting and facial features based local inpainting to render clean and clear faces under unknown watermark positions. Because the repair process may distort the genuine face, we further propose a discriminative constraint network to maintain the fidelity of repaired faces. Experimentally, the average PSNR of inpainted face images is increased by 4.16dB, and the average SSIM is increased by 0.08. TPR is improved by 16.96% when FPR is 10% in face verification.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127980973","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Two-stage structure aware image inpainting based on generative adversarial networks","authors":"Jin Wang, Xi Zhang, Chen Wang, Qing Zhu, Baocai Yin","doi":"10.1145/3444685.3446260","DOIUrl":"https://doi.org/10.1145/3444685.3446260","url":null,"abstract":"In recent years, the image inpainting technology based on deep learning has made remarkable progress, which can better complete the complex image inpainting task compared with traditional methods. However, most of the existing methods can not generate reasonable structure and fine texture details at the same time. To solve this problem, in this paper we propose a two-stage image inpainting method with structure awareness based on Generative Adversarial Networks, which divides the inpainting process into two sub tasks, namely, image structure generation and image content generation. In the former stage, the network generates the structural information of the missing area; while in the latter stage, the network uses this structural information as a prior, and combines the existing texture and color information to complete the image. Extensive experiments are conducted to evaluate the performance of our proposed method on Places2, CelebA and Paris Streetview datasets. The experimental results show the superior performance of the proposed method compared with other state-of-the-art methods qualitatively and quantitatively.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"127 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134539340","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Attention feature matching for weakly-supervised video relocalization","authors":"Haoyu Tang, Jihua Zhu, Zan Gao, Tao Zhuo, Zhiyong Cheng","doi":"10.1145/3444685.3446317","DOIUrl":"https://doi.org/10.1145/3444685.3446317","url":null,"abstract":"Localizing the desired video clip for a given query in an untrimmed video has been a hot research topic for multimedia understanding. Recently, a new task named video relocalization, in which the query is a video clip, has been raised. Some methods have been developed for this task, however, these methods often require dense annotations of the temporal boundaries inside long videos for training. A more practical solution is the weakly-supervised approach, which only needs the matching information between the query and video. Motivated by that, we propose a weakly-supervised video relocalization approach based on an attention-based feature matching method. Specifically, it recognizes the video clip by finding the clip whose frames are the most relevant to the query clip frames based on the matching results of the frame embeddings. In addition, an attention module is introduced to identify the frames containing rich semantic correlations in the query video. Extensive experiments on the ActivityNet dataset demonstrate that our method can outperform several weakly-supervised methods consistently and even achieve competing performance to supervised baselines.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"75 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114721430","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Graph-based motion prediction for abnormal action detection","authors":"Yao Tang, Lin Zhao, Zhaoliang Yao, Chen Gong, Jian Yang","doi":"10.1145/3444685.3446316","DOIUrl":"https://doi.org/10.1145/3444685.3446316","url":null,"abstract":"Abnormal action detection is the most noteworthy part of anomaly detection, which tries to identify unusual human behaviors in videos. Previous methods typically utilize future frame prediction to detect frames deviating from the normal scenario. While this strategy enjoys success in the accuracy of anomaly detection, critical information such as the cause and location of the abnormality is unable to be acquired. This paper proposes human motion prediction for abnormal action detection. We employ sequence of human poses to represent human motion, and detect irregular behavior by comparing the predicted pose with the actual pose detected in the frame. Hence the proposed method is able to explain why the action is regarded as irregularity and locate where the anomaly happens. Moreover, pose sequence is robust to noise, complex background and small targets in videos. Since posture information is non-Euclidean data, graph convolutional network is adopted for future pose prediction, which not only leads to greater expressive power but also stronger generalization capability. Experiments are conducted both on the widely used anomaly detection dataset ShanghaiTech and our newly proposed dataset NJUST-Anomaly, which mainly contains irregular behaviors happened in the campus. Our dataset expands the existing datasets by giving more abnormal actions attracting public attention in social security, which happen in more complex scenes and dynamic backgrounds. Experimental results on both datasets demonstrate the superiority of our method over the-state-of-the-art methods. The source code and NJUST-Anomaly dataset will be made public at https://github.com/datangzhengqing/MP-GCN.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"375 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126719536","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multiplicative angular margin loss for text-based person search","authors":"Peng Zhang, Deqiang Ouyang, Feiyu Chen, Jie Shao","doi":"10.1145/3444685.3446314","DOIUrl":"https://doi.org/10.1145/3444685.3446314","url":null,"abstract":"Text-based person search aims at retrieving the most relevant pedestrian images from database in response to a query in form of natural language description. Existing algorithms mainly focus on embedding textual and visual features into a common semantic space so that the similarity score of features from different modalities can be computed directly. Softmax loss is widely adopted to classify textual and visual features into a correct category in the joint embedding space. However, softmax loss can only help classify features but not increase the intra-class compactness and inter-class discrepancy. To this end, we propose multiplicative angular margin (MAM) loss to learn angularly discriminative features for each identity. The multiplicative angular margin loss penalizes the angle between feature vector and its corresponding classifier vector to learn more discriminative feature. Moreover, to focus more on informative image-text pair, we propose pairwise similarity weighting (PSW) loss to assign higher weight to informative pairs. Extensive experimental evaluations have been conducted on the CUHK-PEDES dataset over our proposed losses. The results show the superiority of our proposed method. Code is available at https://github.com/pengzhanguestc/MAM_loss.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"73 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114232488","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Defense for adversarial videos by self-adaptive JPEG compression and optical texture","authors":"Yupeng Cheng, Xingxing Wei, H. Fu, Shang-Wei Lin, Weisi Lin","doi":"10.1145/3444685.3446308","DOIUrl":"https://doi.org/10.1145/3444685.3446308","url":null,"abstract":"Despite demonstrated outstanding effectiveness in various computer vision tasks, Deep Neural Networks (DNNs) are known to be vulnerable to adversarial examples. Nowadays, adversarial attacks as well as their defenses w.r.t. DNNs in image domain have been intensively studied, and there are some recent works starting to explore adversarial attacks w.r.t. DNNs in video domain. However, the corresponding defense is rarely studied. In this paper, we propose a new two-stage framework for defending video adversarial attack. It contains two main components, namely self-adaptive Joint Photographic Experts Group (JPEG) compression defense and optical texture based defense (OTD). In self-adaptive JPEG compression defense, we propose to adaptively choose an appropriate JPEG quality based on an estimation of moving foreground object, such that the JPEG compression could depress most impact of adversarial noise without losing too much video quality. In OTD, we generate \"optical texture\" containing high-frequency information based on the optical flow map, and use it to edit Y channel (in YCrCb color space) of input frames, thus further reducing the influence of adversarial perturbation. Experimental results on a benchmark dataset demonstrate the effectiveness of our framework in recovering the classification performance on perturbed videos.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"35 6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116616076","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Determining image age with rank-consistent ordinal classification and object-centered ensemble","authors":"Shota Ashida, A. Jatowt, A. Doucet, Masatoshi Yoshikawa","doi":"10.1145/3444685.3446326","DOIUrl":"https://doi.org/10.1145/3444685.3446326","url":null,"abstract":"A significant number of old photographs including ones that are posted online do not contain the information of the date at which they were taken, or this information needs to be verified. Many of such pictures are either scanned analog photographs or photographs taken using a digital camera with incorrect settings. Estimating the date of such pictures is useful for enhancing data quality and its consistency, improving information retrieval and for other related applications. In this study, we propose a novel approach for automatic estimation of the shooting dates of photographs based on a rank-consistent ordinal classification method for neural networks. We also introduce an ensemble approach that involves object segmentation. We conclude that assuring the rank consistency in the ordinal classification as well as combining models trained on segmented objects improve the results of the age determination task.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131526715","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Relationship graph learning network for visual relationship detection","authors":"Yanan Li, Jun Yu, Yibing Zhan, Zhi Chen","doi":"10.1145/3444685.3446312","DOIUrl":"https://doi.org/10.1145/3444685.3446312","url":null,"abstract":"Visual relationship detection aims to predict the relationships between detected object pairs. It is well believed that the correlations between image components (i.e., objects and relationships between objects) are significant considerations when predicting objects' relationships. However, most current visual relationship detection methods only exploited the correlations among objects, and the correlations among objects' relationships remained underexplored. This paper proposes a relationship graph learning network (RGLN) to explore the correlations among objects' relationships for visual relationship detection. Specifically, RGLN obtains image objects using an object detector, and then, every pair of objects constitutes a relationship proposal. All relationship proposals construct a relationship graph, in which the proposals are treated as nodes. Accordingly, RGLN designs bi-stream graph attention subnetworks to detect relationship proposals, in which one graph attention subnetwork analyzes correlations among relationships based on visual and spatial information, and the other analyzes correlations based on semantic and spatial information. Besides, RGLN exploits a relationship selection subnetwork to ignore redundant information of object pairs with no relationships. We conduct extensive experiments on two public datasets: the VRD and the VG datasets. The experimental results compared with the state-of-the-art demonstrate the competitiveness of RGLN.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133276438","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Table detection and cell segmentation in online handwritten documents with graph attention networks","authors":"Ying-Jian Liu, Heng Zhang, Xiao-Long Yun, Jun-Yu Ye, Cheng-Lin Liu","doi":"10.1145/3444685.3446295","DOIUrl":"https://doi.org/10.1145/3444685.3446295","url":null,"abstract":"In this paper, we propose a multi-task learning approach for table detection and cell segmentation with densely connected graph attention networks in free form online documents. Each online document is regarded as a graph, where nodes represent strokes and edges represent the relationships between strokes. Then we propose a graph attention network model to classify nodes and edges simultaneously. According to node classification results, tables can be detected in each document. By combining node and edge classification resutls, cells in each table can be segmented. To improve information flow in the network and enable efficient reuse of features among layers, dense connectivity among layers is used. Our proposed model has been experimentally validated on an online handwritten document dataset IAMOnDo and achieved encouraging results.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133941461","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}