Proceedings of the 2022 International Conference on Multimedia Retrieval: Latest Publications

Improve Image Captioning by Modeling Dynamic Scene Graph Extension
Proceedings of the 2022 International Conference on Multimedia Retrieval. Pub Date: 2022-06-27. DOI: 10.1145/3512527.3531401
Minghao Geng, Qingjie Zhao
Abstract: Recently, scene graph generation methods have been used in image captioning to encode objects and their relationships within the encoder-decoder framework, where the decoder selects part of the graph nodes as input for word inference. However, current methods attend to the scene graph relying on ambiguous language information, neglecting the strong connections between scene graph nodes. In this paper, we propose a Scene Graph Extension (SGE) architecture that models dynamic scene graph extension using the partly generated sentence. Our model first uses the generated words and the previous attention results over scene graph nodes to form a partial scene graph. Then we choose the objects or relationships that have close connections with the generated graph to infer the next word. Our SGE is appealing in that it is pluggable into any scene-graph-based image captioning method. We conduct extensive experiments on the MSCOCO dataset. The results show that the proposed SGE significantly outperforms the baselines, achieving state-of-the-art performance under most metrics.
Citations: 0
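
The abstract states that the next word is inferred from the scene-graph node most strongly connected to the partially generated graph. Purely as a rough illustration (not the authors' model), that node-selection step could be sketched as follows, where the adjacency matrix and node indices are assumed placeholders:

import torch

def select_next_node(adjacency, generated_ids):
    """adjacency: (N, N) edge weights of the full scene graph; generated_ids: nodes already used."""
    connection = adjacency[:, generated_ids].sum(dim=1)  # connection strength to the partial graph
    connection[generated_ids] = float("-inf")            # never re-select nodes already in the graph
    return int(connection.argmax())

adjacency = torch.rand(10, 10)               # 10 scene-graph nodes (objects and relationships)
print(select_next_node(adjacency, [0, 3]))   # index of the most strongly connected new node
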
Temporal-Consistent Visual Clue Attentive Network for Video-Based Person Re-Identification
Proceedings of the 2022 International Conference on Multimedia Retrieval. Pub Date: 2022-06-27. DOI: 10.1145/3512527.3531362
Bingliang Jiao, Liying Gao, Peng Wang
Abstract: Video-based person re-identification (ReID) aims to match video trajectories of pedestrians across multi-view cameras and has important applications in criminal investigation and intelligent surveillance. Compared with single-image re-identification, the abundant temporal information contained in video sequences allows pedestrian instances to be described more precisely and effectively. Recently, most existing video-based person ReID algorithms have made use of temporal information by fusing diverse visual contents captured in independent frames. However, these algorithms only measure the salience of visual clues in each single frame, inevitably introducing momentary interference caused by factors such as occlusion. Therefore, in this work, we introduce a Temporal-consistent Visual Clue Attentive Network (TVCAN), designed to capture temporally consistent salient pedestrian contents across frames. Our TVCAN consists of two major modules, the TCSA module and the TCCA module, which are responsible for capturing and emphasizing consistently salient visual contents along the spatial dimension and the channel dimension, respectively. Through extensive experiments, the effectiveness of our designed modules has been verified. Additionally, our TVCAN outperforms all compared state-of-the-art methods on three mainstream benchmarks.
Citations: 1
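
The abstract does not detail the TCSA/TCCA modules, but the core idea of temporal consistency, emphasizing clues that are salient across the whole tracklet rather than in a single frame, can be illustrated with a minimal channel-attention sketch. All shapes and the sigmoid gating below are assumptions, not TVCAN itself:

import torch

def temporal_consistent_channel_attention(clip_feats):
    """clip_feats: (T, C, H, W) per-frame feature maps of one pedestrian tracklet."""
    per_frame = clip_feats.mean(dim=(2, 3))     # (T, C) frame-wise channel saliency
    consistent = per_frame.mean(dim=0)          # (C,) saliency averaged over all frames
    weights = torch.sigmoid(consistent)         # channel attention shared by every frame
    return clip_feats * weights.view(1, -1, 1, 1)

clip = torch.randn(8, 512, 16, 8)               # 8 frames from one tracklet
print(temporal_consistent_channel_attention(clip).shape)   # torch.Size([8, 512, 16, 8])
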
The Impact of Dataset Splits on Classification Performance in Medical Videos
Proceedings of the 2022 International Conference on Multimedia Retrieval. Pub Date: 2022-06-27. DOI: 10.1145/3512527.3531424
Markus Fox, Klaus Schoeffmann
Abstract: The creation of datasets in medical imaging is a central topic of research, especially with the advances of deep learning in the past decade. Publications of such datasets typically report baseline results with one or more deep neural networks in the form of established performance metrics (e.g., F1-score, Jaccard, etc.). Then, much work is done trying to beat these baseline metrics to compare different neural architectures. However, the reported metrics are almost meaningless when the underlying data does not conform to specific standards. In order to better understand what standards are needed, we have reproduced and analyzed a study of four medical image classification datasets in laparoscopy. With automated frame extraction of surgical videos, we find that the resulting images are far too similar and produce high evaluation metrics by design. We show this similarity with a basic SIFT algorithm that produces high evaluation metrics on the original data. We confirm our hypothesis by creating and evaluating a video-based dataset split from the original images. The original network evaluated on the video-based split performs worse than our basic SIFT algorithm on the original data.
Citations: 1
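
The key point of the study, that frame-level splits leak near-duplicate frames from the same surgical video into both partitions, can be illustrated with a small sketch contrasting the two splitting strategies. The record structure and ratios below are hypothetical; this is not the authors' code:

import random
from collections import defaultdict

def frame_level_split(frames, test_ratio=0.2, seed=0):
    # Naive split: shuffles individual frames, leaking near-duplicates across sets.
    rng = random.Random(seed)
    shuffled = frames[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]

def video_level_split(frames, test_ratio=0.2, seed=0):
    # Video-based split: all frames of a video go to the same partition.
    by_video = defaultdict(list)
    for f in frames:
        by_video[f["video_id"]].append(f)
    videos = list(by_video)
    rng = random.Random(seed)
    rng.shuffle(videos)
    cut = int(len(videos) * (1 - test_ratio))
    train = [f for v in videos[:cut] for f in by_video[v]]
    test = [f for v in videos[cut:] for f in by_video[v]]
    return train, test

# Hypothetical frame records: 100 frames extracted from 5 videos.
frames = [{"video_id": f"vid{i % 5}", "path": f"frame_{i}.jpg", "label": i % 3}
          for i in range(100)]
train, test = video_level_split(frames)
assert not {f["video_id"] for f in train} & {f["video_id"] for f in test}
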
MuLER: Multiplet-Loss for Emotion Recognition
Proceedings of the 2022 International Conference on Multimedia Retrieval. Pub Date: 2022-06-27. DOI: 10.1145/3512527.3531406
Anwer Slimi, M. Zrigui, H. Nicolas
Abstract: With the rise of human-machine interaction, it has become necessary for machines to better understand humans in order to respond appropriately. Hence, to improve communication and interaction, it would be ideal for machines to automatically detect human emotions. Speech Emotion Recognition (SER) has been the focus of many studies in the past few years. However, accuracy remains limited and must be improved. In our work, we propose a new loss function that aims to encode speech utterances instead of classifying them directly, as the majority of existing models do. The encoding is learned such that utterances with the same labels have similar encodings. The encoded speech was tested on two datasets, and we achieved 88.19% accuracy on the RAVDESS (Ryerson Audiovisual Database of Emotional Speech and Song) dataset and 91.66% accuracy on the RML (Ryerson Multimedia Research Lab) dataset.
Citations: 1
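
The exact form of the multiplet loss is not given in this abstract. Purely as a hedged illustration of the stated goal (same-emotion utterances mapped to nearby encodings, different emotions pushed apart), a generic margin-based sketch could look like this, with the margin and embedding size chosen arbitrarily:

import torch
import torch.nn.functional as F

def same_label_margin_loss(embeddings, labels, margin=1.0):
    """embeddings: (N, D) utterance encodings; labels: (N,) emotion ids."""
    dist = torch.cdist(embeddings, embeddings)         # pairwise L2 distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)  # (N, N) same-emotion mask
    eye = torch.eye(len(labels), dtype=torch.bool, device=labels.device)
    pos = dist[same & ~eye]                            # pull same-label pairs together
    neg = F.relu(margin - dist[~same])                 # push different-label pairs apart
    return pos.mean() + neg.mean()

# Toy usage: 8 random utterance embeddings spanning 4 emotion classes.
emb = torch.randn(8, 128, requires_grad=True)
lab = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
loss = same_label_margin_loss(emb, lab)
loss.backward()
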
An Effective Two-way Metapath Encoder over Heterogeneous Information Network for Recommendation
Proceedings of the 2022 International Conference on Multimedia Retrieval. Pub Date: 2022-06-27. DOI: 10.1145/3512527.3531402
Yanbin Jiang, Huifang Ma, Xiaohui Zhang, Zhixin Li, Liang Chang
Abstract: Heterogeneous information networks (HINs) are widely used in recommender system research due to their ability to model complex auxiliary information beyond historical interactions and thereby alleviate the data sparsity problem. Existing HIN-based recommendation studies have achieved great success by performing graph convolution operators between pairs of nodes on predefined metapath-induced graphs, but they have the following major limitations. First, existing heterogeneous network construction strategies tend to exploit item attributes while failing to effectively model user relations. In addition, previous HIN-based recommendation models mainly convert the heterogeneous graph into homogeneous graphs by defining metapaths, ignoring the complicated relation dependencies along the metapath. To tackle these limitations, we propose a novel recommendation model with a two-way metapath encoder for top-N recommendation, which models metapath similarity and sequence relation dependency in the HIN to learn node representations. Specifically, our model first learns initial node representations through a pre-training module, and then identifies potential friends and item relations based on their similarity to construct a unified HIN. We then develop a two-way encoder module with a similarity encoder and an instance encoder to capture the similarity collaborative signals and the relational dependencies on different metapaths. Finally, the representations on different metapaths are aggregated through an attention fusion layer to yield rich representations. Extensive experiments on three real datasets demonstrate the effectiveness of our method.
Citations: 4
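
The attention fusion layer mentioned above, which aggregates metapath-specific representations into a final node representation, is commonly realized as a learned softmax over metapaths. The following is a generic sketch of that pattern; the dimensions and scoring network are assumptions, not the paper's implementation:

import torch
import torch.nn as nn

class MetapathAttentionFusion(nn.Module):
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))

    def forward(self, metapath_reprs):
        """metapath_reprs: (P, N, D) -- P metapaths, N nodes, D-dim representations."""
        # One scalar weight per metapath, averaged over nodes, then softmax-normalized.
        weights = torch.softmax(self.score(metapath_reprs).mean(dim=1), dim=0)  # (P, 1)
        return (weights.unsqueeze(1) * metapath_reprs).sum(dim=0)               # (N, D)

reprs = torch.randn(3, 100, 64)            # 3 metapaths, 100 nodes
fused = MetapathAttentionFusion(64)(reprs)
print(fused.shape)                         # torch.Size([100, 64])
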
Summarizing Videos using Concentrated Attention and Considering the Uniqueness and Diversity of the Video Frames
Proceedings of the 2022 International Conference on Multimedia Retrieval. Pub Date: 2022-06-27. DOI: 10.1145/3512527.3531404
Evlampios Apostolidis, Georgios Balaouras, V. Mezaris, I. Patras
Abstract: In this work, we describe a new method for unsupervised video summarization. To overcome limitations of existing unsupervised video summarization approaches, which relate to the unstable training of Generator-Discriminator architectures, the use of RNNs for modeling long-range frame dependencies, and the difficulty of parallelizing the training of RNN-based network architectures, the developed method relies solely on a self-attention mechanism to estimate the importance of video frames. Instead of simply modeling frame dependencies with global attention, our method integrates a concentrated attention mechanism that focuses on non-overlapping blocks on the main diagonal of the attention matrix and enriches the existing information by extracting and exploiting knowledge about the uniqueness and diversity of the associated video frames. In this way, our method makes better estimates about the significance of different parts of the video and drastically reduces the number of learnable parameters. Experimental evaluations on two benchmark datasets (SumMe and TVSum) show the competitiveness of the proposed method against other state-of-the-art unsupervised summarization approaches and demonstrate its ability to produce video summaries that are very close to human preferences. An ablation study focusing on the introduced components, namely the use of concentrated attention in combination with attention-based estimates of frame uniqueness and diversity, shows their relative contributions to the overall summarization performance.
Citations: 14
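
The "concentrated attention" described above restricts attention to non-overlapping blocks on the main diagonal of the attention matrix. A minimal sketch of such block-diagonal attention follows; the block size and single-head formulation are assumptions, not the published architecture:

import torch

def concentrated_attention(frame_feats, block_size=8):
    """frame_feats: (T, D) per-frame features; returns (T, D) attended features."""
    T, D = frame_feats.shape
    scores = frame_feats @ frame_feats.t() / D ** 0.5        # (T, T) attention logits
    block_ids = torch.arange(T) // block_size
    mask = block_ids.unsqueeze(0) == block_ids.unsqueeze(1)  # block-diagonal mask
    scores = scores.masked_fill(~mask, float("-inf"))
    attn = torch.softmax(scores, dim=-1)                     # attention concentrated within blocks
    return attn @ frame_feats

feats = torch.randn(64, 256)          # e.g., 64 sampled frames with 256-d features
print(concentrated_attention(feats).shape)   # torch.Size([64, 256])
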
Joint Modality Synergy and Spatio-temporal Cue Purification for Moment Localization
Proceedings of the 2022 International Conference on Multimedia Retrieval. Pub Date: 2022-06-27. DOI: 10.1145/3512527.3531396
Xingyu Shen, L. Lan, Huibin Tan, Xiang Zhang, X. Ma, Zhigang Luo
Abstract: Currently, many approaches to the sentence-query-based moment localization (SQML) task emphasize (inter-)modality interaction between the video and the language query via transformer-based cross-attention or contrastive learning. However, they still face two issues: 1) modality interaction can be unexpectedly friendly to modality-specific learning that merely learns modality-specific patterns, and 2) modality interaction easily confuses spatio-temporal cues and ultimately makes the temporal cues in the original video ambiguous. In this paper, we propose a modality synergy with spatio-temporal cue purification method (MS2P) for SQML to address these two issues. In particular, a conceptually simple modality synergy strategy keeps features modality specific while absorbing complementary information from the other modality, using both a carefully designed cross-attention unit and non-contrastive learning. As a result, modality-specific semantics can be calibrated progressively in a safer way. To preserve the temporal cues of the original video, we further purify the video representation into spatial and temporal parts, enhancing localization resolution with two proposed light-weight sentence-aware filtering operations. Experiments on the Charades-STA, TACoS, and ActivityNet Captions datasets show that our model outperforms state-of-the-art approaches by a large margin.
Citations: 3
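
As a rough illustration of a cross-attention unit with a modality-preserving residual, in the spirit of the synergy strategy described above (not the MS2P implementation; the dimensions and the residual-plus-LayerNorm design are assumptions), one could write:

import torch
import torch.nn as nn

class CrossModalUnit(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, video, query):
        """video: (B, T, D) clip features; query: (B, L, D) word features."""
        complementary, _ = self.attn(video, query, query)  # video attends to the sentence
        return self.norm(video + complementary)            # residual keeps video-specific cues

v, q = torch.randn(2, 128, 256), torch.randn(2, 12, 256)
print(CrossModalUnit()(v, q).shape)    # torch.Size([2, 128, 256])
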
Cross-Pixel Dependency with Boundary-Feature Transformation for Weakly Supervised Semantic Segmentation
Proceedings of the 2022 International Conference on Multimedia Retrieval. Pub Date: 2022-06-27. DOI: 10.1145/3512527.3531360
Yuhui Guo, Xun Liang, Tang Hui, Bo Wu, Xiangping Zheng
Abstract: Weakly supervised semantic segmentation with image-level labels is a challenging problem that typically relies on the initial responses generated by a classification network to locate object regions. However, such initial responses cover only the most discriminative parts of the object and may incorrectly activate in background regions. To address this problem, we propose a Cross-pixel Dependency with Boundary-feature Transformation (CDBT) method for weakly supervised semantic segmentation. Specifically, we develop a boundary-feature transformation mechanism to build strong connections among pixels belonging to the same object but weak connections among different objects. Moreover, we design a cross-pixel dependency module to enhance the initial responses, which exploits contextual appearance information and refines the prediction of current pixels using the relations of global channel pixels, thus generating higher-quality pseudo labels for training the semantic segmentation network. Extensive experiments on the PASCAL VOC 2012 segmentation benchmark demonstrate that our method outperforms state-of-the-art methods using image-level labels as weak supervision.
Citations: 0
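
The cross-pixel dependency module is described only at a high level. The general pattern of refining initial responses through pixel-affinity propagation, where each pixel's response is re-estimated from pixels with similar features, can be sketched as follows; the feature shapes, temperature, and affinity form are assumptions, not the CDBT design:

import torch

def affinity_refine(features, cam, temperature=0.1):
    """features: (C, H, W) backbone features; cam: (K, H, W) initial class responses."""
    C, H, W = features.shape
    f = features.flatten(1)                                      # (C, H*W)
    f = torch.nn.functional.normalize(f, dim=0)                  # unit-norm pixel features
    affinity = torch.softmax(f.t() @ f / temperature, dim=-1)    # (H*W, H*W) cross-pixel affinity
    refined = cam.flatten(1) @ affinity.t()                      # propagate responses across pixels
    return refined.view(cam.shape)

feats = torch.randn(256, 32, 32)
cam = torch.rand(21, 32, 32)          # e.g., 21 PASCAL VOC classes
print(affinity_refine(feats, cam).shape)   # torch.Size([21, 32, 32])
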
MMArt-ACM 2022: 5th Joint Workshop on Multimedia Artworks Analysis and Attractiveness Computing in Multimedia
Proceedings of the 2022 International Conference on Multimedia Retrieval. Pub Date: 2022-06-27. DOI: 10.1145/3512527.3531442
Naoko Nitta, Anita Hu, Kensuke Tobitani
Abstract: In addition to classical art types such as paintings and sculptures, new types of artworks have emerged following the advancement of deep learning, social platforms, media capturing devices, and media processing tools. Large volumes of machine-/user-generated content and professionally edited content are shared and disseminated on the Web, so novel multimedia artworks emerge rapidly in the era of social media and big data. The ever-increasing amount of illustrations, comics, and animations on these platforms gives rise to challenges in automatic classification, indexing, and retrieval that have been studied widely in other areas but not necessarily for these emerging types of artwork. In addition to objective entities such as objects, events, and scenes, studies of cognitive properties are emerging. Among the various kinds of computational cognitive analyses, this workshop focuses on attractiveness analysis. The topics of the accepted papers cover the affective analysis of texts, images, and music. The MMArt-ACM 2022 proceedings are available at: https://dl.acm.org/citation.cfm?id=3512730.
Citations: 0
MultiCLU: Multi-stage Context Learning and Utilization for Storefront Accessibility Detection and Evaluation
Proceedings of the 2022 International Conference on Multimedia Retrieval. Pub Date: 2022-06-27. DOI: 10.1145/3512527.3531361
X. Wang, Jiajun Chen, Hao Tang, Zhigang Zhu
Abstract: In this work, a storefront accessibility image dataset is collected from Google Street View and labeled with three main objects for storefront accessibility: doors (store entrances), doorknobs (for accessing the entrances), and stairs (leading to the entrances). MultiCLU, a new multi-stage context learning and utilization approach, is then proposed with four stages: Context in Labeling (CIL), Context in Training (CIT), Context in Detection (CID), and Context in Evaluation (CIE). The CIL stage automatically extends the label for each knob to include more local contextual information. In the CIT stage, a deep learning method projects the visual information extracted by a Faster R-CNN based object detector into a semantic space generated by a Graph Convolutional Network. The CID stage uses spatial relation reasoning between categories to refine the confidence scores. Finally, in the CIE stage, a new loose evaluation metric for storefront accessibility, especially for the knob category, is proposed to help blind and low vision (BLV) users efficiently find estimated knob locations. Our experimental results show that the proposed MultiCLU framework achieves significantly better performance than the baseline Faster R-CNN detector, with +13.4% mAP and +15.8% recall, respectively. Our new evaluation metric also introduces a new way to evaluate storefront accessibility objects, which could benefit the BLV community in real life.
Citations: 0
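
As a simple illustration of the CIL idea, extending each knob label to include local context, a box-expansion helper might look like the sketch below; the expansion ratio and image size are hypothetical, not values from the paper:

def expand_box(box, expand_ratio=0.5, img_w=1920, img_h=1080):
    """box: (x1, y1, x2, y2) in pixels; returns an enlarged, image-clipped label box."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    dx, dy = w * expand_ratio, h * expand_ratio
    return (max(0, x1 - dx), max(0, y1 - dy),
            min(img_w, x2 + dx), min(img_h, y2 + dy))

knob = (812, 540, 836, 566)        # a small doorknob box (hypothetical)
print(expand_box(knob))            # label region enlarged with local context around the knob
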