Title: ERPG: Enhancing Entity Representations with Prompt Guidance for Complex Named Entity Recognition
Authors: Xingyu Zhu, Feifei Dai, Xiaoyan Gu, Haihui Fan, B. Li, Weiping Wang
Venue: 2023 IEEE International Conference on Multimedia and Expo (ICME), July 2023
DOI: 10.1109/ICME55011.2023.00478 (https://doi.org/10.1109/ICME55011.2023.00478)
Abstract: Recently, sequence generation methods have been widely used for complex named entity recognition. By selecting highly related tokens to generate complex named entities, these methods have achieved notable results. However, because they lack guidance in learning the output format and ignore labels when obtaining features, sequence generation methods suffer from invalid outputs and inaccurate recognition. To address this, we propose ERPG, a method that enhances entity representations with prompt guidance. Specifically, to reduce invalid outputs, we design a candidate entity generation module that generates candidate entities and their labels in the expected format. In addition, to recognize candidate entities accurately, we propose a candidate entity refinement module that obtains distinguishable candidate entity representations and filters them reliably. With these components, our method outperforms baselines by 1.20, 1.62, and 0.69 F1 points on the ACE2004, GENIA, and CADEC corpora, demonstrating its effectiveness for complex named entity recognition.
Title: Variational Information Bottleneck for Cross Domain Object Detection
Authors: Jiangming Chen, Wanxia Deng, Bo Peng, Tianpeng Liu, Yingmei Wei, Li Liu
Venue: 2023 IEEE International Conference on Multimedia and Expo (ICME), July 2023
DOI: 10.1109/ICME55011.2023.00381 (https://doi.org/10.1109/ICME55011.2023.00381)
Abstract: Cross domain object detection leverages a labeled source domain to learn an object detector that performs well on a novel, unlabeled target domain. To alleviate the domain discrepancy, most existing works align distributions using knowledge from the entire image, ignoring the interference of task-uncorrelated information. To tackle this issue, we propose a novel information-theoretic module called Variational Instance Disentanglement (VID), which decouples task-correlated information while filtering out task-uncorrelated factors at the instance level. Notably, the proposed VID can be used as a plug-and-play module without extra network parameter cost. We combine it with an adversarial network and a self-training network, forming the Variational Instance Disentanglement Adversarial Network (VIDAN) and the Variational Instance Disentanglement Self-training Network (VIDSN), respectively. Extensive experiments on multiple widely used scenarios show that the proposed method improves the performance of popular frameworks and outperforms state-of-the-art methods.
Title: Image Compressed Sensing Using Multi-Scale Characteristic Residual Learning
Authors: Shumian Yang, Xinxin Xiang, Fenghua Tong, Dawei Zhao, Xin Li
Venue: 2023 IEEE International Conference on Multimedia and Expo (ICME), July 2023
DOI: 10.1109/ICME55011.2023.00275 (https://doi.org/10.1109/ICME55011.2023.00275)
Abstract: Deep network-based image compressed sensing (CS) methods have attracted much attention in recent years due to their low reconstruction complexity and high reconstruction quality. However, existing methods usually use one or more convolution layers built from kernels of a single size to extract image features during sampling, which results in incomplete feature extraction. In addition, existing models usually focus on extracting deep features during reconstruction while ignoring the influence of shallow features. To overcome these issues, this paper proposes a multi-scale characteristic residual learning network (dubbed MSCRLNet) for image CS. In this network, convolutional kernels of different sizes capture multi-level spatial features during sampling, and a multi-scale residual network with channel attention speeds up convergence during reconstruction. Experiments show that the proposed MSCRLNet outperforms many existing state-of-the-art methods.
Title: Learning High Frequency Surface Functions In Shells
Authors: Han Guo, Yuanlong Yu, Yujie Wang, Xuelin Chen, Yixin Zhuang
Venue: 2023 IEEE International Conference on Multimedia and Expo (ICME), July 2023
DOI: 10.1109/ICME55011.2023.00112 (https://doi.org/10.1109/ICME55011.2023.00112)
Abstract: Recently, coordinate-based MLPs have been shown to be powerful representations for 3D surfaces, where learning high-frequency details is facilitated by modulating surface functions with periodic functions [1], [2]. While shortening the period helps in learning high frequencies, it increases ambiguity: more points along the axis directions become similar in the embedded space, so many points on and off the surface receive similar predictions. In addition, a short period increases local geometric variation, leading to unexpected noisy artifacts in untrained regions. Unlike existing methods that learn surface functions in a regular cube, we learn surfaces within shells, a coarse form of the target surfaces constructed by a binary classifier. The advantage of building surfaces in shells is that the MLPs focus on regions of interest, which inherently reduces ambiguity and also improves training efficiency and test accuracy. We demonstrate the effectiveness of shells and show significant improvements over baseline methods in 3D surface reconstruction from raw point clouds.
{"title":"DSP-Net: Diverse Structure Prior Network for Image Inpainting","authors":"Lin Sun, Chao Yang, Bin Jiang","doi":"10.1109/ICME55011.2023.00088","DOIUrl":"https://doi.org/10.1109/ICME55011.2023.00088","url":null,"abstract":"The latest deep learning-based approaches have advanced diverse image inpainting task. However, existing methods limit to be aware of the structure information well, which constricts the performance of diverse generations. The intuitive representation of diversity generation is the structure change since the structure is the basis of the image. In this paper, we make full use of the structure information and propose the diverse structure prior network (DSP-Net). Specifically, there are two stages in DSP-Net to generate the diverse structure first and refine the texture next. For the diverse structure generation, we prompt the structural distribution to be similar to the Gaussian distribution to sample the diverse structural prior. With these priors, we refine the texture with a proposed propagation attention module. Meanwhile, we propose a structure diversity loss to enhance the ability of diverse structure generation further. Experiments on benchmark datasets including CelebA-HQ and Places2 indicate that DSP-Net is effective for diverse and visually realistic image restoration.","PeriodicalId":321830,"journal":{"name":"2023 IEEE International Conference on Multimedia and Expo (ICME)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114733119","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Modality-Fusion Spiking Transformer Network for Audio-Visual Zero-Shot Learning
Authors: Wenrui Li, Zhengyu Ma, Liang-Jian Deng, Hengyu Man, Xiaopeng Fan
Venue: 2023 IEEE International Conference on Multimedia and Expo (ICME), July 2023
DOI: 10.1109/ICME55011.2023.00080 (https://doi.org/10.1109/ICME55011.2023.00080)
Abstract: Audio-visual zero-shot learning (ZSL), which learns to classify video data from classes not observed during training, is challenging. In audio-visual ZSL, the semantic and temporal information of the different modalities are relevant to each other, yet effectively extracting and fusing information from the audio and visual modalities remains an open challenge. In this work, we propose an Audio-Visual Modality-fusion Spiking Transformer network (AVMST) for audio-visual ZSL. More specifically, AVMST provides a spiking neural network (SNN) module for extracting salient temporal information from each modality, a cross-attention block to effectively fuse the temporal and semantic information, and a transformer reasoning module to further explore the interrelationships of the fused features. To provide robust temporal features, the spiking threshold of the SNN module is adjusted dynamically based on the semantic cues of the different modalities. The generated feature map suits the zero-shot setting thanks to the proposed spiking transformer's ability to combine the robustness of SNN feature extraction with the precision of transformer feature inference. Extensive experiments on three benchmark audio-visual datasets (VGGSound, UCF and ActivityNet) validate that the proposed AVMST outperforms existing state-of-the-art methods by a significant margin. The code and pre-trained models are available at https://github.com/liwr-hit/ICME23_AVMST.
{"title":"RASNet: A Reinforcement Assistant Network for Frame Selection in Video-based Posture Recognition","authors":"Ruotong Hu, Xianzhi Wang, Xiaojun Chang, Yeqi Hu, Xiaowei Xin, Xiangqian Ding, Baoqi Guo","doi":"10.1109/ICME55011.2023.00366","DOIUrl":"https://doi.org/10.1109/ICME55011.2023.00366","url":null,"abstract":"Most existing video-based posture recognition methods treat frames equally using unified or random sampling strategies, thus losing the temporal relationship information among frames. To address this problem, we propose a lightweight framework, namely RASNet, to adaptively select informative frames for recognition. Specifically, we design a video-suited exploration environment to guide the agent in learning the selection strategy. We introduce the reparametrization method to convert the discrete action space into a continuous space, making the agent robust and random. For the reward part, we design a multi-factor function to reward the agent keeping a balance between frame usage and accuracy. Extensive experiments on three large-scale datasets prove the effectiveness of RASNet, e.g., achieving 85.9% accuracy with fewer 1.15 frames than other state-of-the-art methods on Kinetics 600.","PeriodicalId":321830,"journal":{"name":"2023 IEEE International Conference on Multimedia and Expo (ICME)","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116985538","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DiST-GAN: Distillation-based Semantic Transfer for Text-Guided Face Generation","authors":"Guoxing Yang, Feifei Fu, Nanyi Fei, Hao Wu, Ruitao Ma, Zhiwu Lu","doi":"10.1109/ICME55011.2023.00149","DOIUrl":"https://doi.org/10.1109/ICME55011.2023.00149","url":null,"abstract":"Recently, large-scale pre-training has achieved great success in multi-modal tasks and shown powerful generalization ability due to superior semantic comprehension. In the field of text-to-image synthesis, recent works induce large-scale pre-training with VQ-VAE as a discrete visual tokenizer, which can synthesize realistic images from arbitrary text inputs. However, the quality of images generated by these methods is still inferior to that of images generated by GAN-based methods, especially in some specific domains. To leverage both the superior semantic comprehension of large-scale pre-training models and the powerful ability of GAN-based models in photorealistic image generation, we propose a novel knowledge distillation framework termed DiST-GAN to transfer the semantic knowledge of large-scale visual-language pre-training models (e.g., CLIP) to GAN-based generator for text-guided face image generation. Our DiST-GAN consists of two key components: (1) A new CLIP-based adaptive contrastive loss is devised to ensure the generated images are consistent with the input texts. (2) A language-to-vision (L2V) transformation module is learned to transform token embeddings of each text into an intermediate embedding that is aligned with the image embedding extracted by CLIP. With these two novel components, the semantic knowledge contained in CLIP can thus be transferred to GAN-based generator which preserves the superior ability of photorealistic image generation in the mean time. Extensive results on the Multi-Modal CelebA-HQ dataset show that our DiST-GAN achieves significant improvements over the state-of-the-arts.","PeriodicalId":321830,"journal":{"name":"2023 IEEE International Conference on Multimedia and Expo (ICME)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117326015","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: A Geometrical Characterization on Feature Density of Image Datasets
Authors: Zhen Liang, Changyuan Zhao, Wanwei Liu, Bai Xue, Wenjing Yang
Venue: 2023 IEEE International Conference on Multimedia and Expo (ICME), July 2023
DOI: 10.1109/ICME55011.2023.00313 (https://doi.org/10.1109/ICME55011.2023.00313)
Abstract: Recently, the interpretability and verification of deep learning have attracted enormous attention from both academic and industrial communities, aiming to gain users' trust and ease their concerns. To guide learning procedures and data operations in a more interpretable way, in this paper we take a similar perspective on image datasets, the inputs of deep learning. Based on manifold learning, we derive an interpretable geometrical characterization of manifold curvature that depicts the feature density of datasets, represented as the ratio of the Euclidean distance to the geodesic distance. This is a noteworthy characteristic of image datasets, and we take the dataset compression and enhancement problems as application instances, performing sample credit assignment with the geometrical information. Experiments on typical image datasets demonstrate the effectiveness and promise of the presented geometrical characteristic.
Title: Optimizing Video Streaming for Sustainability and Quality: The Role of Preset Selection in Per-Title Encoding
Authors: Hadi Amirpour, V. V. Menon, Samira Afzal, R.-C. Prodan, C. Timmerer
Venue: 2023 IEEE International Conference on Multimedia and Expo (ICME), July 2023
DOI: 10.1109/ICME55011.2023.00289 (https://doi.org/10.1109/ICME55011.2023.00289)
Abstract: HTTP Adaptive Streaming (HAS) methods divide a video into smaller segments encoded at multiple pre-defined bitrates to construct a bitrate ladder. Bitrate ladders are usually optimized per title over several dimensions, such as bitrate, resolution, and framerate. This paper adds a new dimension to the bitrate ladder by considering the energy consumption of the encoding process. Video encoders often provide multiple pre-defined presets to balance the trade-off between encoding time, energy consumption, and compression efficiency: faster presets disable certain coding tools defined by the codec to reduce encoding time at the cost of reduced compression efficiency. First, this paper evaluates the energy consumption and compression efficiency of different x265 presets for 500 video sequences. Second, optimized presets are selected for the various representations in a bitrate ladder based on these results, guaranteeing a minimal drop in video quality while saving energy. Finally, a new per-title model that optimizes the trade-off between compression efficiency and energy consumption is proposed. The experimental results show that accepting VMAF score decreases of 0.15 and 0.39 when choosing an optimized preset yields encoding energy savings of 70% and 83%, respectively.