Title: PPVF: An Efficient Privacy-Preserving Online Video Fetching Framework with Correlated Differential Privacy
Authors: Xianzhi Zhang, Yipeng Zhou, Di Wu, Quan Z. Sheng, Miao Hu, Linchang Xiao
arXiv: 2408.14735 (arXiv - CS - Multimedia), 2024-08-27
Abstract: Online video streaming has evolved into an integral component of the contemporary Internet landscape. Yet, the disclosure of user requests presents formidable privacy challenges: as users stream their preferred online videos, their requests are automatically collected by video content providers, potentially leaking users' privacy. Unfortunately, current protection methods are not well-suited to preserving user request privacy from content providers while maintaining high-quality online video services. To tackle this challenge, we introduce a novel Privacy-Preserving Video Fetching (PPVF) framework, which utilizes trusted edge devices to pre-fetch and cache videos, ensuring the privacy of users' requests while optimizing the efficiency of edge caching. More specifically, we design PPVF with three core components: (1) an online privacy budget scheduler, which uses an online algorithm with theoretical guarantees to select non-requested videos as candidates and assign them privacy budgets, accounting for both video utilities and the available privacy budget; (2) a noisy video request generator, which issues redundant video requests (in addition to the original ones) using correlated differential privacy to obfuscate request privacy; and (3) an online video utility predictor, which leverages federated learning to collaboratively evaluate video utility in an online fashion, aiding video selection in (1) and noise generation in (2). Finally, we conduct extensive experiments using real-world video request traces from Tencent Video. The results demonstrate that PPVF effectively safeguards user request privacy while upholding high video caching performance.

Title: Revisiting Image Captioning Training Paradigm via Direct CLIP-based Optimization
Authors: Nicholas Moratelli, Davide Caffagni, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
arXiv: 2408.14547 (arXiv - CS - Multimedia), 2024-08-26
Abstract: The conventional training approach for image captioning involves pre-training a network using teacher forcing and subsequent fine-tuning with Self-Critical Sequence Training to maximize hand-crafted captioning metrics. However, when attempting to optimize modern and higher-quality metrics like CLIP-Score and PAC-Score, this training method often encounters instability and fails to acquire the genuine descriptive capabilities needed to produce fluent and informative captions. In this paper, we propose a new training paradigm termed Direct CLIP-Based Optimization (DiCO). Our approach jointly learns and optimizes a reward model that is distilled from a learnable captioning evaluator with high human correlation. This is done by solving a weighted classification problem directly inside the captioner. At the same time, DiCO prevents divergence from the original model, ensuring that fluency is maintained. DiCO not only exhibits improved stability and enhanced quality in the generated captions but also aligns more closely with human preferences compared to existing methods, especially in modern metrics. Additionally, it maintains competitive performance in traditional metrics. Our source code and trained models are publicly available at https://github.com/aimagelab/DiCO.

{"title":"Digital Fingerprinting on Multimedia: A Survey","authors":"Wendi Chen, Wensheng Gan, Philip S. Yu","doi":"arxiv-2408.14155","DOIUrl":"https://doi.org/arxiv-2408.14155","url":null,"abstract":"The explosive growth of multimedia content in the digital economy era has\u0000brought challenges in content recognition, copyright protection, and data\u0000management. As an emerging content management technology, perceptual hash-based\u0000digital fingerprints, serving as compact summaries of multimedia content, have\u0000been widely adopted for efficient multimedia content identification and\u0000retrieval across different modalities (e.g., text, image, video, audio),\u0000attracting significant attention from both academia and industry. Despite the\u0000increasing applications of digital fingerprints, there is a lack of systematic\u0000and comprehensive literature review on multimedia digital fingerprints. This\u0000survey aims to fill this gap and provide an important resource for researchers\u0000studying the details and related advancements of multimedia digital\u0000fingerprints. The survey first introduces the definition, characteristics, and\u0000related concepts (including hash functions, granularity, similarity measures,\u0000etc.) of digital fingerprints. It then focuses on analyzing and summarizing the\u0000algorithms for extracting unimodal fingerprints of different types of digital\u0000content, including text fingerprints, image fingerprints, video fingerprints,\u0000and audio fingerprints. Particularly, it provides an in-depth review and\u0000summary of deep learning-based fingerprints. Additionally, the survey\u0000elaborates on the various practical applications of digital fingerprints and\u0000outlines the challenges and potential future research directions. The goal is\u0000to promote the continued development of multimedia digital fingerprint\u0000research.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"8 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187319","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HABD: a houma alliance book ancient handwritten character recognition database","authors":"Xiaoyu Yuan, Xiaohua Huang, Zibo Zhang, Yabo Sun","doi":"arxiv-2408.14084","DOIUrl":"https://doi.org/arxiv-2408.14084","url":null,"abstract":"The Houma Alliance Book, one of history's earliest calligraphic examples, was\u0000unearthed in the 1970s. These artifacts were meticulously organized,\u0000reproduced, and copied by the Shanxi Provincial Institute of Cultural Relics.\u0000However, because of their ancient origins and severe ink erosion, identifying\u0000characters in the Houma Alliance Book is challenging, necessitating the use of\u0000digital technology. In this paper, we propose a new ancient handwritten\u0000character recognition database for the Houma alliance book, along with a novel\u0000benchmark based on deep learning architectures. More specifically, a collection\u0000of 26,732 characters samples from the Houma Alliance Book were gathered,\u0000encompassing 327 different types of ancient characters through iterative\u0000annotation. Furthermore, benchmark algorithms were proposed by combining four\u0000deep neural network classifiers with two data augmentation methods. This\u0000research provides valuable resources and technical support for further studies\u0000on the Houma Alliance Book and other ancient characters. This contributes to\u0000our understanding of ancient culture and history, as well as the preservation\u0000and inheritance of humanity's cultural heritage.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"26 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187323","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Localization of Synthetic Manipulations in Western Blot Images
Authors: Anmol Manjunath, Viola Negroni, Sara Mandelli, Daniel Moreira, Paolo Bestagini
arXiv: 2408.13786 (arXiv - CS - Multimedia), 2024-08-25
Abstract: Recent breakthroughs in deep learning and generative systems have significantly fostered the creation of synthetic media, as well as the local alteration of real content via the insertion of highly realistic synthetic manipulations. Local image manipulation, in particular, poses serious challenges to the integrity of digital content and societal trust. This problem is not confined to multimedia data, but also extends to biological images included in scientific publications, such as images depicting Western blots. In this work, we address the task of localizing synthetic manipulations in Western blot images. To discriminate between pristine and synthetic pixels of an analyzed image, we propose a synthetic detector that operates on small patches extracted from the image. We aggregate patch contributions to estimate a tampering heatmap, highlighting synthetic pixels among pristine ones. Our methodology proves effective when tested on two manipulated Western blot image datasets, one altered automatically and the other manually by exploiting advanced AI-based image manipulation tools that are unknown at our training stage. We also explore the robustness of our method on an external dataset of other scientific images depicting different semantics, manipulated through unseen generation techniques.

{"title":"SpeechCraft: A Fine-grained Expressive Speech Dataset with Natural Language Description","authors":"Zeyu Jin, Jia Jia, Qixin Wang, Kehan Li, Shuoyi Zhou, Songtao Zhou, Xiaoyu Qin, Zhiyong Wu","doi":"arxiv-2408.13608","DOIUrl":"https://doi.org/arxiv-2408.13608","url":null,"abstract":"Speech-language multi-modal learning presents a significant challenge due to\u0000the fine nuanced information inherent in speech styles. Therefore, a\u0000large-scale dataset providing elaborate comprehension of speech style is\u0000urgently needed to facilitate insightful interplay between speech audio and\u0000natural language. However, constructing such datasets presents a major\u0000trade-off between large-scale data collection and high-quality annotation. To\u0000tackle this challenge, we propose an automatic speech annotation system for\u0000expressiveness interpretation that annotates in-the-wild speech clips with\u0000expressive and vivid human language descriptions. Initially, speech audios are\u0000processed by a series of expert classifiers and captioning models to capture\u0000diverse speech characteristics, followed by a fine-tuned LLaMA for customized\u0000annotation generation. Unlike previous tag/templet-based annotation frameworks\u0000with limited information and diversity, our system provides in-depth\u0000understandings of speech style through tailored natural language descriptions,\u0000thereby enabling accurate and voluminous data generation for large model\u0000training. With this system, we create SpeechCraft, a fine-grained bilingual\u0000expressive speech dataset. It is distinguished by highly descriptive natural\u0000language style prompts, containing approximately 2,000 hours of audio data and\u0000encompassing over two million speech clips. Extensive experiments demonstrate\u0000that the proposed dataset significantly boosts speech-language task performance\u0000in stylist speech synthesis and speech style understanding.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"33 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187320","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Loc4Plan: Locating Before Planning for Outdoor Vision and Language Navigation","authors":"Huilin Tian, Jingke Meng, Wei-Shi Zheng, Yuan-Ming Li, Junkai Yan, Yunong Zhang","doi":"arxiv-2408.05090","DOIUrl":"https://doi.org/arxiv-2408.05090","url":null,"abstract":"Vision and Language Navigation (VLN) is a challenging task that requires\u0000agents to understand instructions and navigate to the destination in a visual\u0000environment.One of the key challenges in outdoor VLN is keeping track of which\u0000part of the instruction was completed. To alleviate this problem, previous\u0000works mainly focus on grounding the natural language to the visual input, but\u0000neglecting the crucial role of the agent's spatial position information in the\u0000grounding process. In this work, we first explore the substantial effect of\u0000spatial position locating on the grounding of outdoor VLN, drawing inspiration\u0000from human navigation. In real-world navigation scenarios, before planning a\u0000path to the destination, humans typically need to figure out their current\u0000location. This observation underscores the pivotal role of spatial localization\u0000in the navigation process. In this work, we introduce a novel framework,\u0000Locating be for Planning (Loc4Plan), designed to incorporate spatial perception\u0000for action planning in outdoor VLN tasks. The main idea behind Loc4Plan is to\u0000perform the spatial localization before planning a decision action based on\u0000corresponding guidance, which comprises a block-aware spatial locating (BAL)\u0000module and a spatial-aware action planning (SAP) module. Specifically, to help\u0000the agent perceive its spatial location in the environment, we propose to learn\u0000a position predictor that measures how far the agent is from the next\u0000intersection for reflecting its position, which is achieved by the BAL module.\u0000After the locating process, we propose the SAP module to incorporate spatial\u0000information to ground the corresponding guidance and enhance the precision of\u0000action planning. Extensive experiments on the Touchdown and map2seq datasets\u0000show that the proposed Loc4Plan outperforms the SOTA methods.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"42 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141941822","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deep joint source-channel coding for wireless point cloud transmission","authors":"Cixiao Zhang, Mufan Liu, Wenjie Huang, Yin Xu, Yiling Xu, Dazhi He","doi":"arxiv-2408.04889","DOIUrl":"https://doi.org/arxiv-2408.04889","url":null,"abstract":"The growing demand for high-quality point cloud transmission over wireless\u0000networks presents significant challenges, primarily due to the large data sizes\u0000and the need for efficient encoding techniques. In response to these\u0000challenges, we introduce a novel system named Deep Point Cloud Semantic\u0000Transmission (PCST), designed for end-to-end wireless point cloud transmission.\u0000Our approach employs a progressive resampling framework using sparse\u0000convolution to project point cloud data into a semantic latent space. These\u0000semantic features are subsequently encoded through a deep joint source-channel\u0000(JSCC) encoder, generating the channel-input sequence. To enhance transmission\u0000efficiency, we use an adaptive entropy-based approach to assess the importance\u0000of each semantic feature, allowing transmission lengths to vary according to\u0000their predicted entropy. PCST is robust across diverse Signal-to-Noise Ratio\u0000(SNR) levels and supports an adjustable rate-distortion (RD) trade-off,\u0000ensuring flexible and efficient transmission. Experimental results indicate\u0000that PCST significantly outperforms traditional separate source-channel coding\u0000(SSCC) schemes, delivering superior reconstruction quality while achieving over\u0000a 50% reduction in bandwidth usage.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"16 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141941938","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Emotional Cues Extraction and Fusion for Multi-modal Emotion Prediction and Recognition in Conversation","authors":"Haoxiang Shi, Ziqi Liang, Jun Yu","doi":"arxiv-2408.04547","DOIUrl":"https://doi.org/arxiv-2408.04547","url":null,"abstract":"Emotion Prediction in Conversation (EPC) aims to forecast the emotions of\u0000forthcoming utterances by utilizing preceding dialogues. Previous EPC\u0000approaches relied on simple context modeling for emotion extraction,\u0000overlooking fine-grained emotion cues at the word level. Additionally, prior\u0000works failed to account for the intrinsic differences between modalities,\u0000resulting in redundant information. To overcome these limitations, we propose\u0000an emotional cues extraction and fusion network, which consists of two stages:\u0000a modality-specific learning stage that utilizes word-level labels and prosody\u0000learning to construct emotion embedding spaces for each modality, and a\u0000two-step fusion stage for integrating multi-modal features. Moreover, the\u0000emotion features extracted by our model are also applicable to the Emotion\u0000Recognition in Conversation (ERC) task. Experimental results validate the\u0000efficacy of the proposed method, demonstrating superior performance on both\u0000IEMOCAP and MELD datasets.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"6 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141941819","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: MultiColor: Image Colorization by Learning from Multiple Color Spaces
Authors: Xiangcheng Du, Zhao Zhou, Yanlong Wang, Zhuoyao Wang, Yingbin Zheng, Cheng Jin
arXiv: 2408.04172 (arXiv - CS - Multimedia), 2024-08-08
Abstract: Deep networks have shown impressive performance in image restoration tasks such as image colorization. However, we find that previous approaches rely on the digital representation of a single color model with a specific mapping function, a.k.a. a color space, throughout the colorization pipeline. In this paper, we first investigate the modeling of different color spaces and find that each exhibits distinctive characteristics with a unique distribution of colors. The complementarity among multiple color spaces leads to benefits for the image colorization task. We present MultiColor, a new learning-based approach to automatically colorize grayscale images that combines clues from multiple color spaces. Specifically, we employ a set of dedicated colorization modules for individual color spaces. Within each module, a transformer decoder first refines color query embeddings, and then a color mapper produces color channel predictions using the embeddings and semantic features. With these predicted color channels representing various color spaces, a complementary network is designed to exploit the complementarity and generate pleasing and reasonable colorized images. We conduct extensive experiments on real-world datasets, and the results demonstrate superior performance over state-of-the-art methods.
