Deep Gated Multi-modal Fusion for Image Privacy Prediction

IF 2.6; CAS Zone 4, Computer Science; JCR Q2, COMPUTER SCIENCE, INFORMATION SYSTEMS

ACM Transactions on the Web. Pub Date: 2023-07-22. DOI: https://dl.acm.org/doi/10.1145/3608446

Chenye Zhao, Cornelia Caragea


Citations: 0

Abstract

With the rapid development of mobile device technology, people can post their daily lives on social networking sites such as Facebook, Flickr, and Instagram. This raises new privacy concerns, because people often do not realize that private information can be leaked and used to their detriment. Image privacy prediction models are developed to predict whether images contain sensitive information (private images) or are safe to share online (public images). Despite significant progress on this task, some crucial problems remain. First, image content and tags have both been found to be useful modalities for automatically predicting image privacy, yet most image privacy prediction models to date use a single modality (image-only or tag-only), which limits their performance. Second, we observe that current image privacy prediction models are surprisingly vulnerable to even small perturbations in the input data: attackers can add small perturbations to the input and easily degrade a well-trained model. To address these challenges, we propose a new decision-level gated multi-modal fusion (GMMF) approach that fuses the object, scene, and image-tag modalities to predict the privacy of online images. In particular, the proposed approach determines fusion weights for the class probability distributions generated by the single-modal classifiers according to how reliable each classifier is for the target image, on a sample-by-sample basis, and then performs a weighted decision-level fusion, so that modalities with high reliability are assigned higher fusion weights while those with low reliability are suppressed with lower fusion weights. Our experimental results show that the gated multi-modal fusion network effectively fuses the single modalities and outperforms state-of-the-art models for image privacy prediction.
Moreover, we perform adversarial training on the proposed GMMF model using multiple types of noise on the input data (i.e., images and/or tags). When some modalities are compromised by noise attacks on their input, our approach effectively exploits the clean modalities and uses the fusion weights to minimize the negative influence of the degraded ones, achieving significantly stronger robustness than traditional fusion methods for image privacy prediction. The robustness of our GMMF model to data noise even generalizes to more severe noise levels. To the best of our knowledge, we are the first to investigate the robustness of image privacy prediction models against noise attacks. Finally, because the performance of decision-level multi-modal fusion depends heavily on the quality of the single-modal networks, we investigate self-distillation on the single-modal privacy classifiers and observe that transferring knowledge from a trained teacher model to a student model is beneficial in our proposed approach.
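
The weighted decision-level fusion described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how the fusion weights are computed, so the hand-set `gate_scores`, the softmax normalization, and the function name `gated_fusion` are all assumptions made for illustration. In the actual model the per-sample weights would presumably come from a learned gating network.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_fusion(modality_probs, gate_scores):
    """Fuse per-modality class distributions for ONE image.

    modality_probs: dict name -> [p_private, p_public], the output of a
        single-modal classifier (object, scene, or tags).
    gate_scores: dict name -> unnormalized reliability score for this image.
        Hypothetical input: hand-set here, learned in a real gating network.
    Returns the fused distribution and the normalized fusion weights.
    """
    names = list(modality_probs)
    # Assumption: weights are a softmax over the reliability scores,
    # so they are positive and sum to 1.
    weights = softmax([gate_scores[n] for n in names])
    n_classes = len(modality_probs[names[0]])
    fused = [0.0] * n_classes
    for w, name in zip(weights, names):
        for c in range(n_classes):
            fused[c] += w * modality_probs[name][c]
    return fused, dict(zip(names, weights))


# Toy example: the tag modality is confidently wrong (e.g. noisy tags),
# and the gate assigns it a low reliability score.
probs = {
    "object": [0.80, 0.20],
    "scene": [0.60, 0.40],
    "tags": [0.02, 0.98],
}
scores = {"object": 2.0, "scene": 1.0, "tags": -2.0}

fused, weights = gated_fusion(probs, scores)
label = "private" if fused[0] > fused[1] else "public"

# Baseline for comparison: unweighted averaging of the three distributions.
avg_private = sum(p[0] for p in probs.values()) / len(probs)
avg_label = "private" if avg_private > 0.5 else "public"
```

In this toy setting the unweighted average is pulled toward "public" by the unreliable tag classifier, while the gated fusion follows the two reliable modalities. This is the behavior the abstract attributes to sample-wise fusion weights under noisy input.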




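Source journal

ACM Transactions on the Web (Engineering & Technology, Computer Science: Software Engineering)

CiteScore: 4.90

Self-citation rate: 0.00%

Articles published: not listed

Review time: 7.5 months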

Journal description: Transactions on the Web (TWEB) is a journal publishing refereed articles reporting the results of research on Web content, applications, use, and related enabling technologies. Topics in the scope of TWEB include but are not limited to the following: Browsers and Web Interfaces; Electronic Commerce; Electronic Publishing; Hypertext and Hypermedia; Semantic Web; Web Engineering; Web Services and Service-Oriented Computing; and XML. In addition, papers addressing the intersection of the following broader technologies with the Web are also in scope: Accessibility; Business Services; Education; Knowledge Management and Representation; Mobility and Pervasive Computing; Performance and Scalability; Recommender Systems; Searching, Indexing, Classification, Retrieval and Querying, Data Mining and Analysis; Security and Privacy; and User Interfaces. Papers discussing specific Web technologies, applications, content generation and management, and use are within scope. Papers describing novel applications of the Web, as well as papers on the underlying technologies, are also welcome.