{"title":"图像隐私预测的深门控多模态融合","authors":"Chenye Zhao, Cornelia Caragea","doi":"https://dl.acm.org/doi/10.1145/3608446","DOIUrl":null,"url":null,"abstract":"<p>With the rapid development of technologies in mobile devices, people can post their daily lives on social networking sites such as Facebook, Flickr, and Instagram. This leads to new privacy concerns due to people’s lack of understanding that private information can be leaked and used to their detriment. Image privacy prediction models are developed to predict whether images contain sensitive information (private images) or are safe to be shared online (public images). Despite significant progress on this task, there are still some crucial problems that remain to be solved. Firstly, images’ content and tags are found to be useful modalities to automatically predict images’ privacy. To date, most image privacy prediction models use single modalities (image-only or tag-only), which limits their performance. Secondly, we observe that current image privacy prediction models are surprisingly vulnerable to even small perturbations in the input data. Attackers can add small perturbations to input data and easily damage a well-trained image privacy prediction model. To address these challenges, in this paper, we propose a new decision-level Gated multi-modal fusion (GMMF) approach that fuses object, scene, and image tags modalities to predict privacy for online images. In particular, the proposed approach identifies fusion weights of class probability distributions generated by single-modal classifiers according to their reliability of the privacy prediction for each target image in a sample-by-sample manner and performs a weighted decision-level fusion, so that modalities with high reliability are assigned with higher fusion weights while ones with low reliability are restrained with lower fusion weights. The results of our experiments show that the gated multi-modal fusion network effectively fuses single modalities and outperforms state-of-the-art models for image privacy prediction. Moreover, we perform adversarial training on our proposed GMMF model using multiple types of noise on input data (i.e., images and/or tags). When some modalities are failed by input data with noise attacks, our approach effectively utilizes clean modalities and minimizes negative influences brought by degraded ones using fusion weights, achieving significantly stronger robustness over traditional fusion methods for image privacy prediction. The robustness of our GMMF model against data noise can even be generalized to more severe noise levels. To the best of our knowledge, we are the first to investigate the robustness of image privacy prediction models against noise attacks. Moreover, as the performance of decision-level multi-modal fusion depends highly on the quality of single-modal networks, we investigate self-distillation on single-modal privacy classifiers and observe that transferring knowledge from a trained teacher model to a student model is beneficial in our proposed approach.</p>","PeriodicalId":50940,"journal":{"name":"ACM Transactions on the Web","volume":"42 36","pages":""},"PeriodicalIF":2.6000,"publicationDate":"2023-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Deep Gated Multi-modal Fusion for Image Privacy Prediction\",\"authors\":\"Chenye Zhao, Cornelia Caragea\",\"doi\":\"https://dl.acm.org/doi/10.1145/3608446\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>With the rapid development of technologies in mobile devices, people can post their daily lives on social networking sites such as Facebook, Flickr, and Instagram. This leads to new privacy concerns due to people’s lack of understanding that private information can be leaked and used to their detriment. Image privacy prediction models are developed to predict whether images contain sensitive information (private images) or are safe to be shared online (public images). Despite significant progress on this task, there are still some crucial problems that remain to be solved. Firstly, images’ content and tags are found to be useful modalities to automatically predict images’ privacy. To date, most image privacy prediction models use single modalities (image-only or tag-only), which limits their performance. Secondly, we observe that current image privacy prediction models are surprisingly vulnerable to even small perturbations in the input data. Attackers can add small perturbations to input data and easily damage a well-trained image privacy prediction model. To address these challenges, in this paper, we propose a new decision-level Gated multi-modal fusion (GMMF) approach that fuses object, scene, and image tags modalities to predict privacy for online images. In particular, the proposed approach identifies fusion weights of class probability distributions generated by single-modal classifiers according to their reliability of the privacy prediction for each target image in a sample-by-sample manner and performs a weighted decision-level fusion, so that modalities with high reliability are assigned with higher fusion weights while ones with low reliability are restrained with lower fusion weights. The results of our experiments show that the gated multi-modal fusion network effectively fuses single modalities and outperforms state-of-the-art models for image privacy prediction. Moreover, we perform adversarial training on our proposed GMMF model using multiple types of noise on input data (i.e., images and/or tags). When some modalities are failed by input data with noise attacks, our approach effectively utilizes clean modalities and minimizes negative influences brought by degraded ones using fusion weights, achieving significantly stronger robustness over traditional fusion methods for image privacy prediction. The robustness of our GMMF model against data noise can even be generalized to more severe noise levels. To the best of our knowledge, we are the first to investigate the robustness of image privacy prediction models against noise attacks. Moreover, as the performance of decision-level multi-modal fusion depends highly on the quality of single-modal networks, we investigate self-distillation on single-modal privacy classifiers and observe that transferring knowledge from a trained teacher model to a student model is beneficial in our proposed approach.</p>\",\"PeriodicalId\":50940,\"journal\":{\"name\":\"ACM Transactions on the Web\",\"volume\":\"42 36\",\"pages\":\"\"},\"PeriodicalIF\":2.6000,\"publicationDate\":\"2023-07-22\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"ACM Transactions on the Web\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/https://dl.acm.org/doi/10.1145/3608446\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on the Web","FirstCategoryId":"94","ListUrlMain":"https://doi.org/https://dl.acm.org/doi/10.1145/3608446","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Deep Gated Multi-modal Fusion for Image Privacy Prediction
With the rapid development of technologies in mobile devices, people can post their daily lives on social networking sites such as Facebook, Flickr, and Instagram. This leads to new privacy concerns due to people’s lack of understanding that private information can be leaked and used to their detriment. Image privacy prediction models are developed to predict whether images contain sensitive information (private images) or are safe to be shared online (public images). Despite significant progress on this task, there are still some crucial problems that remain to be solved. Firstly, images’ content and tags are found to be useful modalities to automatically predict images’ privacy. To date, most image privacy prediction models use single modalities (image-only or tag-only), which limits their performance. Secondly, we observe that current image privacy prediction models are surprisingly vulnerable to even small perturbations in the input data. Attackers can add small perturbations to input data and easily damage a well-trained image privacy prediction model. To address these challenges, in this paper, we propose a new decision-level Gated multi-modal fusion (GMMF) approach that fuses object, scene, and image tags modalities to predict privacy for online images. In particular, the proposed approach identifies fusion weights of class probability distributions generated by single-modal classifiers according to their reliability of the privacy prediction for each target image in a sample-by-sample manner and performs a weighted decision-level fusion, so that modalities with high reliability are assigned with higher fusion weights while ones with low reliability are restrained with lower fusion weights. The results of our experiments show that the gated multi-modal fusion network effectively fuses single modalities and outperforms state-of-the-art models for image privacy prediction. Moreover, we perform adversarial training on our proposed GMMF model using multiple types of noise on input data (i.e., images and/or tags). When some modalities are failed by input data with noise attacks, our approach effectively utilizes clean modalities and minimizes negative influences brought by degraded ones using fusion weights, achieving significantly stronger robustness over traditional fusion methods for image privacy prediction. The robustness of our GMMF model against data noise can even be generalized to more severe noise levels. To the best of our knowledge, we are the first to investigate the robustness of image privacy prediction models against noise attacks. Moreover, as the performance of decision-level multi-modal fusion depends highly on the quality of single-modal networks, we investigate self-distillation on single-modal privacy classifiers and observe that transferring knowledge from a trained teacher model to a student model is beneficial in our proposed approach.
期刊介绍:
Transactions on the Web (TWEB) is a journal publishing refereed articles reporting the results of research on Web content, applications, use, and related enabling technologies. Topics in the scope of TWEB include but are not limited to the following: Browsers and Web Interfaces; Electronic Commerce; Electronic Publishing; Hypertext and Hypermedia; Semantic Web; Web Engineering; Web Services; and Service-Oriented Computing XML.
In addition, papers addressing the intersection of the following broader technologies with the Web are also in scope: Accessibility; Business Services Education; Knowledge Management and Representation; Mobility and pervasive computing; Performance and scalability; Recommender systems; Searching, Indexing, Classification, Retrieval and Querying, Data Mining and Analysis; Security and Privacy; and User Interfaces.
Papers discussing specific Web technologies, applications, content generation and management and use are within scope. Also, papers describing novel applications of the web as well as papers on the underlying technologies are welcome.