Authors: Yongiin Lu, Wei-bang Chen, Zanyah Ailsworth, Xiaoliang Wang, Chengcui Zhang, Kaixuan Li
Venue: 2023 IEEE 24th International Conference on Information Reuse and Integration for Data Science (IRI), published 2023-08-01
DOI: 10.1109/IRI58017.2023.00013 (https://doi.org/10.1109/IRI58017.2023.00013)
Hybrid Convolutional Autoencoder-Hierarchical Clustering Algorithm To Reveal Image Spam Sources
We propose a novel hybrid algorithmic framework for clustering images received in spam emails by authorship. These images are multimodal: they may contain foreground objects, text, or a combination of both, which makes them difficult to group effectively. To address this challenge, we train convolutional autoencoders (CAEs) and use the outputs of the trained encoders as visual features of the images. We also apply an optical character recognition (OCR) algorithm to extract text from the images. The extracted text and visual features, together with layout features, are used to construct matrices that measure the similarity between each pair of images in our experimental dataset. We then apply a two-stage hierarchical clustering algorithm to group the images. We compare the results of the proposed algorithm with ground truth collected by a domain expert. Our experimental findings show that our relatively simple CAEs, with as few as thirty-seven visual features, can achieve homogeneity, completeness, and V-measure scores as high as those obtained from more complex convolutional neural networks (CNNs).
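The abstract reports homogeneity, completeness, and V-measure against expert-provided ground truth but does not say how they were computed. As a minimal sketch, assuming the standard entropy-based definitions with equal weighting of the two components, the three scores could be calculated as follows. The function names and the toy labels (`expert_labels`, `cluster_ids`) are hypothetical and are not taken from the paper.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def conditional_entropy(target, given):
    """H(target | given) for two parallel label lists."""
    n = len(target)
    joint = Counter(zip(given, target))
    given_counts = Counter(given)
    return -sum(
        (c / n) * log(c / given_counts[g]) for (g, _), c in joint.items()
    )


def v_measure(truth, predicted):
    """Return (homogeneity, completeness, V-measure), equally weighted."""
    h_truth, h_pred = entropy(truth), entropy(predicted)
    # Homogeneity: each cluster contains members of a single true class.
    homogeneity = (
        1.0 if h_truth == 0
        else 1.0 - conditional_entropy(truth, predicted) / h_truth
    )
    # Completeness: all members of a true class land in the same cluster.
    completeness = (
        1.0 if h_pred == 0
        else 1.0 - conditional_entropy(predicted, truth) / h_pred
    )
    if homogeneity + completeness == 0:
        return homogeneity, completeness, 0.0
    # V-measure is the harmonic mean of the two scores.
    v = 2 * homogeneity * completeness / (homogeneity + completeness)
    return homogeneity, completeness, v


# Hypothetical toy data: six images from three spam sources (expert labels),
# where the clustering has merged sources B and C into one cluster.
expert_labels = ["A", "A", "B", "B", "C", "C"]
cluster_ids = [0, 0, 1, 1, 1, 1]
h, c, v = v_measure(expert_labels, cluster_ids)
print(f"homogeneity={h:.3f} completeness={c:.3f} V={v:.3f}")
```

In this toy case completeness is perfect because no source is split across clusters, while homogeneity drops because one cluster mixes two sources. The V-measure, as the harmonic mean, penalizes that imbalance.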