Multi-level Cross-modal Alignment for Image Clustering

AAAI Conference on Artificial Intelligence. Pub Date: 2024-01-22. DOI: 10.48550/arXiv.2401.11740

Liping Qiu, Qin Zhang, Xiaojun Chen, Shao-Qian Cai

Citations: 0

Abstract

Recently, cross-modal pretraining models have been employed to produce meaningful pseudo-labels that supervise the training of an image clustering model. However, the numerous erroneous alignments in a cross-modal pretraining model can produce poor-quality pseudo-labels and degrade clustering performance. To address this issue, we propose a novel Multi-level Cross-modal Alignment method that improves the alignments in a cross-modal pretraining model for downstream tasks by building a smaller but better semantic space and aligning the images and texts at three levels: instance level, prototype level, and semantic level. Theoretical results show that our proposed method converges and suggest effective means to reduce its expected clustering risk. Experimental results on five benchmark datasets clearly show the superiority of our new method.
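To make the pseudo-labelling idea concrete, the following is a minimal sketch of how image-text similarities from a cross-modal model might be turned into pseudo-labels. It is not the paper's method: the function names (`normalize`, `softmax`, `pseudo_labels`), the toy 2-D embeddings, and the `temperature` and `min_conf` parameters are all assumptions made for illustration. It only shows the generic step that the abstract says can go wrong, namely that an ambiguous image-text alignment yields a low-confidence label.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def pseudo_labels(image_embs, text_embs, temperature=0.1, min_conf=0.9):
    """Assign each image the index of its most similar text embedding.

    Returns one (label, confidence, keep) tuple per image, where `keep`
    is False when the confidence falls below `min_conf` (hypothetical
    filtering rule, assumed here for illustration).
    """
    texts = [normalize(t) for t in text_embs]
    results = []
    for img in image_embs:
        img = normalize(img)
        # Cosine similarity between this image and every text embedding.
        sims = [sum(a * b for a, b in zip(img, t)) for t in texts]
        probs = softmax([s / temperature for s in sims])
        label = max(range(len(probs)), key=probs.__getitem__)
        results.append((label, probs[label], probs[label] >= min_conf))
    return results

# Toy embeddings (assumed): two clear images and one ambiguous image.
image_embs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.5]]
text_embs = [[1.0, 0.0], [0.0, 1.0]]
labels = pseudo_labels(image_embs, text_embs)
for label, conf, keep in labels:
    print(label, round(conf, 3), keep)
```

In this toy setup the third image lies almost equally close to both text embeddings, so its pseudo-label is flagged as unreliable. The multi-level alignment described in the abstract aims to reduce such errors; the thresholding shown here is only a stand-in.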



