(Un)likelihood Training for Interpretable Embedding

IF 5.4 2区计算机科学 Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

ACM Transactions on Information Systems Pub Date : 2023-11-13 DOI:10.1145/3632752

Jiaxin Wu, Chong-Wah Ngo, Wing-Kwong Chan, Zhijian Hou

{"title":"(Un)likelihood Training for Interpretable Embedding","authors":"Jiaxin Wu, Chong-Wah Ngo, Wing-Kwong Chan, Zhijian Hou","doi":"10.1145/3632752","DOIUrl":null,"url":null,"abstract":"Cross-modal representation learning has become a new normal for bridging the semantic gap between text and visual data. Learning modality agnostic representations in a continuous latent space, however, is often treated as a black-box data-driven training process. It is well-known that the effectiveness of representation learning depends heavily on the quality and scale of training data. For video representation learning, having a complete set of labels that annotate the full spectrum of video content for training is highly difficult, if not impossible. These issues, black-box training and dataset bias, make representation learning practically challenging to be deployed for video understanding due to unexplainable and unpredictable results. In this paper, we propose two novel training objectives, likelihood and unlikelihood functions, to unroll the semantics behind embeddings while addressing the label sparsity problem in training. The likelihood training aims to interpret semantics of embeddings beyond training labels, while the unlikelihood training leverages prior knowledge for regularization to ensure semantically coherent interpretation. With both training objectives, a new encoder-decoder network, which learns interpretable cross-modal representation, is proposed for ad-hoc video search. Extensive experiments on TRECVid and MSR-VTT datasets show that the proposed network outperforms several state-of-the-art retrieval models with a statistically significant performance margin.","PeriodicalId":50936,"journal":{"name":"ACM Transactions on Information Systems","volume":"49 21","pages":"0"},"PeriodicalIF":5.4000,"publicationDate":"2023-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Information Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3632752","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}

引用次数: 0

Abstract

Cross-modal representation learning has become a new normal for bridging the semantic gap between text and visual data. Learning modality agnostic representations in a continuous latent space, however, is often treated as a black-box data-driven training process. It is well-known that the effectiveness of representation learning depends heavily on the quality and scale of training data. For video representation learning, having a complete set of labels that annotate the full spectrum of video content for training is highly difficult, if not impossible. These issues, black-box training and dataset bias, make representation learning practically challenging to be deployed for video understanding due to unexplainable and unpredictable results. In this paper, we propose two novel training objectives, likelihood and unlikelihood functions, to unroll the semantics behind embeddings while addressing the label sparsity problem in training. The likelihood training aims to interpret semantics of embeddings beyond training labels, while the unlikelihood training leverages prior knowledge for regularization to ensure semantically coherent interpretation. With both training objectives, a new encoder-decoder network, which learns interpretable cross-modal representation, is proposed for ad-hoc video search. Extensive experiments on TRECVid and MSR-VTT datasets show that the proposed network outperforms several state-of-the-art retrieval models with a statistically significant performance margin.

查看原文本刊更多论文

可解释嵌入的(非)似然训练

跨模态表示学习已成为弥合文本和视觉数据之间语义差距的新常态。然而，在连续潜在空间中学习模态不可知表示通常被视为一个黑箱数据驱动的训练过程。众所周知，表示学习的有效性在很大程度上取决于训练数据的质量和规模。对于视频表示学习，拥有一套完整的标签来注释用于训练的全部视频内容是非常困难的，如果不是不可能的话。这些问题，黑箱训练和数据集偏差，由于无法解释和不可预测的结果，使得表示学习在视频理解中部署具有实际挑战性。在本文中，我们提出了两个新的训练目标，即似然函数和非似然函数，以揭示嵌入背后的语义，同时解决训练中的标签稀疏性问题。似然训练旨在解释超越训练标签的嵌入语义，而非似然训练利用先验知识进行正则化以确保语义连贯的解释。基于这两个训练目标，提出了一种新的编码器-解码器网络，该网络学习可解释的跨模态表示，用于自组织视频搜索。在TRECVid和MSR-VTT数据集上的大量实验表明，所提出的网络在统计上显著优于几种最先进的检索模型。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

ACM Transactions on Information Systems 工程技术-计算机：信息系统

CiteScore

9.40

自引率

14.30%

发文量

165

审稿时长

>12 weeks

期刊介绍： The ACM Transactions on Information Systems (TOIS) publishes papers on information retrieval (such as search engines, recommender systems) that contain: new principled information retrieval models or algorithms with sound empirical validation; observational, experimental and/or theoretical studies yielding new insights into information retrieval or information seeking; accounts of applications of existing information retrieval techniques that shed light on the strengths and weaknesses of the techniques; formalization of new information retrieval or information seeking tasks and of methods for evaluating the performance on those tasks; development of content (text, image, speech, video, etc) analysis methods to support information retrieval and information seeking; development of computational models of user information preferences and interaction behaviors; creation and analysis of evaluation methodologies for information retrieval and information seeking; or surveys of existing work that propose a significant synthesis. The information retrieval scope of ACM Transactions on Information Systems (TOIS) appeals to industry practitioners for its wealth of creative ideas, and to academic researchers for its descriptions of their colleagues'' work.