Improving Nearest Neighbor Indexing by Multitask Learning

Amorntip Prayoonwong, Ke Zeng, Chih-Yi Chiu
{"title":"Improving Nearest Neighbor Indexing by Multitask Learning","authors":"Amorntip Prayoonwong, Ke Zeng, Chih-Yi Chiu","doi":"10.1145/3549555.3549579","DOIUrl":null,"url":null,"abstract":"In the task of approximate nearest neighbor search, the conventional lookup-table indexing calculates the distances (or similarities) between the query and codewords, and then re-ranks the data points associated with the nearest (or the most similar) codewords. To address the codeword quantization loss problem exhibited in the conventional method, the probability-based indexing leverages the data distribution among codewords learned by neural networks to locate the nearest neighbor [8]. In this paper, we present a multitasking model to improve the probability-based indexing method. The model is formulated by two objectives of NN distribution probabilities and data retrieval quantity. The NN distribution probabilities are an estimation to determine the possible codewords where the nearest neighbor may be associated. The candidate retrieval quantity specifies the prediction for the least number of codewords to be re-ranked for capturing the nearest neighbor. The proposed model is then trained by minimizing triplet loss, probability loss, and quantity loss. By learning these tasks in parallel, we find the predictions for both data distribution probability and data retrieval quantity are more accurate, so that search accuracy and computation efficiency can be improved together. We experiment on two billion-scale benchmark datasets to evaluate the proposed method and compare with several approximate nearest neighbor search methods, and the results demonstrate the outperformance of the proposed method.","PeriodicalId":191591,"journal":{"name":"Proceedings of the 19th International Conference on Content-based Multimedia Indexing","volume":"21 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 19th International Conference on Content-based Multimedia Indexing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3549555.3549579","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

In approximate nearest neighbor search, conventional lookup-table indexing calculates the distances (or similarities) between the query and the codewords, and then re-ranks the data points associated with the nearest (or most similar) codewords. To address the codeword quantization loss exhibited by the conventional method, probability-based indexing leverages the data distribution among codewords, learned by neural networks, to locate the nearest neighbor [8]. In this paper, we present a multitask model to improve the probability-based indexing method. The model is formulated with two objectives: the nearest-neighbor (NN) distribution probabilities and the data retrieval quantity. The NN distribution probabilities estimate the codewords with which the nearest neighbor is likely to be associated. The retrieval quantity predicts the smallest number of codewords that must be re-ranked to capture the nearest neighbor. The model is trained by minimizing a triplet loss, a probability loss, and a quantity loss. By learning these tasks in parallel, we find that the predictions of both the NN distribution probabilities and the retrieval quantity become more accurate, so search accuracy and computational efficiency improve together. We evaluate the proposed method on two billion-scale benchmark datasets and compare it with several approximate nearest neighbor search methods; the results demonstrate that it outperforms the compared methods.
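
As background, the first sentence of the abstract describes the conventional lookup-table (inverted-file) baseline. A minimal NumPy sketch of that baseline, using toy random data and hypothetical parameter choices (codebook size, n_probe, k), might look like this:

```python
import numpy as np

# Toy setup: a random database, a small codebook sampled from it, and
# inverted lists that store each point under its nearest codeword.
rng = np.random.default_rng(0)
data = rng.standard_normal((10_000, 64)).astype(np.float32)
codewords = data[rng.choice(len(data), 256, replace=False)]

# Squared point-to-codeword distances via ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2.
d2 = ((data ** 2).sum(1, keepdims=True)
      - 2.0 * data @ codewords.T
      + (codewords ** 2).sum(1))
assign = d2.argmin(1)
inverted_lists = [np.flatnonzero(assign == c) for c in range(len(codewords))]

def lookup_table_search(query, n_probe=8, k=5):
    """Rank codewords by distance to the query, then re-rank the points
    stored under the n_probe nearest codewords by exact distance."""
    cw_d2 = ((codewords - query) ** 2).sum(1)
    probe = np.argsort(cw_d2)[:n_probe]                    # nearest codewords
    cand = np.concatenate([inverted_lists[c] for c in probe])
    cand_d2 = ((data[cand] - query) ** 2).sum(1)           # exact re-ranking
    return cand[np.argsort(cand_d2)[:k]]                   # top-k candidates

print(lookup_table_search(rng.standard_normal(64).astype(np.float32)))
```

The quantization loss mentioned in the abstract arises here: if the true nearest neighbor is stored under a codeword outside the n_probe probed lists, no amount of re-ranking can recover it.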
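
The abstract specifies the multitask design only at a high level. The following PyTorch sketch is one plausible realization, assuming a shared MLP encoder with a codeword-probability head and a retrieval-quantity head; the loss weights, the cross-entropy and smooth-L1 choices, and the supervision signals (nn_codeword, true_qty) are assumptions for illustration, not the authors' exact formulation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultitaskIndexer(nn.Module):
    """Hypothetical sketch: a shared encoder with two task heads, following
    the abstract's description of predicting (a) the NN distribution over
    codewords and (b) the retrieval quantity (how many codewords to probe)."""

    def __init__(self, dim, n_codewords, hidden=512):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Head 1: logits over codewords, i.e. where the true NN likely resides.
        self.prob_head = nn.Linear(hidden, n_codewords)
        # Head 2: scalar estimate of the least number of codewords to probe.
        self.qty_head = nn.Linear(hidden, 1)

    def forward(self, x):
        h = self.encoder(x)
        return h, self.prob_head(h), self.qty_head(h).squeeze(-1)

def multitask_loss(model, anchor, positive, negative, nn_codeword, true_qty,
                   margin=0.2, w_prob=1.0, w_qty=1.0):
    """Joint objective: a triplet loss on the shared embedding, a cross-entropy
    'probability loss' on the codeword head, and a regression 'quantity loss'
    on the retrieval-quantity head. The weights and margin are assumptions."""
    h_a, logits, qty = model(anchor)
    h_p, _, _ = model(positive)
    h_n, _, _ = model(negative)
    l_triplet = F.triplet_margin_loss(h_a, h_p, h_n, margin=margin)
    l_prob = F.cross_entropy(logits, nn_codeword)   # NN distribution task
    l_qty = F.smooth_l1_loss(qty, true_qty)         # retrieval-quantity task
    return l_triplet + w_prob * l_prob + w_qty * l_qty

# Toy usage on random tensors (shapes only; no real training data).
model = MultitaskIndexer(dim=128, n_codewords=256)
a, p, n = (torch.randn(32, 128) for _ in range(3))
loss = multitask_loss(model, a, p, n,
                      nn_codeword=torch.randint(256, (32,)),
                      true_qty=torch.rand(32) * 10)
loss.backward()
```

At query time, one would rank the codewords by the predicted probabilities and probe roughly the top ⌈qty⌉ of them before exact re-ranking; this is how the two predictions can jointly improve accuracy and efficiency, as the abstract claims.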