An Alternating Optimization Scheme for Binary Sketches

Impact factor: 3.4; Zone 2 (Computer Science); JCR Q2 (COMPUTER SCIENCE, INFORMATION SYSTEMS)

Information Systems; Pub Date: 2025-05-10; DOI: 10.1016/j.is.2025.102563

Erik Thordsen, Erich Schubert

Volume 133, Article 102563. Full text: https://www.sciencedirect.com/science/article/pii/S030643792500047X

Citations: 0

Abstract


Searching for similar objects in intrinsically high-dimensional data sets is a challenging task. Compact sketches have been proposed to speed up similarity search based on linear scans. Binary sketches are one such approach: they seek a good mapping from the original data space to bit strings of a fixed length. These bit strings can be compared efficiently using only a few XOR and bit count operations, replacing costly similarity computations with an inexpensive approximation. We propose a new scheme to initialize and improve binary sketches for similarity search in Euclidean spaces. Our optimization iteratively improves the quality of the sketches with a form of orthogonalization. We provide empirical evidence that the quality of the sketches has a peak beyond which it is correlated with neither bit independence nor bit balance, which contradicts a previous hypothesis in the literature. Regularization in the form of noise added to the training data can turn the peak into a plateau, and applying the optimization in a stochastic fashion, i.e., training on smaller subsets of the data, allows for rapid initialization. We provide a loss function that makes it possible to approximate the same objective using neural network frameworks such as PyTorch, elevating the approach to GPU-based training.
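To make the XOR and bit count comparison concrete, here is a minimal sketch in plain Python. It is not the authors' method: the bits come from random hyperplanes through the origin, a common baseline assumed here purely for illustration, and it performs none of the initialization or orthogonalization described in the abstract. All function names (`random_hyperplanes`, `sketch`, `hamming`, `linear_scan`) are hypothetical.

```python
import random


def random_hyperplanes(dim, bits, seed=0):
    """Draw `bits` Gaussian normal vectors of length `dim`.

    Assumption: random hyperplanes stand in for a learned mapping.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]


def sketch(vector, hyperplanes):
    """Pack one bit per hyperplane into an int.

    Bit i is set when the vector lies on the positive side of hyperplane i.
    """
    code = 0
    for i, normal in enumerate(hyperplanes):
        if sum(v * w for v, w in zip(vector, normal)) > 0.0:
            code |= 1 << i
    return code


def hamming(a, b):
    """Hamming distance of two sketches: XOR, then count the set bits."""
    return bin(a ^ b).count("1")


def linear_scan(query_code, codes, k):
    """Return the indices of the k sketches closest to the query sketch."""
    order = sorted(range(len(codes)), key=lambda i: hamming(query_code, codes[i]))
    return order[:k]
```

In a typical filter-and-refine setup, candidates returned by `linear_scan` would then be re-ranked with the exact Euclidean distance on the original vectors, so the sketch only needs to preserve neighborhoods approximately.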


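Source journal: Information Systems (Engineering & Technology; Computer Science: Information Systems)

CiteScore: 9.40
Self-citation rate: 2.70%
Articles published: 112
Review time: 53 days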

Journal description: Information systems are the software and hardware systems that support data-intensive applications. The journal Information Systems publishes articles concerning the design and implementation of languages, data models, process models, algorithms, software and hardware for information systems. Subject areas include data management issues as presented in the principal international database conferences (e.g., ACM SIGMOD/PODS, VLDB, ICDE and ICDT/EDBT) as well as data-related issues from the fields of data mining/machine learning, information retrieval coordinated with structured data, internet and cloud data management, business process management, web semantics, visual and audio information systems, scientific computing, and data science. Implementation papers having to do with massively parallel data management, fault tolerance in practice, and special purpose hardware for data-intensive systems are also welcome. Manuscripts from application domains, such as urban informatics, social and natural science, and Internet of Things, are also welcome. All papers should highlight innovative solutions to data management problems, such as new data models and performance enhancements, and show how those innovations contribute to the goals of the application.