{"title":"密集型和稀疏型检索器的操作建议:HNSW、平指数还是倒指数?","authors":"Jimmy Lin","doi":"arxiv-2409.06464","DOIUrl":null,"url":null,"abstract":"Practitioners working on dense retrieval today face a bewildering number of\nchoices. Beyond selecting the embedding model, another consequential choice is\nthe actual implementation of nearest-neighbor vector search. While best\npractices recommend HNSW indexes, flat vector indexes with brute-force search\nrepresent another viable option, particularly for smaller corpora and for rapid\nprototyping. In this paper, we provide experimental results on the BEIR dataset\nusing the open-source Lucene search library that explicate the tradeoffs\nbetween HNSW and flat indexes (including quantized variants) from the\nperspectives of indexing time, query evaluation performance, and retrieval\nquality. With additional comparisons between dense and sparse retrievers, our\nresults provide guidance for today's search practitioner in understanding the\ndesign space of dense and sparse retrievers. To our knowledge, we are the first\nto provide operational advice supported by empirical experiments in this\nregard.","PeriodicalId":501281,"journal":{"name":"arXiv - CS - Information Retrieval","volume":"14 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Operational Advice for Dense and Sparse Retrievers: HNSW, Flat, or Inverted Indexes?\",\"authors\":\"Jimmy Lin\",\"doi\":\"arxiv-2409.06464\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Practitioners working on dense retrieval today face a bewildering number of\\nchoices. Beyond selecting the embedding model, another consequential choice is\\nthe actual implementation of nearest-neighbor vector search. While best\\npractices recommend HNSW indexes, flat vector indexes with brute-force search\\nrepresent another viable option, particularly for smaller corpora and for rapid\\nprototyping. In this paper, we provide experimental results on the BEIR dataset\\nusing the open-source Lucene search library that explicate the tradeoffs\\nbetween HNSW and flat indexes (including quantized variants) from the\\nperspectives of indexing time, query evaluation performance, and retrieval\\nquality. With additional comparisons between dense and sparse retrievers, our\\nresults provide guidance for today's search practitioner in understanding the\\ndesign space of dense and sparse retrievers. 
To our knowledge, we are the first\\nto provide operational advice supported by empirical experiments in this\\nregard.\",\"PeriodicalId\":501281,\"journal\":{\"name\":\"arXiv - CS - Information Retrieval\",\"volume\":\"14 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-10\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Information Retrieval\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.06464\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Information Retrieval","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.06464","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Operational Advice for Dense and Sparse Retrievers: HNSW, Flat, or Inverted Indexes?
Practitioners working on dense retrieval today face a bewildering number of
choices. Beyond selecting the embedding model, another consequential choice is
the actual implementation of nearest-neighbor vector search. While best
practices recommend HNSW indexes, flat vector indexes with brute-force search
represent another viable option, particularly for smaller corpora and for rapid
prototyping. In this paper, we provide experimental results on the BEIR dataset
using the open-source Lucene search library that explicate the tradeoffs
between HNSW and flat indexes (including quantized variants) from the
perspectives of indexing time, query evaluation performance, and retrieval
quality. With additional comparisons between dense and sparse retrievers, our
results provide guidance for today's search practitioner in understanding the
design space of dense and sparse retrievers. To our knowledge, we are the first
to provide operational advice supported by empirical experiments in this
regard.
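
As a concrete illustration of the design space the abstract describes, below is a minimal sketch (not drawn from the paper's experimental code) of indexing and searching dense vectors with Lucene's KNN vector API in recent Lucene 9.x releases. The class name, field name, and placeholder embeddings are illustrative assumptions; Lucene builds an HNSW graph for KNN vector fields by default, while the flat (brute-force) and quantized index variants evaluated in the paper are configured differently and are not shown here.

```java
// Minimal sketch: dense vector indexing and approximate nearest-neighbor search
// with Lucene's KNN vector API (assumes a recent Lucene 9.x on the classpath).
import java.io.IOException;
import java.nio.file.Paths;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.KnnFloatVectorField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.KnnFloatVectorQuery;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class DenseRetrievalSketch {
  public static void main(String[] args) throws IOException {
    Directory dir = FSDirectory.open(Paths.get("hnsw-index"));

    // Index a few hypothetical document embeddings. In practice these come from
    // an embedding model; the 4-dimensional vectors here are placeholders.
    try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
      float[][] docVectors = {
        {0.1f, 0.3f, 0.5f, 0.7f},
        {0.9f, 0.1f, 0.2f, 0.4f},
        {0.2f, 0.8f, 0.6f, 0.1f}
      };
      for (float[] vector : docVectors) {
        Document doc = new Document();
        // KnnFloatVectorField values are written to Lucene's vector format,
        // which constructs an HNSW graph by default.
        doc.add(new KnnFloatVectorField("embedding", vector,
            VectorSimilarityFunction.COSINE));
        writer.addDocument(doc);
      }
    }

    // Approximate top-k search over the HNSW graph.
    try (DirectoryReader reader = DirectoryReader.open(dir)) {
      IndexSearcher searcher = new IndexSearcher(reader);
      float[] queryVector = {0.1f, 0.3f, 0.4f, 0.8f};  // placeholder query embedding
      TopDocs topDocs = searcher.search(
          new KnnFloatVectorQuery("embedding", queryVector, 2), 2);
      for (ScoreDoc sd : topDocs.scoreDocs) {
        System.out.println("doc=" + sd.doc + " score=" + sd.score);
      }
    }
  }
}
```

The tradeoff this setup embodies, and that the paper quantifies on BEIR, is roughly: HNSW graph construction adds indexing time in exchange for faster approximate query evaluation, whereas a flat index with brute-force search skips graph construction and exhaustively scores every vector, which can be perfectly adequate for smaller corpora and rapid prototyping.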