Title: Progressive discretization for generative retrieval: A self-supervised approach to high-quality DocID generation
Authors: Shunyu Yao, Jie Hu, Zhiyuan Zhang, Dan Liu
Journal: Neural Networks, Volume 190, Article 107663 (JCR Q1, Computer Science, Artificial Intelligence; Impact Factor 6.0)
DOI: 10.1016/j.neunet.2025.107663
Published: 2025-06-04 (Journal Article)
URL: https://www.sciencedirect.com/science/article/pii/S089360802500543X

Abstract: Generative retrieval is a novel retrieval paradigm in which large language models serve as differentiable indices that memorize and retrieve candidate documents in a generative fashion. This paradigm removes the constraint that documents and queries must be encoded separately, and it outperforms traditional retrieval methods. To support retrieval over large-scale corpora, extensive research has been devoted to devising a discrete, distinguishable document representation: the DocID. However, most DocIDs are built in an unsupervised manner, which introduces uncontrollable information distortion during the discretization stage. In this work, we propose the Self-supervised Progressive Discretization framework (SPD). SPD first distills document information into multi-perspective continuous representations in a self-supervised way. A progressive discretization algorithm then transforms these continuous representations into approximate vectors and discrete DocIDs. The self-supervised model, approximate vectors, and DocIDs are further integrated into a query-side training pipeline to produce an effective generative retriever. Experiments on popular benchmarks demonstrate that SPD builds high-quality, search-oriented DocIDs that achieve state-of-the-art generative retrieval performance.
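The abstract does not spell out SPD's progressive discretization algorithm, but the general idea of turning continuous document representations into short discrete codes can be illustrated with residual quantization, a common scheme in related generative-retrieval work. The sketch below is purely illustrative (all function names, the codebook sizes, and the toy embeddings are assumptions, not the paper's method): each stage fits a small codebook with plain k-means, records the nearest centroid as one DocID token, and passes the residual to the next stage, so later stages progressively refine what earlier stages could not capture.

```python
import numpy as np

def train_codebooks(vectors, n_stages=3, k=4, iters=10, seed=0):
    """Fit one small k-means codebook per stage on the running residuals."""
    rng = np.random.default_rng(seed)
    residual = vectors.astype(float).copy()
    codebooks = []
    for _ in range(n_stages):
        # initialize centroids from random residual vectors
        cb = residual[rng.choice(len(residual), k, replace=False)].copy()
        for _ in range(iters):
            dist = np.linalg.norm(residual[:, None, :] - cb[None, :, :], axis=-1)
            idx = dist.argmin(axis=1)
            for j in range(k):
                if np.any(idx == j):          # skip empty clusters
                    cb[j] = residual[idx == j].mean(axis=0)
        codebooks.append(cb)
        # subtract the assigned centroid; the next stage models what's left
        idx = np.linalg.norm(residual[:, None, :] - cb[None, :, :],
                             axis=-1).argmin(axis=1)
        residual = residual - cb[idx]
    return codebooks

def residual_quantize(vectors, codebooks):
    """Assign each vector a discrete code sequence (a DocID) by greedily
    quantizing the residual at each stage."""
    residual = vectors.astype(float).copy()
    codes = []
    for cb in codebooks:
        dist = np.linalg.norm(residual[:, None, :] - cb[None, :, :], axis=-1)
        idx = dist.argmin(axis=1)
        codes.append(idx)
        residual = residual - cb[idx]
    return np.stack(codes, axis=1)            # shape: (n_docs, n_stages)

# toy stand-in for learned document embeddings
docs = np.random.default_rng(1).normal(size=(32, 8))
cbs = train_codebooks(docs)
docids = residual_quantize(docs, cbs)
print(docids.shape)  # (32, 3): each document gets a 3-token discrete DocID
```

In a generative retriever, such code sequences would serve as the target outputs the language model learns to generate for a query; SPD's contribution, per the abstract, is to control the information loss of this discretization step via self-supervised training, which this toy k-means sketch does not attempt.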
Journal introduction:
Neural Networks is a platform that aims to foster an international community of scholars and practitioners interested in neural networks, deep learning, and other approaches to artificial intelligence and machine learning. Our journal invites submissions covering various aspects of neural networks research, from computational neuroscience and cognitive modeling to mathematical analyses and engineering applications. By providing a forum for interdisciplinary discussions between biology and technology, we aim to encourage the development of biologically-inspired artificial intelligence.