{"title":"Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models","authors":"Michael Günther, Isabelle Mohr, Bo Wang, Han Xiao","doi":"arxiv-2409.04701","DOIUrl":null,"url":null,"abstract":"Many use cases require retrieving smaller portions of text, and dense\nvector-based retrieval systems often perform better with shorter text segments,\nas the semantics are less likely to be \"over-compressed\" in the embeddings.\nConsequently, practitioners often split text documents into smaller chunks and\nencode them separately. However, chunk embeddings created in this way can lose\ncontextual information from surrounding chunks, resulting in suboptimal\nrepresentations. In this paper, we introduce a novel method called \"late\nchunking,\" which leverages long context embedding models to first embed all\ntokens of the long text, with chunking applied after the transformer model and\njust before mean pooling. The resulting chunk embeddings capture the full\ncontextual information, leading to superior results across various retrieval\ntasks without the need for additional training. Moreover, our method is generic\nenough to be applied to any long-context embedding model.","PeriodicalId":501281,"journal":{"name":"arXiv - CS - Information Retrieval","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Information Retrieval","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.04701","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Many use cases require retrieving smaller portions of text, and dense vector-based retrieval systems often perform better with shorter text segments, as the semantics are less likely to be "over-compressed" in the embeddings. Consequently, practitioners often split text documents into smaller chunks and encode them separately. However, chunk embeddings created in this way can lose contextual information from surrounding chunks, resulting in suboptimal representations. In this paper, we introduce a novel method called "late chunking," which leverages long-context embedding models to first embed all tokens of the long text, with chunking applied after the transformer model and just before mean pooling. The resulting chunk embeddings capture the full contextual information, leading to superior results across various retrieval tasks without the need for additional training. Moreover, our method is generic enough to be applied to any long-context embedding model.
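To make the mechanism concrete, here is a minimal sketch of late chunking in Python, assuming a Hugging Face long-context embedding model. The model name, the `late_chunking` helper, and the token-span boundaries are illustrative assumptions, not the authors' reference implementation:

```python
# Minimal sketch of late chunking (assumptions noted below), not the
# authors' reference code. Assumes a Hugging Face long-context embedding
# model; "jinaai/jina-embeddings-v2-base-en" is one such model (8k tokens).
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "jinaai/jina-embeddings-v2-base-en"  # assumed choice of model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL_NAME, trust_remote_code=True)

def late_chunking(text: str, chunk_spans: list[tuple[int, int]]) -> torch.Tensor:
    """Embed all tokens of `text` in one forward pass, then mean-pool each
    (start, end) token span into its own chunk embedding.

    `chunk_spans` are token-index boundaries; in practice one would derive
    them from sentence or paragraph boundaries via the tokenizer's
    offset mapping.
    """
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        # Token embeddings contextualized over the *entire* document.
        token_embeddings = model(**inputs).last_hidden_state[0]  # (seq_len, dim)
    # Chunking happens here, after the transformer and before pooling.
    chunk_vectors = [token_embeddings[s:e].mean(dim=0) for s, e in chunk_spans]
    return torch.stack(chunk_vectors)

# Example usage with two hypothetical chunks of 120 tokens each:
# vecs = late_chunking(document_text, [(0, 120), (120, 240)])
```

Unlike naive chunking, where each chunk is encoded in isolation, every chunk vector here is pooled from token representations that attended to the whole document, which is how the surrounding context is preserved without any additional training.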