jina-embeddings-v3: Multilingual Embeddings With Task LoRA

arXiv - CS - Information Retrieval. Pub Date: 2024-09-16. DOI: arxiv-2409.10173

Saba Sturua, Isabelle Mohr, Mohammad Kalim Akram, Michael Günther, Bo Wang, Markus Krimmel, Feng Wang, Georgios Mastrapas, Andreas Koukounas, Nan Wang, Han Xiao

Citations: 0

Abstract

We introduce jina-embeddings-v3, a novel text embedding model with 570 million parameters that achieves state-of-the-art performance on multilingual data and long-context retrieval tasks and supports context lengths of up to 8192 tokens. The model includes a set of task-specific Low-Rank Adaptation (LoRA) adapters to generate high-quality embeddings for query-document retrieval, clustering, classification, and text matching. Additionally, Matryoshka Representation Learning is integrated into the training process, allowing flexible truncation of embedding dimensions without compromising performance. Evaluation on the MTEB benchmark shows that jina-embeddings-v3 outperforms the latest proprietary embeddings from OpenAI and Cohere on English tasks, and outperforms multilingual-e5-large-instruct across all multilingual tasks.
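The truncation that Matryoshka Representation Learning enables can be illustrated with a minimal sketch. It assumes the usual practice of keeping the leading components of an embedding and rescaling the result to unit length before comparing vectors. The function names and the toy 8-dimensional vectors are hypothetical and are not taken from the paper or from the model's API.

```python
import math


def truncate_embedding(vec, dim):
    """Keep the first `dim` components and rescale to unit length.

    Hypothetical helper for illustration; not part of any library.
    """
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        # An all-zero prefix cannot be normalized; return it unchanged.
        return list(head)
    return [x / norm for x in head]


def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for real model outputs.
query = [0.9, 0.3, 0.1, 0.05, 0.02, 0.01, 0.0, 0.0]
doc = [0.8, 0.4, 0.2, 0.0, 0.03, 0.0, 0.01, 0.0]

full_score = cosine(query, doc)
short_score = cosine(truncate_embedding(query, 4), truncate_embedding(doc, 4))
print(f"full: {full_score:.3f}  truncated to 4 dims: {short_score:.3f}")
```

With a model trained this way, the leading dimensions are meant to carry most of the signal, so the truncated score should stay close to the full one while storage and comparison cost shrink with the dimension.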


