N. Gerasimenko, A. Vatolin, A. Ianina, K. Vorontsov
{"title":"科学文本的小而强大的多语言编码器","authors":"N. Gerasimenko, A. Vatolin, A. Ianina, K. Vorontsov","doi":"10.1134/S1064562424602178","DOIUrl":null,"url":null,"abstract":"<p>LLM-based representation learning is widely used to build effective information retrieval systems, including scientific domains. For making science more open and affordable, it is important that these systems support multilingual (and cross-lingual) search and do not require significant computational power. To address this we propose SciRus-tiny, light multilingual encoder trained from scratch on 44 M abstracts (15B tokens) of research papers and then tuned in a contrastive manner using citation data. SciRus-tiny outperforms SciNCL, English-only SOTA-model for scientific texts, on 13/24 tasks, achieving SOTA on 7, from SciRepEval benchmark. Furthermore, SciRus-tiny is much more effective than SciNCL: it is almost 5x smaller (23 M parameters vs. 110 M), having approximately 2x smaller embeddings (312 vs. 768) and 2x bigger context length (1024 vs. 512). In addition to the tiny model, we also propose the SciRus-small (61 M parameters and 768 embeddings size), which is more powerful and can be used for complicated downstream tasks. We further study different ways of contrastive pre-training and demonstrate that almost SOTA results can be achieved without citation information, operating with only title-abstract pairs.</p>","PeriodicalId":531,"journal":{"name":"Doklady Mathematics","volume":"110 1 supplement","pages":"S193 - S202"},"PeriodicalIF":0.5000,"publicationDate":"2025-03-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://link.springer.com/content/pdf/10.1134/S1064562424602178.pdf","citationCount":"0","resultStr":"{\"title\":\"SciRus: Tiny and Powerful Multilingual Encoder for Scientific Texts\",\"authors\":\"N. Gerasimenko, A. Vatolin, A. Ianina, K. Vorontsov\",\"doi\":\"10.1134/S1064562424602178\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>LLM-based representation learning is widely used to build effective information retrieval systems, including scientific domains. For making science more open and affordable, it is important that these systems support multilingual (and cross-lingual) search and do not require significant computational power. To address this we propose SciRus-tiny, light multilingual encoder trained from scratch on 44 M abstracts (15B tokens) of research papers and then tuned in a contrastive manner using citation data. SciRus-tiny outperforms SciNCL, English-only SOTA-model for scientific texts, on 13/24 tasks, achieving SOTA on 7, from SciRepEval benchmark. Furthermore, SciRus-tiny is much more effective than SciNCL: it is almost 5x smaller (23 M parameters vs. 110 M), having approximately 2x smaller embeddings (312 vs. 768) and 2x bigger context length (1024 vs. 512). In addition to the tiny model, we also propose the SciRus-small (61 M parameters and 768 embeddings size), which is more powerful and can be used for complicated downstream tasks. 
We further study different ways of contrastive pre-training and demonstrate that almost SOTA results can be achieved without citation information, operating with only title-abstract pairs.</p>\",\"PeriodicalId\":531,\"journal\":{\"name\":\"Doklady Mathematics\",\"volume\":\"110 1 supplement\",\"pages\":\"S193 - S202\"},\"PeriodicalIF\":0.5000,\"publicationDate\":\"2025-03-22\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://link.springer.com/content/pdf/10.1134/S1064562424602178.pdf\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Doklady Mathematics\",\"FirstCategoryId\":\"100\",\"ListUrlMain\":\"https://link.springer.com/article/10.1134/S1064562424602178\",\"RegionNum\":4,\"RegionCategory\":\"数学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"MATHEMATICS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Doklady Mathematics","FirstCategoryId":"100","ListUrlMain":"https://link.springer.com/article/10.1134/S1064562424602178","RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"MATHEMATICS","Score":null,"Total":0}
SciRus: Tiny and Powerful Multilingual Encoder for Scientific Texts
LLM-based representation learning is widely used to build effective information retrieval systems, including those for scientific domains. To make science more open and affordable, it is important that such systems support multilingual (and cross-lingual) search and do not require significant computational power. To address this, we propose SciRus-tiny, a light multilingual encoder trained from scratch on 44 M abstracts (15 B tokens) of research papers and then tuned in a contrastive manner using citation data. SciRus-tiny outperforms SciNCL, the English-only SOTA model for scientific texts, on 13 of 24 tasks from the SciRepEval benchmark, achieving SOTA on 7 of them. Furthermore, SciRus-tiny is far more efficient than SciNCL: it is almost 5x smaller (23 M parameters vs. 110 M), with roughly 2x smaller embeddings (312 vs. 768 dimensions) and a 2x longer context (1024 vs. 512 tokens). In addition to the tiny model, we also propose SciRus-small (61 M parameters, 768-dimensional embeddings), which is more powerful and can be used for complicated downstream tasks. We further study different ways of contrastive pre-training and demonstrate that near-SOTA results can be achieved without citation information, operating only on title-abstract pairs.
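To illustrate the kind of contrastive objective the abstract refers to, the sketch below shows a minimal in-batch InfoNCE loss over title-abstract pairs in PyTorch. It is an assumption-laden illustration, not the authors' implementation: the encoder, batching, and temperature value are hypothetical.

```python
import torch
import torch.nn.functional as F

def title_abstract_infonce(title_emb: torch.Tensor,
                           abstract_emb: torch.Tensor,
                           temperature: float = 0.05) -> torch.Tensor:
    """In-batch contrastive (InfoNCE) loss over title-abstract pairs.

    title_emb, abstract_emb: [batch, dim] embeddings from the same encoder;
    the i-th title and i-th abstract come from the same paper (positive pair),
    while all other abstracts in the batch act as negatives.
    The temperature is an illustrative choice, not a value from the paper.
    """
    title_emb = F.normalize(title_emb, dim=-1)
    abstract_emb = F.normalize(abstract_emb, dim=-1)
    # cosine-similarity logits between every title and every abstract in the batch
    logits = title_emb @ abstract_emb.T / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # symmetric loss: match titles to abstracts and abstracts to titles
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2
```

The same objective can be driven by citation data instead, with cited or co-cited papers supplying the positive pairs; the abstract reports that the title-abstract variant alone already reaches near-SOTA quality.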
Journal description:
Doklady Mathematics is a journal of the Presidium of the Russian Academy of Sciences. It contains English translations of papers published in Doklady Akademii Nauk (Proceedings of the Russian Academy of Sciences), which was founded in 1933 and is published 36 times a year. Doklady Mathematics covers the following areas: mathematics, mathematical physics, computer science, control theory, and computers. It publishes brief scientific reports on previously unpublished, significant new research in mathematics and its applications. The main contributors to the journal are Members of the RAS, Corresponding Members of the RAS, and scientists from the former Soviet Union and other countries, including outstanding Russian mathematicians.