N. Gerasimenko, A. Vatolin, A. Ianina, K. Vorontsov
Title: SciRus: Tiny and Powerful Multilingual Encoder for Scientific Texts
Journal: Doklady Mathematics, Vol. 110, No. 1 Supplement, pp. S193–S202 (JCR Q3, Mathematics)
DOI: 10.1134/S1064562424602178
Published: 2025-03-22 (Journal Article)
Article page: https://link.springer.com/article/10.1134/S1064562424602178
PDF: https://link.springer.com/content/pdf/10.1134/S1064562424602178.pdf
Citations: 0
Abstract
LLM-based representation learning is widely used to build effective information retrieval systems, including in scientific domains. To make science more open and affordable, it is important that these systems support multilingual (and cross-lingual) search and do not require significant computational power. To address this, we propose SciRus-tiny, a light multilingual encoder trained from scratch on 44 M abstracts (15 B tokens) of research papers and then tuned in a contrastive manner using citation data. SciRus-tiny outperforms SciNCL, the English-only SOTA model for scientific texts, on 13 of 24 tasks from the SciRepEval benchmark, achieving SOTA on 7. Furthermore, SciRus-tiny is much more efficient than SciNCL: it is almost 5x smaller (23 M parameters vs. 110 M), with approximately 2x smaller embeddings (312 vs. 768 dimensions) and a 2x longer context (1024 vs. 512 tokens). In addition to the tiny model, we also propose SciRus-small (61 M parameters, 768-dimensional embeddings), which is more powerful and can be used for complicated downstream tasks. We further study different ways of contrastive pre-training and demonstrate that nearly SOTA results can be achieved without citation information, operating with only title-abstract pairs.
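The abstract's closing claim, that contrastive tuning can work from title-abstract pairs alone, rests on a standard in-batch-negatives objective: each title should embed close to its own abstract and far from the other abstracts in the batch. The paper does not publish its loss code here, so the following is only a minimal sketch of that general technique (an InfoNCE-style loss over precomputed embeddings); the temperature value, batch size, and 312-dimensional vectors matching SciRus-tiny are illustrative assumptions, not the authors' settings.

```python
import numpy as np

def info_nce_loss(title_emb, abstract_emb, temperature=0.05):
    """In-batch contrastive loss: row i of title_emb is the query,
    abstract i is its positive, all other abstracts are negatives."""
    # L2-normalize so the dot product is cosine similarity
    t = title_emb / np.linalg.norm(title_emb, axis=1, keepdims=True)
    a = abstract_emb / np.linalg.norm(abstract_emb, axis=1, keepdims=True)
    # (i, j) entry: similarity of title i to abstract j, scaled by temperature
    logits = (t @ a.T) / temperature
    # log-softmax over each row, with a max-shift for numerical stability
    logits -= logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # cross-entropy against the diagonal (the matching abstract)
    return -np.mean(np.diag(log_probs))

# Toy batch: 8 paired embeddings, 312-dim as in SciRus-tiny;
# each "abstract" is a noisy copy of its "title" to mimic paired texts.
rng = np.random.default_rng(0)
titles = rng.normal(size=(8, 312))
abstracts = titles + 0.1 * rng.normal(size=(8, 312))
loss = info_nce_loss(titles, abstracts)
```

Correctly aligned pairs yield a much lower loss than a batch whose pairing is scrambled, which is exactly the signal the encoder is tuned on.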
About the journal:
Doklady Mathematics is a journal of the Presidium of the Russian Academy of Sciences. It contains English translations of papers published in Doklady Akademii Nauk (Proceedings of the Russian Academy of Sciences), which was founded in 1933 and is published 36 times a year. Doklady Mathematics covers the following areas: mathematics, mathematical physics, computer science, control theory, and computers. It publishes brief scientific reports on previously unpublished, significant new research in mathematics and its applications. The main contributors to the journal are Members of the RAS, Corresponding Members of the RAS, and scientists from the former Soviet Union and other countries, among them outstanding Russian mathematicians.