Omar Zatarain, Juan Carlos González-Castolo, Silvia Ramos-Cabral
Title: A method for semantic textual similarity on long texts
DOI: 10.7717/peerj-cs.3202 (https://doi.org/10.7717/peerj-cs.3202)
Journal: PeerJ Computer Science, vol. 11, e3202 (JCR Q2, Computer Science, Artificial Intelligence; impact factor 2.5)
Published: 2025-09-19 (eCollection 2025)
Open access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12453783/pdf/
Citations: 0
Abstract
This work introduces a method for the semantic similarity of long documents using sentence transformers and large language models. The method detects relevant information from a pair of long texts by exploiting sentence transformers and large language models. The degree of similarity is obtained with an analytical fuzzy strategy that enables selective iterative retrieval under noisy conditions. The method discards the least similar pairs of sentences and selects the most similar. The preprocessing consists of splitting texts into sentences. The analytical strategy classifies pairs of texts by a degree of similarity without prior training on a dataset of long documents. Instead, it uses pre-trained models with any token capacity; a set of fuzzy parameters is tuned over a few assessment iterations, and the parameters are updated based on criteria to detect four classes of similarity: identical, same topic, concept related, and non-related. This method can be employed with both small sentence transformers and large language models to detect similarity between pairs of documents of arbitrary size, avoiding truncation of texts by testing pairs of sentences. A dataset of long texts in English from Wikipedia and other public sources, jointly with its gold standard, is provided and reviewed to test the method's performance. The method's performance is tested with small-token-size sentence transformers, large language models (LLMs), and text pairs split into sentences. Results show that smaller sentence transformers are reliable for obtaining the similarity of long texts and indicate this method is an economical alternative to the increasing need for larger language models to find the degree of similarity between two long texts and extract the relevant information. Code and datasets are available at: https://github.com/omarzatarain/long-texts-similarity. Results of the adjustment of parameters can be found at https://doi.org/10.6084/m9.figshare.29082791.
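The pipeline the abstract describes (split both texts into sentences, score sentence pairs, discard the least similar pairs, and map the aggregate score to one of the four similarity classes) can be sketched as below. This is a minimal illustration, not the authors' implementation: the `embed` stub stands in for a real sentence-transformer encoder, and the `keep_ratio` and crisp `thresholds` values are assumptions standing in for the paper's tuned fuzzy parameters.

```python
import hashlib
import numpy as np

def embed(sentences):
    # Stand-in for a sentence-transformer encoder (assumption for the
    # sketch): deterministic pseudo-random unit vectors keyed on the
    # sentence text, so identical sentences get identical embeddings.
    vecs = []
    for s in sentences:
        seed = int(hashlib.md5(s.encode()).hexdigest(), 16) % (2**32)
        v = np.random.default_rng(seed).standard_normal(16)
        vecs.append(v / np.linalg.norm(v))
    return np.array(vecs)

def document_similarity(text_a, text_b, keep_ratio=0.5):
    # Preprocessing: split each long text into sentences (naive split on
    # periods here; the paper's preprocessing may differ).
    sa = [s.strip() for s in text_a.split(".") if s.strip()]
    sb = [s.strip() for s in text_b.split(".") if s.strip()]
    ea, eb = embed(sa), embed(sb)
    # Cosine similarity of every sentence pair (embeddings are unit-norm,
    # so a dot product suffices).
    sim = ea @ eb.T
    # For each sentence in A take its best match in B, then discard the
    # least similar pairs and average over the most similar remainder.
    best = np.sort(sim.max(axis=1))
    kept = best[int(len(best) * (1 - keep_ratio)):]
    return float(kept.mean())

def classify(score, thresholds=(0.95, 0.7, 0.4)):
    # Illustrative crisp cut-offs standing in for the fuzzy strategy's
    # four similarity classes (threshold values are assumptions).
    ident, topic, related = thresholds
    if score >= ident:
        return "identical"
    if score >= topic:
        return "same topic"
    if score >= related:
        return "concept related"
    return "non-related"
```

Because only sentence pairs are ever encoded, no document-level truncation occurs regardless of the encoder's token limit, which is the property the method exploits to use small sentence transformers on long texts.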
About the journal:
PeerJ Computer Science is an open access journal covering all subject areas in computer science, backed by a prestigious advisory board and more than 300 academic editors.