Omar Zatarain, Juan Carlos González-Castolo, Silvia Ramos-Cabral
Title: A method for semantic textual similarity on long texts
DOI: 10.7717/peerj-cs.3202 (https://doi.org/10.7717/peerj-cs.3202)
Journal: PeerJ Computer Science, vol. 11, e3202 (JCR Q2, Computer Science, Artificial Intelligence; impact factor 2.5)
Published: 2025-09-19 (eCollection 2025)
Open access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12453783/pdf/
Citations: 0
Abstract
This work introduces a method for the semantic similarity of long documents using sentence transformers and large language models. The method detects relevant information from a pair of long texts by exploiting sentence transformers and large language models. The degree of similarity is obtained with an analytical fuzzy strategy that enables selective iterative retrieval under noisy conditions. The method discards the least similar pairs of sentences and selects the most similar. The preprocessing consists of splitting texts into sentences. The analytical strategy classifies pairs of texts by a degree of similarity without prior training on a dataset of long documents. Instead, it uses pre-trained models with any token capacity; a set of fuzzy parameters is tuned over a few assessment iterations, and the parameters are updated based on criteria to detect four classes of similarity: identical, same topic, concept related, and non-related. This method can be employed with both small sentence transformers and large language models to detect similarity between pairs of documents of arbitrary size, avoiding truncation of texts by testing pairs of sentences. A dataset of long texts in English from Wikipedia and other public sources, jointly with its gold standard, is provided and reviewed to test the method's performance. The method's performance is tested with small-token-size sentence transformers, large language models (LLMs), and text pairs split into sentences. Results show that smaller sentence transformers are reliable for obtaining the similarity of long texts and indicate this method is an economical alternative to the increasing need for larger language models to find the degree of similarity between two long texts and extract the relevant information. Code and datasets are available at: https://github.com/omarzatarain/long-texts-similarity. Results of the adjustment of parameters can be found at https://doi.org/10.6084/m9.figshare.29082791.
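The pipeline the abstract describes (split both texts into sentences, score sentence pairs, discard the least similar pairs, and map the aggregate score to one of the four similarity classes) can be sketched as below. This is a minimal illustration, not the authors' implementation: the `embed` stub stands in for a real sentence-transformer encoder, and the `keep_ratio` and crisp `thresholds` values are assumptions standing in for the paper's tuned fuzzy parameters.

```python
import hashlib
import numpy as np

def embed(sentences):
    # Stand-in for a sentence-transformer encoder (assumption for the
    # sketch): deterministic pseudo-random unit vectors keyed on the
    # sentence text, so identical sentences get identical embeddings.
    vecs = []
    for s in sentences:
        seed = int(hashlib.md5(s.encode()).hexdigest(), 16) % (2**32)
        v = np.random.default_rng(seed).standard_normal(16)
        vecs.append(v / np.linalg.norm(v))
    return np.array(vecs)

def document_similarity(text_a, text_b, keep_ratio=0.5):
    # Preprocessing: split each long text into sentences (naive split on
    # periods here; the paper's preprocessing may differ).
    sa = [s.strip() for s in text_a.split(".") if s.strip()]
    sb = [s.strip() for s in text_b.split(".") if s.strip()]
    ea, eb = embed(sa), embed(sb)
    # Cosine similarity of every sentence pair (embeddings are unit-norm,
    # so a dot product suffices).
    sim = ea @ eb.T
    # For each sentence in A take its best match in B, then discard the
    # least similar pairs and average over the most similar remainder.
    best = np.sort(sim.max(axis=1))
    kept = best[int(len(best) * (1 - keep_ratio)):]
    return float(kept.mean())

def classify(score, thresholds=(0.95, 0.7, 0.4)):
    # Illustrative crisp cut-offs standing in for the fuzzy strategy's
    # four similarity classes (threshold values are assumptions).
    ident, topic, related = thresholds
    if score >= ident:
        return "identical"
    if score >= topic:
        return "same topic"
    if score >= related:
        return "concept related"
    return "non-related"
```

Because only sentence pairs are ever encoded, no document-level truncation occurs regardless of the encoder's token limit, which is the property the method exploits to use small sentence transformers on long texts.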
About the journal:
PeerJ Computer Science is an open access journal covering all subject areas in computer science, backed by a prestigious advisory board and more than 300 academic editors.