Evaluating the Efficacy of Semantic Similarity Methods for Comparing Academic Thesis and Dissertation Texts

Science Journal of University of Zakho. Pub Date: 2023-08-14. DOI: 10.25271/sjuoz.2023.11.3.1120

Ramadan T. Hassan, N. S. Ahmed


Citations: 0

Abstract


EVALUATING OF EFFICACY SEMANTIC SIMILARITY METHODS FOR COMPARISON OF ACADEMIC THESIS AND DISSERTATION TEXTS

Detecting semantic similarity between documents is vital in natural language processing (NLP) applications. One widely used way to measure the semantic similarity of text documents is embedding, which converts texts into numerical vectors using various NLP methods. This paper presents a comparative analysis of four embedding methods for detecting semantic similarity in theses and dissertations: Term Frequency–Inverse Document Frequency (TF-IDF), Document to Vector (Doc2Vec), Sentence Bidirectional Encoder Representations from Transformers (SBERT), and Bidirectional Encoder Representations from Transformers (BERT) with cosine similarity. The study used two datasets: 27 documents from Duhok Polytechnic University and 100 documents from ProQuest.com. The texts from these documents were pre-processed to make them suitable for semantic similarity analysis. The methods were evaluated on several metrics: accuracy, precision, recall, F1 score, and processing time. The results showed that the traditional method, TF-IDF, outperformed the modern methods at embedding documents and detecting actual semantic similarity between them, with a processing time of no more than a few seconds.
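The abstract does not give the pre-processing steps, the decision threshold, or the libraries used, so the following is only a minimal sketch of the general idea: embed documents as TF-IDF vectors, compare them with cosine similarity, and score the resulting decisions with accuracy, precision, recall, and F1. It assumes cosine similarity is applied to the TF-IDF vectors. The whitespace tokenizer, the smoothed IDF formula, the toy documents, the gold labels, and the 0.5 threshold are all assumptions made for illustration and are not taken from the paper.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for real pre-processing)."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        # Smoothed IDF (an assumed variant) keeps every weight positive.
        vectors.append({
            term: (count / total) * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary similar/not-similar labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy corpus and hypothetical gold labels (True = the pair is semantically similar).
docs = [
    "semantic similarity of thesis documents using text embedding",
    "text embedding for semantic similarity of dissertation documents",
    "soil moisture measurements in irrigated wheat fields",
]
gold = {(0, 1): True, (0, 2): False, (1, 2): False}
THRESHOLD = 0.5  # assumed cut-off; the abstract does not state one

vectors = tfidf_vectors(docs)
scores = {(i, j): cosine_similarity(vectors[i], vectors[j]) for (i, j) in gold}
predicted = [scores[pair] >= THRESHOLD for pair in gold]
metrics = classification_metrics(list(gold.values()), predicted)

print(scores)
print(metrics)
```

The same evaluation loop would apply to Doc2Vec, SBERT, or BERT embeddings by swapping out `tfidf_vectors` for a function that returns dense vectors; only the embedding step changes, while the cosine comparison and the metrics stay the same.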

