Software System of Checking for Plagiarism of Ukrainian Texts

NaUKMA Research Papers. Computer Science. Pub Date: 2023-02-24. DOI: 10.18523/2617-3808.2022.5.16-25

A. Hlybovets, Mykola Bikchentaev



Abstract


This work describes a methodology for building a software system (application) that checks scientific publications in the Ukrainian language for plagiarism using two machine learning models, Word2Vec and BERT. We consider the detection of external plagiarism in Ukrainian texts.

Plagiarism is usually defined as passing off someone else’s ideas as one’s own. As the Internet becomes more accessible every day, a huge amount of data becomes available to people, and it is now quite easy to find a suitable study and plagiarize it instead of developing one’s own from scratch. Plagiarism undermines the efforts of the researcher whose work has been plagiarized and gives the plagiarist undeserved credit; such a person can do harm when appointed to an important position.

Many fields are susceptible to plagiarism, including research and education. Plagiarism can also take many forms, from straight copy-paste to paraphrasing and sentence restructuring. This makes plagiarism detection a rather complex problem, where methods based on finding shared words between documents, such as longest common subsequence or n-grams, might not work. Therefore, we consider applying deep learning to plagiarism detection.

In this article we discuss the concept of plagiarism and list its types. Two machine learning models are proposed for plagiarism detection: Word2Vec and BERT. We give an overview of both models and describe how they can be used for plagiarism detection.

A web application for plagiarism detection in the Ukrainian language has been developed. It uses React, a JavaScript library, on the frontend and Python on the backend, and stores application data in MongoDB. The application lets a user input a text, which is compared with the texts in the application database using cosine similarity or Euclidean distance as the metric. The comparison is performed on word embeddings calculated by a pre-trained BERT or Word2Vec model. The user can choose the model and the similarity metric in the application’s UI.

The application can be further improved to not only output a similarity metric but also highlight the similar sentences in the texts.
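
The comparison step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the tiny `TOY_VECTORS` table is a hypothetical stand-in for a pre-trained Word2Vec or BERT model, mean pooling of word vectors is an assumed way of obtaining one vector per text, and the in-memory `corpus` dictionary stands in for the application database. The function names `embed` and `rank` are also chosen here for illustration only.

```python
import math

# Hypothetical 3-dimensional word vectors; a real system would load
# embeddings from a pre-trained Word2Vec or BERT model instead.
TOY_VECTORS = {
    "кіт": [0.9, 0.1, 0.0],
    "кішка": [0.8, 0.2, 0.1],
    "спить": [0.1, 0.9, 0.0],
    "дрімає": [0.2, 0.8, 0.1],
    "ринок": [0.0, 0.1, 0.9],
    "зростає": [0.1, 0.0, 0.8],
}


def embed(text):
    """Mean-pool the vectors of known words into one text vector (assumed pooling)."""
    vectors = [TOY_VECTORS[w] for w in text.lower().split() if w in TOY_VECTORS]
    if not vectors:
        raise ValueError("text contains no words with known vectors")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between a and b."""
    return math.dist(a, b)


def rank(query, corpus, metric="cosine"):
    """Score every stored text against the query, most similar first."""
    q = embed(query)
    scored = []
    for doc_id, text in corpus.items():
        d = embed(text)
        if metric == "cosine":
            scored.append((doc_id, cosine_similarity(q, d)))
        elif metric == "euclidean":
            scored.append((doc_id, euclidean_distance(q, d)))
        else:
            raise ValueError(f"unknown metric: {metric}")
    # Higher cosine similarity means closer; lower Euclidean distance means closer.
    return sorted(scored, key=lambda pair: pair[1], reverse=(metric == "cosine"))


# In-memory stand-in for the application database.
corpus = {
    "doc_cat": "кішка дрімає",
    "doc_market": "ринок зростає",
}

for metric in ("cosine", "euclidean"):
    print(metric, rank("кіт спить", corpus, metric))
```

The query "кіт спить" shares no words with "кішка дрімає", so a method that counts shared words would find no overlap, yet both metrics rank `doc_cat` first because the toy vectors of the paraphrased words lie close together. The two metrics sort in opposite directions, which is why `rank` flips the sort order depending on the chosen metric.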
