A Dual Approach to Establishing the Authority of Technical Natural Language Texts and Their Components

Science and Transport Progress. Pub Date: 2023-06-05. DOI: 10.15802/stp2023/288958

V. Shynkarenko, I. Demidovich, O. S. Kuropiatnyk

Abstract

Purpose. The study tests the hypothesis that plagiarism can be detected by authorship attribution methods, without using a text bank and without directly comparing texts.

Methodology. Constructive and productive models of the authorship attribution process for technical texts were developed for two methods. The first method builds a model of the text as a set of formal substitution rules with probabilistic weights (as in stochastic formal grammars), which reflects the syntactic features and patterns of the author's writing. The similarity between the text under study and another text is determined by comparing their models. The second method is the classical approach to detecting borrowings (plagiarism): the text under study is compared directly with an existing text bank, repeated fragments are highlighted, and the degree of originality is determined. Experiments were conducted to establish the correlation between the results of the two methods. The experimental base consisted of 509 text sections from theses of students majoring in "Software Engineering".

Findings. The experiments established a high correlation between the results of the two methods. Correlation coefficients ranged from 0.75 to 1.0, with a mean of 0.88, provided that borrowings are counted only for text fragments at least five words long.

Originality. For the first time, the authors have identified the possibilities of, and proposed methods for, indirect plagiarism detection without a large text bank. The essence of the model is to formalize the author's sentence syntax as a set of substitution rules with probabilistic weights.

Practical value. The results expand the possibilities for detecting borrowings and increase the effectiveness of the corresponding methods. Recommendations on the parameters of classical borrowing detection methods were obtained; in particular, counting only text fragments at least five words long is recommended as a rational setting for borrowing detection systems. Authorship attribution methods previously tested on fiction are extended to technical texts.
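
The two methods described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the tokenizer, the `word_class` categories, the bigram-style rules, and the cosine measure are all assumptions chosen for brevity. Only the five-word minimum fragment length and the idea of substitution rules with probabilistic weights come from the abstract.

```python
import math
import re
from collections import Counter, defaultdict

MIN_FRAGMENT_WORDS = 5  # minimum fragment length recommended in the abstract


def words(text):
    """Lowercase word tokens; a deliberately crude tokenizer (assumption)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def originality(text, bank, n=MIN_FRAGMENT_WORDS):
    """Classical method: share of n-word fragments of `text` found in no bank text."""
    def shingles(tokens):
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    own = shingles(words(text))
    if not own:
        return 1.0  # too short to contain any n-word fragment
    known = set()
    for doc in bank:
        known |= shingles(words(doc))
    return len(own - known) / len(own)


def word_class(token):
    """Hypothetical stand-in for a syntactic category, not the authors' scheme."""
    if token in {"the", "a", "an", "of", "in", "to", "and", "is", "by", "for"}:
        return "F"  # function word
    return "S" if len(token) <= 5 else "L"  # short or long content word


def rule_model(text):
    """Rules 'class -> next class'; weights sum to 1 for each left-hand side."""
    classes = [word_class(t) for t in words(text)]
    counts = defaultdict(Counter)
    for left, right in zip(classes, classes[1:]):
        counts[left][right] += 1
    return {
        (left, right): c / sum(rights.values())
        for left, rights in counts.items()
        for right, c in rights.items()
    }


def model_similarity(m1, m2):
    """Cosine similarity between two rule-weight vectors (an assumed measure)."""
    dot = sum(w * m2.get(rule, 0.0) for rule, w in m1.items())
    norm = (math.sqrt(sum(w * w for w in m1.values()))
            * math.sqrt(sum(w * w for w in m2.values())))
    return dot / norm if norm else 0.0
```

Calling `originality(section, bank)` needs a text bank, whereas `model_similarity(rule_model(a), rule_model(b))` compares only two compact models. The study's hypothesis is that scores of the second kind correlate with scores of the first kind.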


