On the recognition of the author of a text fragment based on the frequency of alphabetic bigrams

Abdunabi A. Kosimov
Analysis and data processing systems, published 2022-03-25
DOI: https://doi.org/10.17212/2782-2001-2022-1-73-82

Abstract

Using a model collection of Tajik literary works, this paper studies whether the authorship of a minimum-size text fragment extracted from the collection can be determined. The model collection consists of Tajik-language texts in Cyrillic script, comprising works of classical poetry and modern prose. Each work is associated with a digital portrait: the frequency distribution of its character bigrams. Bigrams prove to be acceptable quantitative characteristics for identifying the authors of texts. A γ-classifier is used as the tool for the task; it identifies the authors of texts from the frequencies of alphabetic bigrams with a sufficiently high degree of efficiency. The mathematical model of the γ-classifier is represented as a triad: the first component is the digital portrait (DP) of the text, that is, the frequency distribution of bigrams in the text; the second is the set of formulas for calculating distances between the DPs of texts; and the third is a machine learning algorithm. The algorithm was tuned on a table of pairwise distances between all works in the model collection by determining the optimal value of the real parameter γ, the value that minimizes the error of violating the "homogeneity" hypothesis. The γ-classifier applied to digital portraits was found to identify the authors of works in the Tajik language. Using the metric classifier together with the nearest-neighbor method (nearest in terms of distance), it was possible to identify the authors of a decreasing sequence of text fragments, from 7000 words (40,000 characters) down to 20 words (100 characters). The minimum sample size, in words or characters, needed to recognize the author of a Tajik text has been determined, and the results of the corresponding experiments are described.
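The pipeline outlined in the abstract (digital portrait, distance between portraits, nearest neighbor) can be illustrated with a minimal Python sketch. This is not the author's γ-classifier: the abstract does not give the distance formulas or the role of γ, so the L1 distance used here is an assumption, and the names `digital_portrait`, `l1_distance`, and `nearest_author` are hypothetical. Counting bigrams only inside words is also an assumption.

```python
from collections import Counter


def digital_portrait(text):
    """Relative frequencies of alphabetic bigrams, counted inside words.

    Assumption: bigrams do not span word boundaries and non-letters are dropped.
    """
    counts = Counter()
    for word in text.lower().split():
        letters = "".join(ch for ch in word if ch.isalpha())
        counts.update(letters[i:i + 2] for i in range(len(letters) - 1))
    total = sum(counts.values())
    return {bigram: n / total for bigram, n in counts.items()} if total else {}


def l1_distance(p, q):
    """Sum of absolute frequency differences (a stand-in for the paper's formulas)."""
    keys = set(p) | set(q)
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def nearest_author(fragment, labelled_texts):
    """Return the author whose text has the portrait closest to the fragment's.

    labelled_texts is a list of (author, text) pairs.
    """
    target = digital_portrait(fragment)
    best = min(
        labelled_texts,
        key=lambda pair: l1_distance(target, digital_portrait(pair[1])),
    )
    return best[0]
```

Because `str.isalpha` and `str.lower` are Unicode-aware, the same code handles Tajik Cyrillic text without changes. Tuning a parameter such as γ on a table of pairwise distances would sit on top of this and is deliberately left out, since its definition is not stated in the abstract.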