Author: Abdunabi A. Kosimov
Journal: Analysis and data processing systems
Published: 2022-03-25
DOI: 10.17212/2782-2001-2022-1-73-82 (https://doi.org/10.17212/2782-2001-2022-1-73-82)
On the recognition of the author of a text fragment based on the frequency of alphabetic bigrams
Using a model collection of Tajik literary works, this paper studies whether the author of a text fragment of minimal size, extracted from the collection, can be determined. The collection consists of works of classical poetry and modern prose in the Tajik language, written in Cyrillic script. Each work is associated with a digital portrait (DP): the distribution of the frequencies of character bigrams in its text. Bigram frequencies prove to be adequate quantitative features for identifying the authors of texts. The task is carried out with a γ-classifier, which identifies the author of a text from the frequencies of its alphabetic bigrams with a sufficiently high degree of effectiveness. The mathematical model of the γ-classifier is a triad: the first component is the DP of the text, the second is the set of formulas for computing distances between the DPs of texts, and the third is a machine learning algorithm. The algorithm is tuned on a table of pairwise distances between all works in the model collection by finding the value of the real parameter γ that minimizes the error of violating the “homogeneity” hypothesis. The γ-classifier, working from digital portraits, was found to be able to identify the authors of works in the Tajik language. Using this metric classifier together with the nearest-neighbor method (nearest in terms of distance), the authors were identified for a decreasing sequence of fragment sizes, from 7000 words (40,000 characters) down to 20 words (100 characters). The minimum sample size, in words or characters, needed to recognize the author of a Tajik text has been determined, and the results of the corresponding experiments are described.
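The pipeline described above (digital portrait, distance between portraits, nearest neighbor) can be illustrated with a minimal Python sketch. The abstract does not give the γ-dependent distance formula or the tuning procedure, so this sketch substitutes a plain L1 distance as a stand-in and omits γ entirely. The function names `digital_portrait`, `l1_distance`, and `nearest_author`, and the choice to count bigrams only within words, are assumptions made for illustration, not the author's implementation.

```python
from collections import Counter
from typing import Dict, Mapping


def digital_portrait(text: str) -> Dict[str, float]:
    """Relative frequencies of adjacent-letter bigrams in `text`."""
    # Assumption: keep letters only, lowercase them, and count bigrams
    # within words, so pairs spanning spaces or punctuation are ignored.
    cleaned = "".join(ch.lower() if ch.isalpha() else " " for ch in text)
    words = cleaned.split()
    counts = Counter(w[i:i + 2] for w in words for i in range(len(w) - 1))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {bigram: n / total for bigram, n in counts.items()}


def l1_distance(p: Mapping[str, float], q: Mapping[str, float]) -> float:
    """Stand-in distance (not the paper's γ-dependent formula):
    sum of absolute frequency differences over all bigrams seen."""
    keys = set(p) | set(q)
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def nearest_author(fragment: str, reference_texts: Mapping[str, str]) -> str:
    """Return the author whose reference text has the closest portrait."""
    fragment_dp = digital_portrait(fragment)
    return min(
        reference_texts,
        key=lambda author: l1_distance(
            fragment_dp, digital_portrait(reference_texts[author])
        ),
    )
```

Calling `nearest_author(fragment, {"author_a": text_a, "author_b": text_b})` returns the key whose text has the smallest portrait distance to the fragment. Because `str.isalpha` is Unicode-aware, the same code applies to Tajik Cyrillic text without changes.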