Corpus of the Archival Documents of the Don Cossack Army: Problems of Morphological Analysis
Authors: O. Gorban, M. Kosova, E. Sheptukhina, A. Svetlov
Journal: Vestnik Volgogradskogo Gosudarstvennogo Universiteta-Seriya 2-Yazykoznanie
Published: 2022-12-01
DOI: 10.15688/jvolsu2.2022.6.4 (https://doi.org/10.15688/jvolsu2.2022.6.4)
The article presents the results of a collective project to compile a special annotated diachronic corpus of 18th and 19th century documents from the "Mikhailovsky Stanitsa Ataman" Archive Fund (State Archive of Volgograd Region, Russia). In the course of the work, the authors solved linguistic, technical and software tasks related to meta-markup, morphological tagging and the representation of tagged texts in an electronic search environment. The texts are written in 18th-century cursive script using old Cyrillic letters with their own spelling conventions. To process them correctly, an add-on to the stemming tool MyStem by I. Segalovich was created. The add-on extends MyStem with support for old Cyrillic symbols, a convenient graphical interface, manual removal of homonymy, and export of tagged text to an external data storage and processing system. Morphological analysis of some texts revealed variant nominal case forms that are noted neither in the "Russian Grammar" by M.V. Lomonosov nor in modern studies of 18th-century literary texts. These findings point to the effectiveness of automatic tagging, which allows word forms to be corrected. The results justify adjusting the text-tagging software to extend the grammatical analysis options for homonymous forms, so that homonymy can be identified and removed manually. A quantitative analysis of these variants will allow the authors to evaluate their significance for the regional administrative language. The information obtained confirms the importance of creating the corpus for studying the history of the Russian language.
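The abstract does not describe how the add-on handles old Cyrillic symbols. One common approach, shown here purely as an illustrative sketch and not as the authors' implementation, is to normalize pre-reform spelling to modern orthography before morphological analysis, while keeping the original form for display and search. The letter mapping `OLD_TO_MODERN` and the functions `normalize_token` and `normalize_text` are hypothetical names introduced for this example, and the mapping is deliberately incomplete. No MyStem call is made; the normalized forms are simply what such a tool could be given as input.

```python
import re

# Hypothetical, incomplete mapping from pre-reform Cyrillic letters to
# modern ones. This is an assumption for illustration, not the authors' table.
OLD_TO_MODERN = {
    "\u0463": "е", "\u0462": "Е",  # yat
    "\u0456": "и", "\u0406": "И",  # decimal i
    "\u0473": "ф", "\u0472": "Ф",  # fita
    "\u0475": "и", "\u0474": "И",  # izhitsa
}

_TRANSLATION = str.maketrans(OLD_TO_MODERN)

# Pre-1918 orthography wrote a hard sign after a word-final consonant.
# The \b boundary restricts removal to the end of a word.
_FINAL_HARD_SIGN = re.compile(r"[ъЪ]\b")


def normalize_token(token: str) -> str:
    """Return a modern-orthography form of a single word."""
    modern = token.translate(_TRANSLATION)
    return _FINAL_HARD_SIGN.sub("", modern)


def normalize_text(text: str) -> list[tuple[str, str]]:
    """Pair every word of the text with its normalized form.

    Keeping the original next to the normalized form lets a corpus
    show the source spelling while tagging the modern one.
    """
    words = re.findall(r"\w+", text)
    return [(word, normalize_token(word)) for word in words]


if __name__ == "__main__":
    sample = "въ л\u0463съ"
    for original, modern in normalize_text(sample):
        print(original, "->", modern)
```

Pairing each original word with its normalized form means the tagger's output can be attached back to the source spelling, which is the form a researcher of historical orthography would want to see. A word-internal hard sign, as in "объявить", is left untouched because only the word-final position is matched.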