Data-Driven Approach to Identification of Latin Phrases in Russian Web-Crawled Corpora
V. Benko, K. Rausova
Intelligent Memory Systems. DOI: https://doi.org/10.17586/2541-9781-2020-4-11-20
Latin phrases are an integral part of the language of educated speakers in many (European) languages. Besides calques and lexical units of Latin origin that have already been adapted to the orthography of the respective host language, phrases retaining their original form and orthography can also be found in many texts. Because the phenomenon is rather infrequent, however, any systematic attempt at its analysis was a real challenge before the advent of very large (multi-gigaword) corpora. Our paper presents a method for semi-automatic detection of Latin phrases in a Russian web corpus, based on applying a Latin tagger and a series of filtering steps performed with standard Linux utilities. A preliminary analysis of the resulting candidate list is presented in the concluding part of the paper.
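As a rough illustration of the candidate-extraction idea, the following Python sketch collects runs of two or more Latin-script words from Russian text and ranks them by frequency, much as a `sort | uniq -c` pipeline would. It is not the authors' pipeline: the function name `latin_phrase_candidates`, the `min_freq` threshold, and the sample sentences are all assumptions made for illustration, and the sketch does not use a Latin tagger, so it cannot tell Latin apart from other languages written in the Latin alphabet.

```python
import re
from collections import Counter

# Two or more consecutive ASCII-letter words separated by single spaces.
# Assumption: Cyrillic text never matches, so only Latin-script runs survive.
LATIN_RUN = re.compile(r"[A-Za-z]+(?: [A-Za-z]+)+")


def latin_phrase_candidates(lines, min_freq=2):
    """Return (phrase, count) pairs for Latin-script word runs seen at least
    min_freq times, most frequent first (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        for match in LATIN_RUN.finditer(line):
            # Lowercase so that "Ad hoc" and "ad hoc" are counted together.
            counts[match.group(0).lower()] += 1
    return [(phrase, n) for phrase, n in counts.most_common() if n >= min_freq]


# Made-up sample sentences, not taken from the corpus used in the paper.
sample = [
    "Это было сделано ad hoc, без плана.",
    "Решение ad hoc не всегда плохо.",
    "Он купил iPhone и принял de facto стандарт.",
]

candidates = latin_phrase_candidates(sample)
all_runs = latin_phrase_candidates(sample, min_freq=1)
```

A frequency threshold like `min_freq` only removes one-off noise. Separating genuine Latin phrases from, say, English product names would need a further step such as the Latin tagger mentioned in the abstract.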