Towards a Historical Treebank of Middle and Early Modern Welsh, Part I: Workflow and POS Tagging
M. Meelen, David Willis
Journal of Celtic Linguistics, pp. 125-154. Published 2021-01-01.
DOI: https://doi.org/10.16922/JCL.22.6
This article introduces the working methods of the Parsed Historical Corpus of the Welsh Language (PARSHCWL). The corpus is designed to give researchers a tool for automatic, exhaustive extraction of instances of grammatical structures from Middle and Modern Welsh texts, comparable to the tools that already exist for various European languages. The major features of the corpus are outlined, along with the overall architecture of the workflow needed for a team of researchers to produce it. In this paper, the first two stages of the process, namely pre-processing of texts and automated part-of-speech (POS) tagging, are discussed in some detail, focusing in particular on the major issues involved in defining word boundaries and in defining a robust and useful tagset.
About the journal:
The Journal of Celtic Linguistics publishes articles and reviews on all aspects of the linguistics of the Celtic languages, modern, medieval and ancient, with particular emphasis on synchronic studies, while not excluding diachronic and comparative-historical work. Papers are invited in English on all fields or 'levels' of analysis: phonology, morphology, syntax, semantics; formal or functional, cross-language typological or language-internal, dialectological or sociolinguistic, in any theoretical paradigm.