The structuralist tradition meets empirical data: Corpus data enhancing the Czech Internet Language Reference Book

Impact factor: 0.7 · Category: LANGUAGE & LINGUISTICS
Dominika Kováříková, Martin Beneš, Kamila Smejkalová, Oleg Kovářík
Journal: Word Structure
Published: 2023-11-01 · Journal Article
DOI: 10.3366/word.2023.0230 (https://doi.org/10.3366/word.2023.0230)
Citations: 0

Abstract

This paper demonstrates how the corpus grammar tool GramatiKat can be used to improve and refine morphological information in the Internet Language Reference Book (ILRB), which presents complete declension paradigms for 45,632 standard Czech nouns. The paradigm tables are based mainly on morphological types, following structuralist conceptions of language as a fully articulated system. The paper discusses how to update the ILRB and provide users with empirically based grammatical information for individual word forms in each cell of the paradigm. All noun lemmas have been investigated using the GramatiKat tool for research into grammatical categories in Czech. The tool observes the distribution of word forms of a particular lexeme in comparison with the standard distribution across the whole word class. It is capable of identifying nouns that have an unusually high occurrence of a certain word form, as well as nouns with unattested word forms. GramatiKat is based on the data from two corpora of Czech written texts, SYN2015 and SYN2020 (200 million word tokens). The paper investigates the relationship between defectiveness and overabundance on one side and language variability and potentiality on the other. Based on the unique combination of data from the ILRB and GramatiKat, the paper suggests how information about unusually frequent or overabundant word forms as well as unattested ones should be pointed out, so that ILRB provides the user with accurate, empirically based data.
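
The comparison described in the abstract (a lexeme's word-form distribution set against the distribution for the whole word class) can be illustrated with a minimal Python sketch. This is not GramatiKat's actual method or API: the function names, the ratio threshold of 3.0, the reduction of the paradigm to seven case cells without number, and all counts and baseline proportions below are invented for illustration only.

```python
# Hypothetical sketch: flag paradigm cells of one noun lemma that are
# unattested or unusually frequent relative to a word-class baseline.
# All numbers are made-up toy data, not corpus results.

CASES = ["nom", "gen", "dat", "acc", "voc", "loc", "ins"]


def proportions(counts):
    """Turn raw per-case counts into proportions that sum to 1."""
    total = sum(counts.get(case, 0) for case in CASES)
    if total == 0:
        return {case: 0.0 for case in CASES}
    return {case: counts.get(case, 0) / total for case in CASES}


def flag_cells(lemma_counts, baseline, ratio_threshold=3.0):
    """Return a dict mapping case -> flag for cells worth annotating.

    A cell is 'unattested' if the lemma never occurs in that case, and
    'unusually frequent' if its share is at least `ratio_threshold`
    times the baseline share (an assumed, arbitrary cut-off).
    """
    props = proportions(lemma_counts)
    flags = {}
    for case in CASES:
        if lemma_counts.get(case, 0) == 0:
            flags[case] = "unattested"
        elif baseline[case] > 0 and props[case] / baseline[case] >= ratio_threshold:
            flags[case] = "unusually frequent"
    return flags


# Toy baseline for the whole noun class (invented proportions).
toy_baseline = {"nom": 0.25, "gen": 0.30, "dat": 0.04, "acc": 0.20,
                "voc": 0.01, "loc": 0.12, "ins": 0.08}

# Toy counts for one hypothetical lemma that occurs mostly in the locative.
toy_counts = {"nom": 5, "gen": 5, "acc": 5, "loc": 85}

flags = flag_cells(toy_counts, toy_baseline)
for case, flag in flags.items():
    print(case, flag)
```

In this toy setting the locative cell would be marked as unusually frequent and the dative, vocative, and instrumental cells as unattested, which is the kind of per-cell annotation the abstract proposes adding to the ILRB paradigm tables. A real analysis would also need to separate singular from plural and to account for corpus size before treating a missing form as defective.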
Source journal: Word Structure (LANGUAGE & LINGUISTICS)
CiteScore: 1.60 · Self-citation rate: 0.00% · Articles published: 10