A corpus-based approach to reevaluation of Croatian verb classification

INFuture2019: Knowledge in the Digital Age Pub Date : 1900-01-01 DOI:10.17234/infuture.2019.6

Danijel Blazsetin, Petra Bago



Abstract

Croatian grammar textbooks have a long tradition of classifying verbs based on their morphosyntactic characteristics. Conclusions about, for example, the frequency or productivity of a class were drawn without insight into a large corpus. The corpora used in such descriptions were not documented and presumably consisted of literary works, which, in our opinion, represent a form of the Croatian language distant from its everyday use. The corpus used for analyzing verbs in this paper is hrWaC, which contains 1.9 billion tokens and about 90,000 verbs. This corpus was selected with the intention of describing and analyzing a less formal and less standardized language. This paper offers a corpus-based approach to the problem of verb classification and emphasizes the importance of NLP methods in the classification process, as they speed it up and simplify it. The paper gives a brief introduction to verbs, their morphological characteristics, and their classification. By extracting verbs from the Croatian web corpus hrWaC and processing them computationally, the paper gives insight into the distribution of verbs in the Croatian language and points out some difficulties encountered during the study. Even though this paper aimed to reevaluate the existing data, the present findings mostly confirm the claims of previous research. A number of recommendations for future research are given, foremost the need to extend the language material.

