Using Meta-Morph Rules to develop Morphological Analysers: A case study concerning Tamil

Finite-State Methods and Natural Language Processing. Pub Date: 2019-09-01. DOI: 10.18653/v1/W19-3111

Kengatharaiyer Sarveswaran, G. Dias, Miriam Butt


Citations: 10

Abstract

This paper describes a new, larger-coverage Finite-State Morphological Analyser (FSM) and Generator for the Dravidian language Tamil. The FSM has been developed in the context of computational grammar engineering, adhering to the standards of the ParGram effort. Tamil is a morphologically rich language, and the interaction between linguistic analysis and formal implementation is complex, which makes the task challenging. To allow the development of the FSM to focus more on the linguistic analysis and less on the formal details, we have developed a system of meta-morph(ology) rules along with a script which translates these rules into FSM-processable representations. The introduction of meta-morph rules makes it possible for computationally naive linguists to interact with the system and to expand it in future work. We found that the meta-morph rules help to express linguistic generalisations and reduce the manual effort of writing lexical classes for morphological analysis. Our Tamil FSM currently handles mainly the inflectional morphology of 3,300 verb roots and their 260 forms. It also has a lexicon of approximately 100,000 nouns along with a guesser to handle out-of-vocabulary items. Although the Tamil FSM was primarily developed to be part of a computational grammar, it can also be used as a web or stand-alone application for other NLP tasks, as per general ParGram practice.
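
The abstract does not show the meta-morph rule syntax or the translation script, so the following is only a minimal sketch of the general idea: a compact, linguist-facing rule is expanded by a script into lexc-style lexical classes that a finite-state toolkit could compile. The rule format (`ClassName: +Tag=suffix, ...`), the function names `parse_meta_rule` and `to_lexc`, and the placeholder root, tags, and suffixes are all assumptions made for illustration; they are not the authors' format and not real Tamil data.

```python
from typing import Dict, List, Tuple

# (class name, [(tag, suffix), ...]) -- a hypothetical in-memory form of one rule.
Paradigm = Tuple[str, List[Tuple[str, str]]]


def parse_meta_rule(line: str) -> Paradigm:
    """Parse a hypothetical rule 'ClassName: +Tag=suffix, +Tag=suffix'."""
    name, _, body = line.partition(":")
    pairs = []
    for item in body.split(","):
        tag, _, suffix = item.strip().partition("=")
        pairs.append((tag.strip(), suffix.strip()))
    return name.strip(), pairs


def to_lexc(roots: Dict[str, str], rules: List[str]) -> str:
    """Expand roots (root -> class name) and rules into lexc-style source text."""
    paradigms = [parse_meta_rule(rule) for rule in rules]
    # Every tag is declared once as a multi-character symbol.
    tags = sorted({tag for _, pairs in paradigms for tag, _ in pairs})
    out = ["Multichar_Symbols " + " ".join(tags), "", "LEXICON Root"]
    for root, cls in sorted(roots.items()):
        # Each root continues into the lexical class that inflects it.
        out.append(f"{root} {cls} ;")
    for name, pairs in paradigms:
        out.append("")
        out.append(f"LEXICON {name}")
        for tag, suffix in pairs:
            # Upper side carries the analysis tag, lower side the surface suffix.
            out.append(f"{tag}:{suffix} # ;")
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Placeholder data only: 'rootx', 'xa', and 'xb' are not Tamil forms.
    rules = ["VClassA: +Past=xa, +Fut=xb"]
    roots = {"rootx": "VClassA"}
    print(to_lexc(roots, rules))
```

In this sketch a single rule line stands in for a whole lexical class, which is the kind of manual effort the abstract says the meta-morph rules reduce; the real system presumably covers far more (for example, the roughly 260 forms per verb root mentioned above).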

