Normalized dataset for Sanskrit word segmentation and morphological parsing

IF 1.8 3区计算机科学 Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS

Language Resources and Evaluation Pub Date : 2024-08-28 DOI:10.1007/s10579-024-09724-0

Sriram Krishnan, Amba Kulkarni, Gérard Huet

{"title":"Normalized dataset for Sanskrit word segmentation and morphological parsing","authors":"Sriram Krishnan, Amba Kulkarni, Gérard Huet","doi":"10.1007/s10579-024-09724-0","DOIUrl":null,"url":null,"abstract":"<p>Sanskrit processing has seen a surge in the use of data-driven approaches over the past decade. Various tasks such as segmentation, morphological parsing, and dependency analysis have been tackled through the development of state-of-the-art models despite working with relatively limited datasets compared to other languages. However, a significant challenge lies in the availability of annotated datasets that are lexically, morphologically, syntactically, and semantically tagged. While syntactic and semantic tags are preferable for later stages of processing such as sentential parsing and disambiguation, lexical and morphological tags are crucial for low-level tasks of word segmentation and morphological parsing. The Digital Corpus of Sanskrit (DCS) is one notable effort that hosts over 650,000 lexically and morphologically tagged sentences from around 250 texts but also comes with its limitations at different levels of a sentence like chunk, segment, stem and morphological analysis. To overcome these limitations, we look at alternatives such as Sanskrit Heritage Segmenter (SH) and <i>Saṃsādhanī</i> tools, that provide information complementing DCS’ data. This work focuses on enriching the DCS dataset by incorporating analyses from SH, thereby creating a dataset that is rich in lexical and morphological information. Furthermore, this work also discusses the impact of such datasets on the performances of existing segmenters, specifically the Sanskrit Heritage Segmenter.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"14 1","pages":""},"PeriodicalIF":1.8000,"publicationDate":"2024-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Language Resources and Evaluation","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s10579-024-09724-0","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS","Score":null,"Total":0}

引用次数: 0

Abstract

Sanskrit processing has seen a surge in the use of data-driven approaches over the past decade. Various tasks such as segmentation, morphological parsing, and dependency analysis have been tackled through the development of state-of-the-art models despite working with relatively limited datasets compared to other languages. However, a significant challenge lies in the availability of annotated datasets that are lexically, morphologically, syntactically, and semantically tagged. While syntactic and semantic tags are preferable for later stages of processing such as sentential parsing and disambiguation, lexical and morphological tags are crucial for low-level tasks of word segmentation and morphological parsing. The Digital Corpus of Sanskrit (DCS) is one notable effort that hosts over 650,000 lexically and morphologically tagged sentences from around 250 texts but also comes with its limitations at different levels of a sentence like chunk, segment, stem and morphological analysis. To overcome these limitations, we look at alternatives such as Sanskrit Heritage Segmenter (SH) and Saṃsādhanī tools, that provide information complementing DCS’ data. This work focuses on enriching the DCS dataset by incorporating analyses from SH, thereby creating a dataset that is rich in lexical and morphological information. Furthermore, this work also discusses the impact of such datasets on the performances of existing segmenters, specifically the Sanskrit Heritage Segmenter.

Abstract Image

查看原文本刊更多论文

用于梵语单词分割和形态解析的规范化数据集

在过去十年中，梵语处理中数据驱动方法的使用激增。尽管与其他语言相比，使用的数据集相对有限，但通过开发最先进的模型，已经解决了分段、形态解析和依赖性分析等各种任务。然而，一个重大的挑战在于如何获得带有词法、词形、句法和语义标记的注释数据集。句法和语义标记更适合后期处理阶段，如句法分析和消歧，而词法和形态标记则对单词分割和形态分析等低级任务至关重要。梵文数字语料库（DCS）是一项值得注意的工作，该语料库收录了来自约 250 个文本的超过 650,000 个带有词法和词形标签的句子，但在句子的不同层次（如词块、词段、词干和词形分析）上也有其局限性。为了克服这些局限性，我们研究了梵文遗产分段器（SH）和 Saṃsādhanī 工具等替代工具，它们能提供补充 DCS 数据的信息。这项工作的重点是通过纳入 SH 的分析来丰富 DCS 数据集，从而创建一个词法和形态信息丰富的数据集。此外，这项工作还讨论了此类数据集对现有分词器（特别是梵文遗产分词器）性能的影响。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Language Resources and Evaluation 工程技术-计算机：跨学科应用

CiteScore

6.50

自引率

3.70%

发文量

审稿时长

>12 weeks

期刊介绍： Language Resources and Evaluation is the first publication devoted to the acquisition, creation, annotation, and use of language resources, together with methods for evaluation of resources, technologies, and applications. Language resources include language data and descriptions in machine readable form used to assist and augment language processing applications, such as written or spoken corpora and lexica, multimodal resources, grammars, terminology or domain specific databases and dictionaries, ontologies, multimedia databases, etc., as well as basic software tools for their acquisition, preparation, annotation, management, customization, and use. Evaluation of language resources concerns assessing the state-of-the-art for a given technology, comparing different approaches to a given problem, assessing the availability of resources and technologies for a given application, benchmarking, and assessing system usability and user satisfaction.