Shallow syntax analysis in Sanskrit guided by semantic nets constraints

G. Huet
{"title":"Shallow syntax analysis in Sanskrit guided by semantic nets constraints","authors":"G. Huet","doi":"10.1145/1364742.1364750","DOIUrl":null,"url":null,"abstract":"We present the state of the art of a computational platform for the analysis of classical Sanskrit. The platform comprises modules for phonology, morphology, segmentation and shallow syntax analysis, organized around a structured lexical database. It relies on the Zen toolkit for finite state automata and transducers, which provides data structures and algorithms for the modular construction and execution of finite state machines, in a functional framework.\n Some of the layers proceed in bottom-up synthesis mode - for instance, noun and verb morphological modules generate all inflected forms from stems and roots listed in the lexicon. Morphemes are assembled through internal sandhi, and the inflected forms are stored with morphological tags in dictionaries usable for lemmatizing. These dictionaries are then compiled into transducers, implementing the analysis of external sandhi, the phonological process which merges words together by euphony. This provides a tagging segmenter, which analyses a sentence presented as a stream of phonemes and produces a stream of tagged lexical entries, hyperlinked to the lexicon.\n The next layer is a syntax analyser, guided by semantic nets constraints expressing dependencies between the word forms. Finite verb forms demand semantic roles, according to valency patterns depending on the voice (active, passive) of the form and the governance (transitive, etc) of the root. Conversely, noun/adjective forms provide actors which may fill those roles, provided agreement constraints are satisfied. Tool words are mapped to transducers operating on tagged streams, allowing the modeling of linguistic phenomena such as coordination by abstract interpretation of actor streams. 
The parser ranks the various interpretations (matching actors with roles) with penalties, and returns to the user the minimum penalty analyses, for final validation of ambiguities. The whole platform is organized as a Web service, allowing the piecewise tagging of a Sanskrit text.","PeriodicalId":287514,"journal":{"name":"International Workshop On Research Issues in Digital Libraries","volume":"18 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2006-12-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"38","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Workshop On Research Issues in Digital Libraries","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/1364742.1364750","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 38

Abstract

We present the state of the art of a computational platform for the analysis of classical Sanskrit. The platform comprises modules for phonology, morphology, segmentation and shallow syntax analysis, organized around a structured lexical database. It relies on the Zen toolkit for finite state automata and transducers, which provides data structures and algorithms for the modular construction and execution of finite state machines, in a functional framework.

Some of the layers proceed in bottom-up synthesis mode: for instance, noun and verb morphological modules generate all inflected forms from the stems and roots listed in the lexicon. Morphemes are assembled through internal sandhi, and the inflected forms are stored with morphological tags in dictionaries usable for lemmatizing. These dictionaries are then compiled into transducers implementing the analysis of external sandhi, the phonological process which merges words together by euphony. This provides a tagging segmenter, which analyses a sentence presented as a stream of phonemes and produces a stream of tagged lexical entries, hyperlinked to the lexicon.

The next layer is a syntax analyser, guided by semantic-net constraints expressing dependencies between the word forms. Finite verb forms demand semantic roles, according to valency patterns depending on the voice (active, passive) of the form and the governance (transitive, etc.) of the root. Conversely, noun/adjective forms provide actors which may fill those roles, provided agreement constraints are satisfied. Tool words are mapped to transducers operating on tagged streams, allowing the modeling of linguistic phenomena such as coordination by abstract interpretation of actor streams. The parser ranks the various interpretations (matching actors with roles) with penalties, and returns to the user the minimum-penalty analyses, for final validation of ambiguities.
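As a toy illustration of this ranking step, penalty-based matching of actors to roles can be sketched as follows. All names, word forms, role/case preferences and penalty values here are invented for illustration; the actual platform is an OCaml system built on the Zen toolkit, not this Python caricature.

```python
from itertools import permutations

# Hypothetical toy valency data: in the active voice, the agent role
# prefers a nominative actor and the object role an accusative one.
ROLE_CASE = {"agent": "nominative", "object": "accusative"}
MISMATCH_PENALTY = 2  # illustrative cost of a non-preferred case filling a role

def rank_assignments(roles, actors):
    """Enumerate role/actor matchings and rank them by total penalty."""
    scored = []
    for perm in permutations(actors, len(roles)):
        penalty = sum(
            0 if actor["case"] == ROLE_CASE[role] else MISMATCH_PENALTY
            for role, actor in zip(roles, perm)
        )
        scored.append((penalty, dict(zip(roles, (a["form"] for a in perm)))))
    scored.sort(key=lambda s: s[0])  # minimum-penalty analyses first
    return scored

# Two nominal forms competing for the two roles of a transitive verb.
actors = [
    {"form": "ramah", "case": "nominative"},
    {"form": "phalam", "case": "accusative"},
]
best_penalty, best = rank_assignments(["agent", "object"], actors)[0]
# The minimum-penalty analysis pairs the nominative with the agent role.
```

The real parser scores far richer constraints (agreement, voice, tool words), but the shape is the same: enumerate candidate matchings, score them, and present the cheapest analyses for the user to disambiguate.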
The whole platform is organized as a Web service, allowing the piecewise tagging of a Sanskrit text.
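The tagging segmenter described above can be caricatured in a few lines: given a dictionary of inflected forms and a table of external-sandhi rules, segmentation searches for splits of the phoneme stream, undoing sandhi at word boundaries. The two-word lexicon, the single vowel-sandhi rule (a + i → e), and the simplified, diacritic-free forms below are invented for illustration; the real system compiles such rules and dictionaries into finite-state transducers rather than searching by backtracking.

```python
# Hypothetical miniature lexicon of inflected forms, as flat phoneme strings.
LEXICON = {"rama", "icchati"}
# Each sandhi rule: merged surface string -> possible (final of word 1,
# initial of word 2) pairs it may stand for at a word boundary.
SANDHI = {"e": [("a", "i")]}

def segment(s):
    """Return all segmentations of s into lexicon forms, undoing sandhi."""
    if not s:
        return [[]]
    results = []
    # Plain split: a lexicon form is a literal prefix of the input.
    for i in range(1, len(s) + 1):
        if s[:i] in LEXICON:
            results += [[s[:i]] + rest for rest in segment(s[i:])]
    # Sandhi split: a merged surface sequence straddles the word boundary.
    for surface, splits in SANDHI.items():
        k = len(surface)
        for i in range(1, len(s) - k + 1):
            if s[i:i + k] == surface:
                for final, initial in splits:
                    left, right = s[:i] + final, initial + s[i + k:]
                    if left in LEXICON:
                        results += [[left] + rest for rest in segment(right)]
    return results

print(segment("ramecchati"))  # -> [['rama', 'icchati']]
```

On real input the search returns many candidate segmentations; it is the subsequent tagging and penalty-ranked role matching that prunes them to plausible analyses.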