Graph-Based Annotation Engineering: Towards a Gold Corpus for Role and Reference Grammar

International Conference on Language, Data, and Knowledge
Pub Date: 1900-01-01
DOI: 10.4230/OASIcs.LDK.2019.9

C. Chiarcos, Christian Fäth


Citations: 9

Abstract



This paper describes the application of annotation engineering techniques for the construction of a corpus for Role and Reference Grammar (RRG). RRG is a semantics-oriented formalism for natural language syntax popular in comparative linguistics and linguistic typology, and predominantly applied for the description of non-European languages which are less-resourced in terms of natural language processing. Because of its crosslinguistic applicability and its conjoint treatment of syntax and semantics, RRG also represents a promising framework for research challenges within natural language processing. At the moment, however, these have not been explored as no RRG corpus data is publicly available. While RRG annotations cannot be easily derived from any single treebank in existence, we suggest that they can be reliably inferred from the intersection of syntactic and semantic annotations as represented by, for example, the Universal Dependencies (UD) and PropBank (PB), and we demonstrate this for the English Web Treebank, a 250,000 token corpus of various genres of English internet text. The resulting corpus is a gold corpus for future experiments in natural language processing in the sense that it is built on existing annotations which have been created manually. A technical challenge in this context is to align UD and PB annotations, to integrate them in a coherent manner, and to distribute and to combine their information on RRG constituent and operator projections. For this purpose, we describe a framework for flexible and scalable annotation engineering based on flexible, unconstrained graph transformations of sentence graphs by means of SPARQL Update.

2012 ACM Subject Classification: Computing methodologies → Language resources; Information systems → Semantic web description languages; Computing methodologies → Natural language processing; Computing methodologies → Lexical semantics
