Latent Variable Grammars for Discontinuous Parsing
Kilian Gebhardt
Finite-State Methods and Natural Language Processing, 2019-09-01
DOI: 10.18653/v1/W19-3103 (https://doi.org/10.18653/v1/W19-3103)
Abstract
Latent variable context-free grammars are powerful models for predicting the syntactic structure of sentences (Matsuzaki, Miyao, and Tsujii 2005; Petrov, Barrett, et al. 2006; Petrov and Klein 2007). When such grammars are trained on annotated corpora, the resulting latent variables can be shown to capture different distributions for, e.g., NPs in subject and object position. Several languages (and consequently the syntactic treebanks for these languages), such as Dutch (Lassy, van Noord 2009) and German (NeGra, Skut et al. 1997; TiGer, Brants et al. 2004), but also English (Penn Treebank, Marcus, Santorini, and Marcinkiewicz 1993; Evang and Kallmeyer 2011), contain structures that cannot be adequately modelled by context-free grammars. Consequently, a class of more powerful grammar formalisms called mildly context-sensitive has been studied (cf. Kallmeyer 2010). Although parsing with these models is polynomial in the length of the input sentence (Seki et al. 1991), it was long regarded as prohibitively slow. In recent years, however, it has been shown that applying mildly context-sensitive grammars is feasible in coarse-to-fine parsing approaches (van Cranenburgh 2012; Ruprecht and Denkinger 2019). In this talk I consider how the latent variable approach and mildly context-sensitive grammars can be combined and applied to discontinuous treebanks:
1. A large class of latent variable grammars can be captured as a probabilistic regular tree grammar combined with an algebra. I show how the training methodology of latent variable PCFGs can be generalized to this class.
2. I recall two mildly context-sensitive grammar formalisms: linear context-free rewriting systems (LCFRS; Vijay-Shanker, Weir, and Joshi 1987) and hybrid grammars (Nederhof and Vogler 2014; Gebhardt, Nederhof, and Vogler 2017). In particular, I consider the induction of hybrid grammars, which can be parametrized such that the polynomial complexity of parsing is of bounded degree. In this way, hybrid grammars that are structurally equivalent to finite-state automata can also be obtained.
3. I analyse different trends when training latent variable LCFRS and hybrid grammars on different discontinuous treebanks and applying them to parsing.
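To make the notion of a discontinuous constituent concrete, here is a minimal sketch of how an LCFRS derivation assembles its yield. It assumes a simplified encoding that is not taken from the talk: the `Node` class, the `yield_of` and `fanout` helpers, and the example sentence are all illustrative. Each node carries a composition function given as a list of components; a component mixes terminal words with variables `(child, part)` that refer to one component of a child's yield. A nonterminal whose yield has two components (fan-out 2) covers two separate spans of the sentence, which a context-free rule cannot express.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union

# A variable (child index, component index) points at one component
# of the yield of one child; a plain string is a terminal word.
Var = Tuple[int, int]
Symbol = Union[str, Var]


@dataclass
class Node:
    """One step of a (hypothetical, simplified) LCFRS derivation."""
    label: str
    components: List[List[Symbol]]           # the composition function
    children: List["Node"] = field(default_factory=list)


def fanout(node: Node) -> int:
    """Number of string components the node's yield consists of."""
    return len(node.components)


def yield_of(node: Node) -> Tuple[str, ...]:
    """Evaluate the composition functions bottom-up."""
    child_yields = [yield_of(child) for child in node.children]
    result = []
    for component in node.components:
        words = []
        for sym in component:
            if isinstance(sym, str):
                words.append(sym)             # terminal word
            else:
                child, part = sym             # substitute a child's component
                words.append(child_yields[child][part])
        result.append(" ".join(words))
    return tuple(result)


# Illustrative sentence with a discontinuous NP ("A hearing ... on the issue").
np_node = Node("NP", [["A", "hearing"], ["on", "the", "issue"]])  # fan-out 2
vp_node = Node("VP", [["is", "scheduled"]])                       # fan-out 1
s_node = Node(
    "S",
    # NP part 0, then the VP, then NP part 1, then a terminal.
    [[(0, 0), (1, 0), (0, 1), "today"]],
    [np_node, vp_node],
)

print(yield_of(np_node))  # ('A hearing', 'on the issue')
print(yield_of(s_node))   # ('A hearing is scheduled on the issue today',)
```

In this sketch the S rule interleaves the two NP components with the VP, so the NP stays a single node in the derivation even though its words are not adjacent. Bounding the fan-out is one way the degree of the parsing polynomial is kept in check.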