{"title":"Exploiting limited data for parsing","authors":"Dongchen Li, Xiantao Zhang, Xihong Wu","doi":"10.1109/ICIS.2014.6912128","DOIUrl":null,"url":null,"abstract":"Data sparsity issues are extremely severe for parser due to the flexibility of tree structures. Many tags and productions appears a little, nevertheless, they are crucial for the parse disambiguation where it occurs. Besides, when a common tag somewhat regularly occurs in a non-canonical position, its distribution is usually distinct. In this paper, we propose a metric that measures the scarcity of any phrase with arbitrary span size. To make a better compromise between training trees with high confidence and scarcity, we try to catch some constraints in response to rare but articulating categories when training latent variable grammar. We exploits the limited data more sufficiently by capturing the depicting power of rate tree structure configuration in Expectation & Maximization procedure and Split & Merge framework. The resulting grammars are interpretable as our intension. Based on this approach, we further propose a method that exploits the limited training date from multiple perspectives, and accumulates their advantages in a product model. Despite its limited training data, out model improves parsing performance on Penn Chinese Treebank Fifth Edition, even higher than some systems with extra unlabeled data and external resources. Furthermore, this method is easy to generalized to cope with data sparsity in other natural language processing tasks.","PeriodicalId":237256,"journal":{"name":"2014 IEEE/ACIS 13th International Conference on Computer and Information Science (ICIS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2014 IEEE/ACIS 13th International Conference on Computer and Information Science (ICIS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICIS.2014.6912128","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
Data sparsity issues are extremely severe for parsers due to the flexibility of tree structures. Many tags and productions appear only rarely; nevertheless, they are crucial for parse disambiguation where they occur. Moreover, when a common tag occurs somewhat regularly in a non-canonical position, its distribution is usually distinct. In this paper, we propose a metric that measures the scarcity of any phrase with arbitrary span size. To strike a better balance between training trees with high confidence and those with high scarcity, we impose constraints for rare but discriminative categories when training a latent-variable grammar. We exploit the limited data more fully by capturing the descriptive power of rare tree-structure configurations in the Expectation-Maximization procedure and the Split-Merge framework. The resulting grammars are interpretable, as intended. Building on this approach, we further propose a method that exploits the limited training data from multiple perspectives and accumulates their advantages in a product model. Despite its limited training data, our model improves parsing performance on the Penn Chinese Treebank (Fifth Edition), surpassing even some systems that use extra unlabeled data and external resources. Furthermore, this method generalizes easily to cope with data sparsity in other natural language processing tasks.
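The abstract leaves the scarcity metric undefined. As an illustrative sketch only (not the authors' formulation), one plausible reading, assuming scarcity is computed from production frequencies in the training treebank, scores a phrase of any span by the average negative log relative frequency of the productions that build it, so rare configurations score high. The function names and tree encoding below are hypothetical:

    import math
    from collections import Counter

    def production_counts(trees):
        """Count CFG productions (parent -> tuple of child labels) in a
        treebank. Trees are assumed to be nested tuples
        (label, child1, child2, ...), with lexical leaves as plain strings."""
        counts = Counter()

        def visit(node):
            if isinstance(node, str):  # lexical leaf: no production
                return
            label, children = node[0], node[1:]
            rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
            counts[(label, rhs)] += 1
            for child in children:
                visit(child)

        for tree in trees:
            visit(tree)
        return counts

    def phrase_scarcity(subtree, counts, total):
        """Average negative log relative frequency (add-one smoothed) of the
        productions building the phrase; higher means scarcer."""
        prods = []

        def collect(node):
            if isinstance(node, str):
                return
            label, children = node[0], node[1:]
            prods.append((label, tuple(c if isinstance(c, str) else c[0]
                                       for c in children)))
            for child in children:
                collect(child)

        collect(subtree)
        if not prods:
            return 0.0
        vocab = len(counts) + 1  # add-one smoothing denominator
        return sum(-math.log((counts[p] + 1) / (total + vocab))
                   for p in prods) / len(prods)

For example, with counts = production_counts(treebank) and total = sum(counts.values()), phrase_scarcity(("NP", ("DT", "the"), ("NN", "dog")), counts, total) grows as the NP's internal configuration becomes rarer in the treebank.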
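The product model is likewise described only at a high level. A minimal sketch, assuming each component grammar (trained from a different perspective on the same limited data) exposes a log-probability scorer over candidate parses, combines the components by summing log-probabilities, i.e., multiplying probabilities; all names here are hypothetical:

    def product_model_rerank(candidates, grammar_log_probs):
        """Select the candidate parse maximizing the product of the component
        grammars' probabilities, computed as a sum of log-probabilities.

        candidates:        candidate parse trees for one sentence.
        grammar_log_probs: one scoring function per component grammar, each
                           mapping a parse tree to its log-probability."""
        def combined_score(parse):
            return sum(score(parse) for score in grammar_log_probs)

        return max(candidates, key=combined_score)

A product combination rewards parses that every component grammar finds plausible, which is one way the individual grammars' advantages could accumulate as the abstract describes.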