Exploiting limited data for parsing

Dongchen Li, Xiantao Zhang, Xihong Wu
{"title":"Exploiting limited data for parsing","authors":"Dongchen Li, Xiantao Zhang, Xihong Wu","doi":"10.1109/ICIS.2014.6912128","DOIUrl":null,"url":null,"abstract":"Data sparsity issues are extremely severe for parser due to the flexibility of tree structures. Many tags and productions appears a little, nevertheless, they are crucial for the parse disambiguation where it occurs. Besides, when a common tag somewhat regularly occurs in a non-canonical position, its distribution is usually distinct. In this paper, we propose a metric that measures the scarcity of any phrase with arbitrary span size. To make a better compromise between training trees with high confidence and scarcity, we try to catch some constraints in response to rare but articulating categories when training latent variable grammar. We exploits the limited data more sufficiently by capturing the depicting power of rate tree structure configuration in Expectation & Maximization procedure and Split & Merge framework. The resulting grammars are interpretable as our intension. Based on this approach, we further propose a method that exploits the limited training date from multiple perspectives, and accumulates their advantages in a product model. Despite its limited training data, out model improves parsing performance on Penn Chinese Treebank Fifth Edition, even higher than some systems with extra unlabeled data and external resources. Furthermore, this method is easy to generalized to cope with data sparsity in other natural language processing tasks.","PeriodicalId":237256,"journal":{"name":"2014 IEEE/ACIS 13th International Conference on Computer and Information Science (ICIS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2014 IEEE/ACIS 13th International Conference on Computer and Information Science (ICIS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICIS.2014.6912128","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Data sparsity is especially severe for parsing because of the flexibility of tree structures. Many tags and productions appear only rarely, yet they are crucial for disambiguation wherever they occur. Moreover, when a common tag regularly occurs in a non-canonical position, its distribution is usually distinct. In this paper, we propose a metric that measures the scarcity of any phrase, for spans of arbitrary size. To strike a better balance between high-confidence and scarce training trees, we impose constraints targeting rare but discriminative categories when training a latent variable grammar. We exploit the limited data more fully by capturing the descriptive power of rare tree-structure configurations in the Expectation-Maximization procedure and the Split-Merge framework, and the resulting grammars are interpretable, as intended. Building on this approach, we further propose a method that exploits the limited training data from multiple perspectives and accumulates their advantages in a product model. Despite its limited training data, our model improves parsing performance on the Penn Chinese Treebank, Fifth Edition, surpassing even some systems that use extra unlabeled data and external resources. Furthermore, the method generalizes easily to cope with data sparsity in other natural language processing tasks.
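The abstract does not give the scarcity metric's actual formula. As a rough, hypothetical sketch of the general idea, one could score a phrase by the inverse corpus frequency of the productions in its subtree, so that rare configurations receive high scarcity scores regardless of span size. All names below (`production_counts`, `scarcity`) and the smoothing constant are illustrative assumptions, not the paper's method.

```python
from collections import Counter

def production_counts(treebank):
    """Count (parent, child-labels) productions over a treebank, where
    each tree is a nested (label, [children]) pair and leaves have no
    children. This is just the empirical production frequency table."""
    counts = Counter()
    def walk(node):
        label, children = node
        if children:  # internal node: record its production, then recurse
            counts[(label, tuple(c[0] for c in children))] += 1
            for c in children:
                walk(c)
    for tree in treebank:
        walk(tree)
    return counts

def scarcity(subtree, counts):
    """Hypothetical scarcity score: average inverse frequency of the
    productions inside a subtree. Spans of arbitrary size are handled
    simply by walking the whole subtree. Unseen productions get a
    smoothed count of 0.5 (an arbitrary choice for this sketch)."""
    prods = []
    def walk(node):
        label, children = node
        if children:
            prods.append((label, tuple(c[0] for c in children)))
            for c in children:
                walk(c)
    walk(subtree)
    if not prods:
        return 0.0
    return sum(1.0 / counts.get(p, 0.5) for p in prods) / len(prods)

# Toy usage: NP -> DT NN is frequent, NP -> NP PU is rare, so the
# second subtree scores as much scarcer than the first.
common = ("NP", [("DT", []), ("NN", [])])
rare = ("NP", [common, ("PU", [])])
bank = [common] * 50 + [rare]
c = production_counts(bank)
print(scarcity(common, c))  # low score: frequent production
print(scarcity(rare, c))    # high score: contains a rare production
```

Under this kind of score, the training procedure could, for example, weight rare-but-informative configurations more carefully during EM and split-merge, which matches the trade-off the abstract describes between confidence and scarcity.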