Generalized Hierarchical Word Sequence Framework for Language Modeling

JCR Quartile: Q4 (Computer Science)
Xiaoyi Wu, Kevin Duh, Yuji Matsumoto
{"title":"用于语言建模的广义层次词序列框架","authors":"Xiaoyi Wu, Kevin Duh, Yuji Matsumoto","doi":"10.5715/JNLP.24.395","DOIUrl":null,"url":null,"abstract":"Language modeling is a fundamental research problem that has wide application for many NLP tasks. For estimating probabilities of natural language sentences, most research on language modeling use n-gram based approaches to factor sentence probabilities. However, the assumption under n-gram models is not robust enough to cope with the data sparseness problem, which affects the final performance of language models. In this paper, we propose a generalized hierarchical word sequence framework, where different word association scores can be adopted to rearrange word sequences in a totally unsupervised fashion. Unlike the n-gram which factors sentence probability from left-to-right, our model factors using a more flexible strategy. For evaluation, we compare our rearranged word sequences to normal n-gram word sequences. Both intrinsic and extrinsic experiments verify that our language model can achieve better performance, proving that our method can be considered as a better alternative for n-gram language models.","PeriodicalId":16243,"journal":{"name":"Journal of Information Processing","volume":"24 1","pages":"395-419"},"PeriodicalIF":0.0000,"publicationDate":"2017-06-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Generalized Hierarchical Word Sequence Framework for Language Modeling\",\"authors\":\"Xiaoyi Wu, Kevin Duh, Yuji Matsumoto\",\"doi\":\"10.5715/JNLP.24.395\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Language modeling is a fundamental research problem that has wide application for many NLP tasks. For estimating probabilities of natural language sentences, most research on language modeling use n-gram based approaches to factor sentence probabilities. However, the assumption under n-gram models is not robust enough to cope with the data sparseness problem, which affects the final performance of language models. In this paper, we propose a generalized hierarchical word sequence framework, where different word association scores can be adopted to rearrange word sequences in a totally unsupervised fashion. Unlike the n-gram which factors sentence probability from left-to-right, our model factors using a more flexible strategy. For evaluation, we compare our rearranged word sequences to normal n-gram word sequences. 
Both intrinsic and extrinsic experiments verify that our language model can achieve better performance, proving that our method can be considered as a better alternative for n-gram language models.\",\"PeriodicalId\":16243,\"journal\":{\"name\":\"Journal of Information Processing\",\"volume\":\"24 1\",\"pages\":\"395-419\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2017-06-15\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Information Processing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.5715/JNLP.24.395\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q4\",\"JCRName\":\"Computer Science\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Information Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.5715/JNLP.24.395","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"Computer Science","Score":null,"Total":0}
Citations: 0

Abstract

Language modeling is a fundamental research problem with wide application in many NLP tasks. To estimate the probabilities of natural language sentences, most research on language modeling uses n-gram based approaches to factor sentence probabilities. However, the assumption underlying n-gram models is not robust enough to cope with the data sparseness problem, which affects the final performance of language models. In this paper, we propose a generalized hierarchical word sequence framework in which different word association scores can be adopted to rearrange word sequences in an entirely unsupervised fashion. Unlike the n-gram model, which factors sentence probability from left to right, our model factors it using a more flexible strategy. For evaluation, we compare our rearranged word sequences to normal n-gram word sequences. Both intrinsic and extrinsic experiments verify that our language model achieves better performance, showing that our method can be considered a better alternative to n-gram language models.
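To make the contrast concrete: an n-gram model factors a sentence probability left to right as P(w_1, …, w_m) ≈ ∏_i P(w_i | w_{i−n+1}, …, w_{i−1}), while the hierarchical word sequence idea first rearranges the words into a tree using a word association score and conditions each word on its tree context instead. The Python sketch below is illustrative only: the frequency-based association score and the greedy head-selection rule are stand-ins, since the paper's framework admits arbitrary unsupervised association measures and its exact construction may differ.

# Illustrative sketch (not the paper's exact algorithm): contrast the
# left-to-right n-gram factorization with a hierarchical rearrangement
# driven by a word-association score.

def ngram_contexts(sentence, n=2):
    """Standard n-gram factorization: each word is conditioned on the
    (n - 1) words immediately to its left."""
    padded = ["<s>"] * (n - 1) + sentence
    return [(tuple(padded[i:i + n - 1]), padded[i + n - 1])
            for i in range(len(sentence))]

def hierarchical_contexts(sentence, assoc):
    """Hypothetical hierarchical factorization: pick the most strongly
    associated word as the local head, then recurse on the left and
    right sub-sequences, conditioning each head on its parent."""
    contexts = []

    def recurse(words, parent):
        if not words:
            return
        # assoc is a stand-in for any unsupervised association score
        k = max(range(len(words)), key=lambda i: assoc(words[i]))
        contexts.append((parent, words[k]))
        recurse(words[:k], words[k])
        recurse(words[k + 1:], words[k])

    recurse(sentence, "<root>")
    return contexts

sentence = "the cat sat on the mat".split()
freq = {"the": 3, "on": 2, "cat": 1, "sat": 1, "mat": 1}  # toy counts
print(ngram_contexts(sentence, n=2))
print(hierarchical_contexts(sentence, assoc=lambda w: freq.get(w, 0)))

Note how the two factorizations produce different conditioning contexts over the same words: the n-gram version always conditions on immediate left neighbors, whereas the hierarchical version can pair words that are far apart in surface order, which is the property the abstract credits with mitigating data sparseness.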