Generalized Hierarchical Word Sequence Framework for Language Modeling

JCR Quartile: Q4 (Computer Science)
Xiaoyi Wu, Kevin Duh, Yuji Matsumoto
{"title":"用于语言建模的广义层次词序列框架","authors":"Xiaoyi Wu, Kevin Duh, Yuji Matsumoto","doi":"10.5715/JNLP.24.395","DOIUrl":null,"url":null,"abstract":"Language modeling is a fundamental research problem that has wide application for many NLP tasks. For estimating probabilities of natural language sentences, most research on language modeling use n-gram based approaches to factor sentence probabilities. However, the assumption under n-gram models is not robust enough to cope with the data sparseness problem, which affects the final performance of language models. In this paper, we propose a generalized hierarchical word sequence framework, where different word association scores can be adopted to rearrange word sequences in a totally unsupervised fashion. Unlike the n-gram which factors sentence probability from left-to-right, our model factors using a more flexible strategy. For evaluation, we compare our rearranged word sequences to normal n-gram word sequences. Both intrinsic and extrinsic experiments verify that our language model can achieve better performance, proving that our method can be considered as a better alternative for n-gram language models.","PeriodicalId":16243,"journal":{"name":"Journal of Information Processing","volume":"24 1","pages":"395-419"},"PeriodicalIF":0.0000,"publicationDate":"2017-06-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Generalized Hierarchical Word Sequence Framework for Language Modeling\",\"authors\":\"Xiaoyi Wu, Kevin Duh, Yuji Matsumoto\",\"doi\":\"10.5715/JNLP.24.395\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Language modeling is a fundamental research problem that has wide application for many NLP tasks. For estimating probabilities of natural language sentences, most research on language modeling use n-gram based approaches to factor sentence probabilities. However, the assumption under n-gram models is not robust enough to cope with the data sparseness problem, which affects the final performance of language models. In this paper, we propose a generalized hierarchical word sequence framework, where different word association scores can be adopted to rearrange word sequences in a totally unsupervised fashion. Unlike the n-gram which factors sentence probability from left-to-right, our model factors using a more flexible strategy. For evaluation, we compare our rearranged word sequences to normal n-gram word sequences. 
Both intrinsic and extrinsic experiments verify that our language model can achieve better performance, proving that our method can be considered as a better alternative for n-gram language models.\",\"PeriodicalId\":16243,\"journal\":{\"name\":\"Journal of Information Processing\",\"volume\":\"24 1\",\"pages\":\"395-419\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2017-06-15\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Information Processing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.5715/JNLP.24.395\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q4\",\"JCRName\":\"Computer Science\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Information Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.5715/JNLP.24.395","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"Computer Science","Score":null,"Total":0}
Citations: 0

Abstract

Language modeling is a fundamental research problem with wide application in many NLP tasks. To estimate the probabilities of natural language sentences, most research on language modeling uses n-gram based approaches to factor sentence probabilities. However, the assumption underlying n-gram models is not robust enough to cope with the data sparseness problem, which affects the final performance of language models. In this paper, we propose a generalized hierarchical word sequence framework in which different word association scores can be adopted to rearrange word sequences in an entirely unsupervised fashion. Unlike the n-gram model, which factors sentence probability from left to right, our model factors it using a more flexible strategy. For evaluation, we compare our rearranged word sequences to normal n-gram word sequences. Both intrinsic and extrinsic experiments verify that our language model achieves better performance, showing that our method can be considered a better alternative to n-gram language models.
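To make the contrast concrete: an n-gram model factors a sentence probability left to right as P(w_1, …, w_m) ≈ ∏_i P(w_i | w_{i−n+1}, …, w_{i−1}), while the hierarchical word sequence idea first rearranges the words into a tree using a word association score and conditions each word on its tree context instead. The Python sketch below is illustrative only: the frequency-based association score and the greedy head-selection rule are stand-ins, since the paper's framework admits arbitrary unsupervised association measures and its exact construction may differ.

# Illustrative sketch (not the paper's exact algorithm): contrast the
# left-to-right n-gram factorization with a hierarchical rearrangement
# driven by a word-association score.

def ngram_contexts(sentence, n=2):
    """Standard n-gram factorization: each word is conditioned on the
    (n - 1) words immediately to its left."""
    padded = ["<s>"] * (n - 1) + sentence
    return [(tuple(padded[i:i + n - 1]), padded[i + n - 1])
            for i in range(len(sentence))]

def hierarchical_contexts(sentence, assoc):
    """Hypothetical hierarchical factorization: pick the most strongly
    associated word as the local head, then recurse on the left and
    right sub-sequences, conditioning each head on its parent."""
    contexts = []

    def recurse(words, parent):
        if not words:
            return
        # assoc is a stand-in for any unsupervised association score
        k = max(range(len(words)), key=lambda i: assoc(words[i]))
        contexts.append((parent, words[k]))
        recurse(words[:k], words[k])
        recurse(words[k + 1:], words[k])

    recurse(sentence, "<root>")
    return contexts

sentence = "the cat sat on the mat".split()
freq = {"the": 3, "on": 2, "cat": 1, "sat": 1, "mat": 1}  # toy counts
print(ngram_contexts(sentence, n=2))
print(hierarchical_contexts(sentence, assoc=lambda w: freq.get(w, 0)))

Note how the two factorizations produce different conditioning contexts over the same words: the n-gram version always conditions on immediate left neighbors, whereas the hierarchical version can pair words that are far apart in surface order, which is the property the abstract credits with mitigating data sparseness.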