The Combinatorial Analysis of n-Gram Dictionaries, Coverage and Information Entropy Based on the Web Corpus of English

Balt. J. Mod. Comput. Pub Date: 2021-06-07 DOI: 10.21203/RS.3.RS-237508/V2

A. Malashina





The Combinatorial Analysis of n-Gram Dictionaries, Coverage and Information Entropy based on the Web Corpus of English

We study n-gram dictionaries and estimate their coverage and entropy using a web corpus of English. We consider a method for estimating the coverage of empirically generated dictionaries and an approach to compensating for low coverage. Following Kolmogorov’s combinatorial approach, we estimate the n-gram entropy of English and use mathematical extrapolation to approximate the marginal entropy. In addition, we approximate the number of all possible legal n-grams in English for large n.
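To make the quantities named in the abstract concrete, here is a minimal Python sketch. It is not the paper's method: it assumes character-level n-grams, reads "coverage" as the fraction of n-gram occurrences in a held-out text that are found in the dictionary, and uses the combinatorial entropy estimate log2(|D_n|) / n, where D_n is the set of distinct observed n-grams. The function names and the toy strings are hypothetical, and the extrapolation step mentioned in the abstract is not modeled.

```python
import math


def ngram_dictionary(text: str, n: int) -> set[str]:
    """Return the set of distinct character n-grams observed in text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def coverage(dictionary: set[str], text: str, n: int) -> float:
    """Fraction of n-gram occurrences in text that appear in dictionary."""
    total = len(text) - n + 1
    if total <= 0:
        return 0.0
    hits = sum(1 for i in range(total) if text[i:i + n] in dictionary)
    return hits / total


def combinatorial_entropy(dictionary: set[str], n: int) -> float:
    """Per-symbol combinatorial entropy in bits: log2(|D_n|) / n."""
    if not dictionary:
        return 0.0
    return math.log2(len(dictionary)) / n


# Toy stand-ins for a real web corpus (assumption: illustration only).
train = "the quick brown fox jumps over the lazy dog "
held_out = "the lazy fox jumps over the quick dog "

for n in (1, 2, 3):
    d = ngram_dictionary(train, n)
    print(n, len(d), round(coverage(d, held_out, n), 3),
          round(combinatorial_entropy(d, n), 3))
```

On a corpus this small the dictionary misses many legal n-grams as n grows, so the held-out coverage drops. That is the low-coverage problem the abstract refers to.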
