The Statistical Independence for Words in Printed Romanian Language

2020 13th International Conference on Communications (COMM). Pub Date: 2020-06-01. DOI: 10.1109/COMM48946.2020.9142045

Alexandru Dinu, A. Vlad, Bogdan Hanu, A. Mitrea


Citations: 0

Abstract


The paper revisits the notion of statistical independence for printed Romanian when the language is considered as a chain of words. The analysis is carried out on a literary corpus of approximately 6 million words. We aim to improve the understanding of statistical independence for natural texts and to use this concept to evaluate the numerical properties of the printed language. One main objective is to estimate the minimum distance, in words, that ensures statistical independence. Here, we follow up on an idea previously researched by the authors: the investigation of statistical independence for m-grams (m successive letters). The previous results showed that a distance of 100 characters is enough to ensure statistical independence for letter m-grams (m = 1, 2, 3), for both the 32-symbol corpus and the 47-symbol corpus. In the present research, we observed that 100 words can be considered sufficient in practice as the minimum sampling distance for statistical independence. Because the number of distinct words is very large, we investigated in detail the creation of one or more Artificial Words, each consisting of a group of low-probability words (based on previous findings on the type II statistical error in word probability investigation); the results support the above-mentioned minimum statistical independence distance.
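To make the idea of a minimum sampling distance more concrete, the following is a minimal sketch of how independence between words d positions apart might be checked, with rare words pooled into a single artificial word. The abstract does not say which statistical test the authors use; a Pearson chi-square statistic on a contingency table is assumed here purely for illustration. The names `pool_rare_words`, `chi_square_at_distance`, `top_k`, and the `<ARTIFICIAL>` label are hypothetical and do not come from the paper.

```python
from collections import Counter

ARTIFICIAL = "<ARTIFICIAL>"  # hypothetical label for the pooled low-probability words


def pool_rare_words(words, top_k):
    """Keep the top_k most frequent words; map every other word to one artificial word."""
    frequent = {w for w, _ in Counter(words).most_common(top_k)}
    return [w if w in frequent else ARTIFICIAL for w in words]


def chi_square_at_distance(words, d, top_k):
    """Return (chi2, dof) for pairs of words observed d positions apart.

    Assumption: a Pearson chi-square test of independence on the
    contingency table of (first word, second word) pairs.
    """
    symbols = pool_rare_words(words, top_k)
    # Disjoint pairs (s[i], s[i + d]); stepping by d + 1 means no token is reused.
    pairs = [(symbols[i], symbols[i + d])
             for i in range(0, len(symbols) - d, d + 1)]
    if not pairs:
        raise ValueError("text is too short for the requested distance")

    observed = Counter(pairs)
    row_totals = Counter(x for x, _ in pairs)
    col_totals = Counter(y for _, y in pairs)
    n = len(pairs)

    chi2 = 0.0
    for x, row_count in row_totals.items():
        for y, col_count in col_totals.items():
            expected = row_count * col_count / n  # count expected under independence
            chi2 += (observed[(x, y)] - expected) ** 2 / expected

    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, dof
```

Under these assumptions, one would compute the statistic for increasing values of `d` and compare it with the chi-square critical value for `dof` degrees of freedom at a chosen significance level; the smallest `d` at which independence is no longer rejected gives an estimate of the minimum sampling distance. Pooling rare words keeps the expected cell counts large enough for the test to be meaningful.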
