The Statistical Independence for Words in Printed Romanian Language
Alexandru Dinu, A. Vlad, Bogdan Hanu, A. Mitrea
2020 13th International Conference on Communications (COMM), 2020-06-01
DOI: 10.1109/COMM48946.2020.9142045 (https://doi.org/10.1109/COMM48946.2020.9142045)
The paper revisits the notion of statistical independence for printed Romanian when the language is treated as a chain of words. The analysis is carried out on a literary corpus of approximately 6 million words. We aim to improve the understanding of statistical independence for natural texts and to use this concept to evaluate the numerical properties of the printed language. One main objective is to estimate the minimum distance, in words, that ensures statistical independence. Here we follow up on an idea the authors researched previously: the investigation of statistical independence for m-grams (m successive letters). The previous results showed that 100 characters are enough to ensure statistical independence for letter m-grams (m = 1, 2, 3), both for the 32-symbol corpus and for the 47-symbol corpus. In the present research, we observe that 100 words can be considered practically sufficient as the minimum sampling distance for statistical independence. Because the number of distinct words is very large, detailed investigations were also conducted on creating one or more Artificial Words, each consisting of a group of low-probability words (based on previous findings on the type II statistical error in word probability investigation); the results support the above-mentioned minimum statistical independence distance.
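The abstract does not say which statistical test the authors applied, so the following is only an illustrative sketch of the general idea, assuming a Pearson chi-square test of independence on pairs of words taken d positions apart. The function names, the `<AW>` token, and the `min_count` threshold are hypothetical and do not come from the paper; grouping by raw count is a stand-in for the authors' grouping of low-probability words into Artificial Words.

```python
from collections import Counter


def group_rare_words(words, min_count, artificial="<AW>"):
    """Replace every word seen fewer than min_count times with one
    artificial token (hypothetical stand-in for an Artificial Word)."""
    counts = Counter(words)
    return [w if counts[w] >= min_count else artificial for w in words]


def pairs_at_distance(words, d, step=1):
    """Return pairs (words[i], words[i + d]) for i = 0, step, 2*step, ...
    With step=1 the pairs overlap and are not independent samples;
    a larger step reduces that overlap."""
    return [(words[i], words[i + d]) for i in range(0, len(words) - d, step)]


def chi_square_independence(pairs):
    """Pearson chi-square statistic and degrees of freedom for the
    contingency table built from (first word, second word) pairs."""
    n = len(pairs)
    joint = Counter(pairs)
    first = Counter(a for a, _ in pairs)
    second = Counter(b for _, b in pairs)
    stat = 0.0
    for a, count_a in first.items():
        for b, count_b in second.items():
            expected = count_a * count_b / n  # expected count under independence
            observed = joint.get((a, b), 0)
            stat += (observed - expected) ** 2 / expected
    dof = (len(first) - 1) * (len(second) - 1)
    return stat, dof
```

In use, one would tokenize the corpus, call `group_rare_words`, and then compute the statistic for increasing values of `d`, comparing each result against the chi-square critical value for the returned degrees of freedom. The smallest `d` at which independence is no longer rejected would be a candidate for the minimum distance, subject to the assumptions stated above.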