{"title":"A Statistically Significant Test to Evaluate the Order or Disorder of a Binary String","authors":"J. Blackledge, N. Mosola","doi":"10.1109/ISSC49989.2020.9180178","DOIUrl":null,"url":null,"abstract":"This paper addresses a basic problem in regard to the analysis of a finite binary string or bit stream (of compact support), namely, how to tell whether the string is representative of non-random or intelligible information (involving some form of periodicity, for example), whether it is the product of an entirely random process or whether it is something in between the two. This problem has applications that include cryptanalysis, quantitative finance, machine learning, artificial intelligence and other forms of signal and image processing involving the general problem of how to distinguishing real noise from information embedded in noise, for example. After providing a short introduction to the problem, we focus on the application of information entropy for solving the problem given that this fundamental metric is an intrinsic measure on information in regard to some measurable system. A brief overview on the concept of entropy is given followed by examples of how algorithms can be design to compute the binary entropy of a finite binary string including important variations on a theme such as the BiEntropy. The problem with computing a single metric of this type is that it can be representative of similar binary strings and lacks robustness in terms of its statistically significance. For this reasons, the paper presents a solution to the problem that is based on the Kullback-Leibler Divergence (or Relative Entropy) which yields a measure of how one probability distribution is different from another reference probability distribution. By repeatedly computing this metric for different reference (simulated or otherwise) random finite binary strings, it is shown how the distribution of the resulting signal changes for intelligible and random binary strings of a finite extent. This allows a number of standard statistical metrics to be computed from which the foundations for a machine learning system can be developed. A limited number of results are present for different natural languages to illustrate the approach, a prototype MATLAB function being provide for interested readers to reproduce the results given as required, investigate different data sets and further develop the method considered.","PeriodicalId":351013,"journal":{"name":"2020 31st Irish Signals and Systems Conference (ISSC)","volume":"55 5","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 31st Irish Signals and Systems Conference (ISSC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ISSC49989.2020.9180178","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 1
Abstract
This paper addresses a basic problem in the analysis of a finite binary string or bit stream (of compact support), namely, how to tell whether the string is representative of non-random or intelligible information (involving some form of periodicity, for example), whether it is the product of an entirely random process, or whether it is something in between the two. This problem has applications that include cryptanalysis, quantitative finance, machine learning, artificial intelligence and other forms of signal and image processing involving the general problem of how to distinguish real noise from information embedded in noise, for example. After providing a short introduction to the problem, we focus on the application of information entropy for solving it, given that this fundamental metric is an intrinsic measure of information in regard to some measurable system. A brief overview of the concept of entropy is given, followed by examples of how algorithms can be designed to compute the binary entropy of a finite binary string, including important variations on the theme such as the BiEntropy. The problem with computing a single metric of this type is that it can be representative of similar binary strings and lacks robustness in terms of its statistical significance. For this reason, the paper presents a solution based on the Kullback-Leibler Divergence (or Relative Entropy), which yields a measure of how one probability distribution differs from a reference probability distribution. By repeatedly computing this metric for different reference (simulated or otherwise) random finite binary strings, it is shown how the distribution of the resulting signal changes for intelligible and random binary strings of finite extent. This allows a number of standard statistical metrics to be computed, from which the foundations for a machine learning system can be developed. A limited number of results are presented for different natural languages to illustrate the approach, with a prototype MATLAB function provided for interested readers to reproduce the results as required, investigate different data sets and further develop the method considered.
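The abstract describes two ingredients: a single-string entropy measure and a Kullback-Leibler comparison against repeated random reference strings. The sketch below illustrates this general idea in Python rather than reproducing the paper's prototype MATLAB function; the overlapping 4-bit block distribution, the number of reference trials, and the function names (binary_entropy, kl_against_random) are illustrative assumptions, not the authors' exact formulation (which also considers the BiEntropy).

```python
import math
import random
from collections import Counter

def binary_entropy(bits: str) -> float:
    """Shannon entropy (in bits) of the 0/1 symbol distribution of a binary string."""
    n = len(bits)
    counts = Counter(bits)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def block_distribution(bits: str, block: int = 4):
    """Empirical distribution over all overlapping length-`block` substrings
    (assumed choice of feature; the paper itself works with entropy variants)."""
    counts = Counter(bits[i:i + block] for i in range(len(bits) - block + 1))
    total = sum(counts.values())
    keys = [format(k, f"0{block}b") for k in range(2 ** block)]
    return [counts.get(k, 0) / total for k in keys]

def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler divergence D(p || q) between two discrete distributions."""
    return sum(pi * math.log2((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def kl_against_random(bits: str, trials: int = 100, block: int = 4, seed: int = 0):
    """Repeatedly compare the test string against simulated random reference strings,
    returning the sample of KL divergences from which summary statistics can be taken."""
    rng = random.Random(seed)
    p = block_distribution(bits, block)
    samples = []
    for _ in range(trials):
        ref = "".join(rng.choice("01") for _ in range(len(bits)))
        q = block_distribution(ref, block)
        samples.append(kl_divergence(p, q))
    return samples

if __name__ == "__main__":
    periodic = "01" * 256                                          # highly ordered string
    noise = "".join(random.choice("01") for _ in range(512))       # random string
    for name, s in [("periodic", periodic), ("random", noise)]:
        d = kl_against_random(s)
        print(f"{name}: entropy={binary_entropy(s):.3f}, mean KL vs random={sum(d)/len(d):.3f}")
```

Under these assumptions, an ordered (periodic) string yields a block distribution far from that of the random references, so its KL divergences are consistently large, while a genuinely random string yields divergences clustered near zero; the spread of the resulting sample is what supports the statistical interpretation outlined in the abstract.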