{"title":"A Statistically Significant Test to Evaluate the Order or Disorder of a Binary String","authors":"J. Blackledge, N. Mosola","doi":"10.1109/ISSC49989.2020.9180178","DOIUrl":null,"url":null,"abstract":"This paper addresses a basic problem in regard to the analysis of a finite binary string or bit stream (of compact support), namely, how to tell whether the string is representative of non-random or intelligible information (involving some form of periodicity, for example), whether it is the product of an entirely random process or whether it is something in between the two. This problem has applications that include cryptanalysis, quantitative finance, machine learning, artificial intelligence and other forms of signal and image processing involving the general problem of how to distinguishing real noise from information embedded in noise, for example. After providing a short introduction to the problem, we focus on the application of information entropy for solving the problem given that this fundamental metric is an intrinsic measure on information in regard to some measurable system. A brief overview on the concept of entropy is given followed by examples of how algorithms can be design to compute the binary entropy of a finite binary string including important variations on a theme such as the BiEntropy. The problem with computing a single metric of this type is that it can be representative of similar binary strings and lacks robustness in terms of its statistically significance. For this reasons, the paper presents a solution to the problem that is based on the Kullback-Leibler Divergence (or Relative Entropy) which yields a measure of how one probability distribution is different from another reference probability distribution. By repeatedly computing this metric for different reference (simulated or otherwise) random finite binary strings, it is shown how the distribution of the resulting signal changes for intelligible and random binary strings of a finite extent. This allows a number of standard statistical metrics to be computed from which the foundations for a machine learning system can be developed. A limited number of results are present for different natural languages to illustrate the approach, a prototype MATLAB function being provide for interested readers to reproduce the results given as required, investigate different data sets and further develop the method considered.","PeriodicalId":351013,"journal":{"name":"2020 31st Irish Signals and Systems Conference (ISSC)","volume":"55 5","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 31st Irish Signals and Systems Conference (ISSC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ISSC49989.2020.9180178","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 1
Abstract
This paper addresses a basic problem in the analysis of a finite binary string or bit stream (of compact support), namely, how to tell whether the string is representative of non-random or intelligible information (involving some form of periodicity, for example), whether it is the product of an entirely random process, or whether it is something in between the two. This problem has applications that include cryptanalysis, quantitative finance, machine learning, artificial intelligence and other forms of signal and image processing involving the general problem of how to distinguish real noise from information embedded in noise, for example. After providing a short introduction to the problem, we focus on the application of information entropy for solving it, given that this fundamental metric is an intrinsic measure of information in regard to some measurable system. A brief overview of the concept of entropy is given, followed by examples of how algorithms can be designed to compute the binary entropy of a finite binary string, including important variations on the theme such as the BiEntropy. The problem with computing a single metric of this type is that it can be representative of similar binary strings and lacks robustness in terms of its statistical significance. For this reason, the paper presents a solution based on the Kullback-Leibler Divergence (or Relative Entropy), which yields a measure of how one probability distribution differs from a reference probability distribution. By repeatedly computing this metric for different reference (simulated or otherwise) random finite binary strings, it is shown how the distribution of the resulting signal changes for intelligible and random binary strings of finite extent. This allows a number of standard statistical metrics to be computed, from which the foundations for a machine learning system can be developed. A limited number of results are presented for different natural languages to illustrate the approach, with a prototype MATLAB function provided for interested readers to reproduce the results as required, investigate different data sets and further develop the method considered.
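The abstract describes two ingredients: a single-string entropy measure and a Kullback-Leibler comparison against repeated random reference strings. The sketch below illustrates this general idea in Python rather than reproducing the paper's prototype MATLAB function; the overlapping 4-bit block distribution, the number of reference trials, and the function names (binary_entropy, kl_against_random) are illustrative assumptions, not the authors' exact formulation (which also considers the BiEntropy).

```python
import math
import random
from collections import Counter

def binary_entropy(bits: str) -> float:
    """Shannon entropy (in bits) of the 0/1 symbol distribution of a binary string."""
    n = len(bits)
    counts = Counter(bits)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def block_distribution(bits: str, block: int = 4):
    """Empirical distribution over all overlapping length-`block` substrings
    (assumed choice of feature; the paper itself works with entropy variants)."""
    counts = Counter(bits[i:i + block] for i in range(len(bits) - block + 1))
    total = sum(counts.values())
    keys = [format(k, f"0{block}b") for k in range(2 ** block)]
    return [counts.get(k, 0) / total for k in keys]

def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler divergence D(p || q) between two discrete distributions."""
    return sum(pi * math.log2((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def kl_against_random(bits: str, trials: int = 100, block: int = 4, seed: int = 0):
    """Repeatedly compare the test string against simulated random reference strings,
    returning the sample of KL divergences from which summary statistics can be taken."""
    rng = random.Random(seed)
    p = block_distribution(bits, block)
    samples = []
    for _ in range(trials):
        ref = "".join(rng.choice("01") for _ in range(len(bits)))
        q = block_distribution(ref, block)
        samples.append(kl_divergence(p, q))
    return samples

if __name__ == "__main__":
    periodic = "01" * 256                                          # highly ordered string
    noise = "".join(random.choice("01") for _ in range(512))       # random string
    for name, s in [("periodic", periodic), ("random", noise)]:
        d = kl_against_random(s)
        print(f"{name}: entropy={binary_entropy(s):.3f}, mean KL vs random={sum(d)/len(d):.3f}")
```

Under these assumptions, an ordered (periodic) string yields a block distribution far from that of the random references, so its KL divergences are consistently large, while a genuinely random string yields divergences clustered near zero; the spread of the resulting sample is what supports the statistical interpretation outlined in the abstract.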