The complexity of DNA sequences. Different approaches and definitions

Mathematical Biology and Bioinformatics Pub Date : 2020-11-30 DOI:10.17537/2020.15.313

V. Gusev, L. A. Miroshnichenko

Abstract

An important quantitative characteristic of symbolic sequences (texts, strings) is complexity, which reflects, at an intuitive level, the degree of their "non-randomness". A.N. Kolmogorov formulated the most general definition of complexity: he proposed measuring the complexity of an object (a symbolic sequence) by the length of the shortest description from which the object can be uniquely reconstructed. Since no program is guaranteed to find the shortest description, various algorithmic approximations are used in practice, and these are considered in this paper. Along with definitions of complexity that presuppose the possibility of reconstructing a sequence from its "description", a number of measures are considered that do not imply such reconstruction; they are based on the calculation of certain quantitative characteristics. Of interest is not only the quantitative assessment of complexity but also the identification and classification of the structural regularities that determine its specific value. In one form or another, these regularities are manifestations of repetition in the broadest sense. The measures of complexity considered are conventionally divided into three groups: statistical measures, which take into account the frequency of occurrence of symbols or short "words" in the text; "dictionary" measures, which estimate the number of different "subwords"; and "structural" measures, which are based on identifying long repeated fragments of the text and determining the relationships between them. Most of the methods are designed for sequences of arbitrary linguistic nature. The special attention paid to DNA sequences, reflected in the title of the article, is due to the importance of the object, the presence of repetitions of different types, and the numerous examples of the concept of complexity being used to solve problems of classification and evolution of various biological objects. Local structural features found in DNA sequences by sliding-window analysis are of considerable interest, since zones of low complexity in the genomes of various organisms are often associated with the regulation of basic genetic processes.
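The three groups of measures and the sliding-window mode mentioned in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the paper's exact definitions, so the choices below are assumptions made for illustration only: Shannon entropy of k-mer frequencies stands in for a statistical measure, the count of distinct length-k subwords for a "dictionary" measure, and the number of phrases in a Lempel-Ziv (1976) parsing for a repeat-based approximation of the shortest description. All function names are hypothetical.

```python
import math
from collections import Counter


def kmer_entropy(seq: str, k: int = 1) -> float:
    """Statistical measure (assumed): Shannon entropy, in bits, of k-mer frequencies."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return 0.0
    total = len(kmers)
    counts = Counter(kmers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def distinct_subwords(seq: str, k: int) -> int:
    """'Dictionary' measure (assumed): number of distinct subwords of length k."""
    return len({seq[i:i + k] for i in range(len(seq) - k + 1)})


def lz76_phrases(seq: str) -> int:
    """Repeat-based measure (assumed): number of phrases in a Lempel-Ziv 1976 parsing.

    Each phrase is the longest fragment that can be copied from an occurrence
    starting earlier in the sequence (overlap allowed), plus one new symbol.
    """
    i, n, count = 0, len(seq), 0
    while i < n:
        length = 1
        # Grow the candidate while it already occurs starting before position i.
        while i + length <= n and seq.find(seq[i:i + length], 0, i + length - 1) != -1:
            length += 1
        count += 1
        i += length
    return count


def sliding_profile(seq: str, window: int, step: int, measure):
    """Apply a complexity measure to each window; returns (start, value) pairs."""
    return [
        (start, measure(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


if __name__ == "__main__":
    # Toy sequence: a mixed region followed by a poly-A run (a low-complexity zone).
    dna = "ACGTTGCAAGCTTAGGCTAC" + "A" * 20
    for start, value in sliding_profile(dna, window=10, step=5, measure=lz76_phrases):
        print(start, value)
```

In this sketch a low-complexity zone shows up as windows with few Lempel-Ziv phrases, low entropy, or few distinct subwords. The window length and step are free parameters and would have to be tuned for real genomic data.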

