About Similarity Measures of Components Arrangement of Naturally Ordered Data Arrays

Q3 Mathematics
A. Gumenyuk, A. Skiba, N. Pozdnichenko, S. Shpynov
Journal: SPIIRAS Proceedings
Published: 2019-04-12
DOI: 10.15622/SP.18.2.471-503 (https://doi.org/10.15622/SP.18.2.471-503)
Citations: 3

Abstract

At present, adequate mathematical tools are not applied to analyzing the arrangement of components in arrays of naturally ordered data of various kinds, including words or letters in texts, notes in musical compositions, symbols in sign sequences, monitoring data, numbers representing ordered measurement results, and components of genetic texts. It is therefore difficult or impossible to measure and compare the order of messages located in long information chains. The main approaches to comparing symbol sequences rely on probabilistic models and statistical tools, and on pairwise and multiple alignment, which makes it possible to determine the degree of similarity of sequences using edit distance measures. The use of pseudospectral and fractal representations of symbolic sequences remains somewhat exotic. Special note should be taken of "the curse of a priori unconscious knowledge" of the obvious orderliness of a sequence, which is widespread in mathematical linguistics, bioinformatics (mathematical biology), and related fields. These approaches pay almost no attention to studying and detecting patterns in the specific arrangement of all the symbols, words, and data components that constitute an individual sequence. The object of study in our work is a specifically organized numerical tuple: the arrangement of components (the order) in a symbolic or numerical sequence. The intervals between the closest identical components of the order serve as the basis for a quantitative representation of the chain arrangement. Multiplying all the intervals, or summing their logarithms, yields numbers that uniquely reflect the arrangement of components in a particular sequence. These numbers make it possible to obtain a whole set of normalized characteristics of the order, among them the geometric mean interval and its logarithm. Such characteristics reflect the arrangement of the components in symbolic sequences with surprising accuracy.
In this paper, we present an approach to quantitatively comparing the arrangement of arrays of naturally ordered data (information chains) of an arbitrary nature. We propose measures of similarity and distinction, together with a procedure for comparing chain order that is based on selecting a list of subsequences (components) whose order characteristics are equal or similar. Rank distributions are used to speed up the selection of matching components. The paper presents a toolkit for comparing the order of information chains and demonstrates some of its applications to studying the structure of nucleotide sequences.
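The interval-based characteristics described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' toolkit. The function names are hypothetical, base-2 logarithms are an arbitrary choice, and the abstract does not say how the first occurrence of a component is handled; here its interval is assumed to be measured from a virtual position 0 just before the sequence.

```python
import math
from collections import defaultdict


def intervals_by_component(sequence):
    """Return, for each distinct component, the intervals between its
    closest identical occurrences (positions are 1-based).

    Assumption (not stated in the abstract): the interval of a first
    occurrence is its distance from a virtual position 0.
    """
    last_position = {}
    intervals = defaultdict(list)
    for position, component in enumerate(sequence, start=1):
        intervals[component].append(position - last_position.get(component, 0))
        last_position[component] = position
    return dict(intervals)


def order_characteristics(sequence):
    """Compute the log-sum of intervals and the geometric mean interval
    for every component of the sequence."""
    result = {}
    for component, gaps in intervals_by_component(sequence).items():
        # Summing logarithms is equivalent to multiplying the intervals,
        # but avoids overflow on long chains.
        log_sum = sum(math.log2(gap) for gap in gaps)
        mean_log = log_sum / len(gaps)
        result[component] = {
            "intervals": gaps,
            "log_sum": log_sum,
            "mean_log_interval": mean_log,
            "geometric_mean_interval": 2.0 ** mean_log,
        }
    return result


if __name__ == "__main__":
    # Illustrative nucleotide-like chain; the data is made up.
    for component, values in order_characteristics("ACGTACCA").items():
        print(component, values)
```

Working with the sum of logarithms rather than the raw product keeps the values in a manageable range for long sequences, and the geometric mean interval is recovered by exponentiating the mean logarithm.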
Source journal
SPIIRAS Proceedings (Mathematics: Applied Mathematics)
CiteScore: 1.90
Self-citation rate: 0.00%
Articles published: 0
Review time: 14 weeks
About the journal: SPIIRAS Proceedings publishes scientific, educational, and popular-science papers on computer science, automation, applied mathematics, interdisciplinary research, and information technology, as well as the theoretical foundations of computer science (both mathematical and related to other scientific disciplines), information security and information protection, decision making and artificial intelligence, mathematical modeling, and informatization.