A. Gumenyuk, A. Skiba, N. Pozdnichenko, S. Shpynov
{"title":"自然有序数据数组中组件排列的相似性度量","authors":"A. Gumenyuk, A. Skiba, N. Pozdnichenko, S. Shpynov","doi":"10.15622/SP.18.2.471-503","DOIUrl":null,"url":null,"abstract":"At present, adequate mathematical tools are not used to analyze the arrangement of components in arrays of naturally ordered data of a different nature, including words or letters in texts, notes in musical compositions, symbols in sign sequences, monitoring data, numbers representing ordered measurement results, components in genetic texts. Therefore, it is difficult or impossible to measure and compare the order of messages allocated in long information chains. The main approaches for comparing symbol sequences are using probabilistic models and statistical tools, pairwise and multiple alignment, which makes it possible to determine the degree of similarity of sequences using edit distance measures. The application of pseudospectral and fractal representation of symbolic sequences is somewhat exotic. \"The curse of a priori unconscious knowledge\" of the obvious orderliness of the sequence should be especially noticed, as it is widespread in mathematical linguistics, bioinformatics (mathematical biology), and other similar fields of science. The noted approaches almost do not pay attention to the study and detection of the patterns of the specific arrangement of all symbols, words, and components of data sets that constitute a separate sequence. The object of study in our works is a specifically organized numerical tuple – the arrangement of components (order) in symbolic or numerical sequence. The intervals between the closest identical components of the order are used as the basis for the quantitative representation of the chain arrangement. Multiplying all the intervals or summing their logarithms allows one to get numbers that uniquely reflect the arrangement of components in a particular sequence. These numbers, allow us to obtain a whole set of normalized characteristics of the order, among which the geometric mean interval and its logarithm. Such characteristics surprisingly accurately reflect the arrangement of the components in the symbolic sequences. In this paper, we present an approach for quantitative comparing the arrangement of arrays of naturally ordered data (information chains) of an arbitrary nature. The measures of similarity/distinction and procedure of comparison of the chain order, based on the selection of a list of equal and similar by the order characteristics of the subsequences (components), are proposed. Rank distributions are used for faster selection of a list of matching components. The paper presents a toolkit for comparing the order of information chains and demonstrates some of its applications for studying the structure of nucleotide sequences.","PeriodicalId":53447,"journal":{"name":"SPIIRAS Proceedings","volume":"27 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2019-04-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":"{\"title\":\"About Similarity Measures of Components Arrangement of Naturally Ordered Data Arrays\",\"authors\":\"A. Gumenyuk, A. Skiba, N. Pozdnichenko, S. Shpynov\",\"doi\":\"10.15622/SP.18.2.471-503\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"At present, adequate mathematical tools are not used to analyze the arrangement of components in arrays of naturally ordered data of a different nature, including words or letters in texts, notes in musical compositions, symbols in sign sequences, monitoring data, numbers representing ordered measurement results, components in genetic texts. Therefore, it is difficult or impossible to measure and compare the order of messages allocated in long information chains. The main approaches for comparing symbol sequences are using probabilistic models and statistical tools, pairwise and multiple alignment, which makes it possible to determine the degree of similarity of sequences using edit distance measures. The application of pseudospectral and fractal representation of symbolic sequences is somewhat exotic. \\\"The curse of a priori unconscious knowledge\\\" of the obvious orderliness of the sequence should be especially noticed, as it is widespread in mathematical linguistics, bioinformatics (mathematical biology), and other similar fields of science. The noted approaches almost do not pay attention to the study and detection of the patterns of the specific arrangement of all symbols, words, and components of data sets that constitute a separate sequence. The object of study in our works is a specifically organized numerical tuple – the arrangement of components (order) in symbolic or numerical sequence. The intervals between the closest identical components of the order are used as the basis for the quantitative representation of the chain arrangement. Multiplying all the intervals or summing their logarithms allows one to get numbers that uniquely reflect the arrangement of components in a particular sequence. These numbers, allow us to obtain a whole set of normalized characteristics of the order, among which the geometric mean interval and its logarithm. Such characteristics surprisingly accurately reflect the arrangement of the components in the symbolic sequences. In this paper, we present an approach for quantitative comparing the arrangement of arrays of naturally ordered data (information chains) of an arbitrary nature. The measures of similarity/distinction and procedure of comparison of the chain order, based on the selection of a list of equal and similar by the order characteristics of the subsequences (components), are proposed. Rank distributions are used for faster selection of a list of matching components. The paper presents a toolkit for comparing the order of information chains and demonstrates some of its applications for studying the structure of nucleotide sequences.\",\"PeriodicalId\":53447,\"journal\":{\"name\":\"SPIIRAS Proceedings\",\"volume\":\"27 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-04-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"3\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"SPIIRAS Proceedings\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.15622/SP.18.2.471-503\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"Mathematics\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"SPIIRAS Proceedings","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.15622/SP.18.2.471-503","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"Mathematics","Score":null,"Total":0}
About Similarity Measures of Components Arrangement of Naturally Ordered Data Arrays
At present, adequate mathematical tools are not used to analyze the arrangement of components in arrays of naturally ordered data of a different nature, including words or letters in texts, notes in musical compositions, symbols in sign sequences, monitoring data, numbers representing ordered measurement results, components in genetic texts. Therefore, it is difficult or impossible to measure and compare the order of messages allocated in long information chains. The main approaches for comparing symbol sequences are using probabilistic models and statistical tools, pairwise and multiple alignment, which makes it possible to determine the degree of similarity of sequences using edit distance measures. The application of pseudospectral and fractal representation of symbolic sequences is somewhat exotic. "The curse of a priori unconscious knowledge" of the obvious orderliness of the sequence should be especially noticed, as it is widespread in mathematical linguistics, bioinformatics (mathematical biology), and other similar fields of science. The noted approaches almost do not pay attention to the study and detection of the patterns of the specific arrangement of all symbols, words, and components of data sets that constitute a separate sequence. The object of study in our works is a specifically organized numerical tuple – the arrangement of components (order) in symbolic or numerical sequence. The intervals between the closest identical components of the order are used as the basis for the quantitative representation of the chain arrangement. Multiplying all the intervals or summing their logarithms allows one to get numbers that uniquely reflect the arrangement of components in a particular sequence. These numbers, allow us to obtain a whole set of normalized characteristics of the order, among which the geometric mean interval and its logarithm. Such characteristics surprisingly accurately reflect the arrangement of the components in the symbolic sequences. In this paper, we present an approach for quantitative comparing the arrangement of arrays of naturally ordered data (information chains) of an arbitrary nature. The measures of similarity/distinction and procedure of comparison of the chain order, based on the selection of a list of equal and similar by the order characteristics of the subsequences (components), are proposed. Rank distributions are used for faster selection of a list of matching components. The paper presents a toolkit for comparing the order of information chains and demonstrates some of its applications for studying the structure of nucleotide sequences.
期刊介绍:
The SPIIRAS Proceedings journal publishes scientific, scientific-educational, scientific-popular papers relating to computer science, automation, applied mathematics, interdisciplinary research, as well as information technology, the theoretical foundations of computer science (such as mathematical and related to other scientific disciplines), information security and information protection, decision making and artificial intelligence, mathematical modeling, informatization.