Algorithms for Molecular Biology: Latest Articles (Page 8)

gsufsort: constructing suffix arrays, LCP arrays and BWTs for string collections.

IF 1, Zone 4 (Biology)

Algorithms for Molecular Biology Pub Date : 2020-09-22 eCollection Date: 2020-01-01 DOI: 10.1186/s13015-020-00177-y

Felipe A Louza, Guilherme P Telles, Simon Gog, Nicola Prezza, Giovanna Rosone

Abstract:

Background: The construction of a suffix array for a collection of strings is a fundamental task in Bioinformatics and in many other applications that process strings. Related data structures, such as the Longest Common Prefix array, the Burrows-Wheeler transform, and the document array, are often needed to accompany the suffix array to efficiently solve a wide variety of problems. While several algorithms have been proposed to construct the suffix array for a single string, less emphasis has been put on algorithms to construct suffix arrays for string collections.

Result: In this paper we introduce gsufsort, an open source software for constructing the suffix array and related data indexing structures for a string collection with N symbols in O(N) time. Our tool is written in ANSI/C and is based on the algorithm gSACA-K (Louza et al. in Theor Comput Sci 678:22-39, 2017), the fastest algorithm to construct suffix arrays for string collections. The tool supports large fasta, fastq and text files with multiple strings as input. Experiments have shown very good performance on different types of strings.

Conclusions: gsufsort is a fast, portable, and lightweight tool for constructing the suffix array and additional data structures for string collections.

Volume 15, article 18. Journal Article.
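To make the data structures named in the abstract concrete, here is a minimal sketch that builds a generalized suffix array, LCP array, document array and BWT for a small collection. It is a naive sort-based illustration, not gSACA-K and not gsufsort's code; the function name `generalized_index` and the choice to render every separator as `$` are assumptions made for this example.

```python
def generalized_index(strings):
    """Naive suffix array, LCP array, document array and BWT for a collection.

    Illustration only: sorting suffix slices is far slower than the
    linear-time construction described in the abstract.
    """
    text, doc_of = [], []
    for d, s in enumerate(strings):
        for ch in s:
            text.append((1, ord(ch)))   # ordinary symbol
            doc_of.append(d)
        text.append((0, d))             # separator: sorts before all letters,
        doc_of.append(d)                # earlier strings' separators sort first
    n = len(text)

    # Suffix array: start positions sorted by the suffix they begin.
    sa = sorted(range(n), key=lambda i: text[i:])

    # LCP array: common prefix length of each suffix with its predecessor.
    # Separators are pairwise distinct, so a match never runs across strings.
    lcp = [0] * n
    for r in range(1, n):
        a, b = sa[r - 1], sa[r]
        k = 0
        while a + k < n and b + k < n and text[a + k] == text[b + k]:
            k += 1
        lcp[r] = k

    # Document array: which input string each suffix starts in.
    da = [doc_of[i] for i in sa]

    # BWT: the symbol preceding each suffix (position 0 wraps to the end).
    def show(sym):
        return chr(sym[1]) if sym[0] == 1 else "$"
    bwt = "".join(show(text[i - 1]) for i in sa)
    return sa, lcp, da, bwt


sa, lcp, da, bwt = generalized_index(["ab", "a"])
print(sa, lcp, da, bwt)
```

For the two strings `ab` and `a`, the concatenation is `ab$a$`, and the document array records that the suffixes at ranks 1 and 2 start in the second string.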

Citations: 11

A linear-time algorithm that avoids inverses and computes Jackknife (leave-one-out) products like convolutions or other operators in commutative semigroups.

IF 1, Zone 4 (Biology)

Algorithms for Molecular Biology Pub Date : 2020-09-19 eCollection Date: 2020-01-01 DOI: 10.1186/s13015-020-00178-x

John L Spouge, Joseph M Ziegelbauer, Mileidy Gonzalez

Abstract:

Background: Data about herpesvirus microRNA motifs on human circular RNAs suggested the following statistical question. Consider independent random counts, not necessarily identically distributed. Conditioned on the sum, decide whether one of the counts is unusually large. Exact computation of the p-value leads to a specific algorithmic problem. Given $n$ elements $g_0, g_1, \ldots, g_{n-1}$ in a set $G$ with the closure and associative properties and a commutative product without inverses, compute the jackknife (leave-one-out) products $\bar{g}_j = g_0 g_1 \cdots g_{j-1} g_{j+1} \cdots g_{n-1}$ ($0 \le j < n$).

Results: This article gives a linear-time Jackknife Product algorithm. Its upward phase constructs a standard segment tree for computing segment products like $g_{(i,j)} = g_i g_{i+1} \cdots g_{j-1}$; its novel downward phase mirrors the upward phase while exploiting the symmetry of $g_j$ and its complement $\bar{g}_j$. The algorithm requires storage for $2n$ elements of $G$ and only about $3n$ products. In contrast, the standard segment tree algorithms require about $n$ products for construction and $\log_2 n$ products for calculating each $\bar{g}_j$, i.e., about $n \log_2 n$ products in total; and a naïve quadratic algorithm using $n-2$ element-by-element products to compute each $\bar{g}_j$ requires $n$ …

Volume 15, article 17. Journal Article.
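The upward and downward phases described in the abstract can be illustrated with a short sketch. This is one possible reading of the scheme, not the authors' implementation: it uses a heap-style segment tree, keeps two arrays (so it does not reproduce the $2n$ storage bound), and the name `jackknife_products` is hypothetical. It uses about $3n$ calls to `op` and never needs an inverse or an identity element.

```python
def jackknife_products(items, op):
    """Leave-one-out products under an associative, commutative `op`.

    Sketch only: `op` is assumed associative and commutative; no inverse
    or identity element is used.
    """
    n = len(items)
    if n < 2:
        raise ValueError("need at least two elements")

    # Upward phase: heap-style segment tree. Leaves sit at tree[n..2n-1];
    # internal node i holds the product of its children 2i and 2i+1.
    tree = [None] * n + list(items)
    for i in range(n - 1, 0, -1):
        tree[i] = op(tree[2 * i], tree[2 * i + 1])

    # Downward phase: comp[i] is the product of every leaf NOT below node i.
    # The root's children are each other's complement.
    comp = [None] * (2 * n)
    comp[2], comp[3] = tree[3], tree[2]
    for i in range(2, n):
        # A child's complement is its parent's complement times its sibling.
        comp[2 * i] = op(comp[i], tree[2 * i + 1])
        comp[2 * i + 1] = op(comp[i], tree[2 * i])

    return comp[n:]  # complement of leaf j is the j-th jackknife product


print(jackknife_products([2, 3, 5, 7], lambda a, b: a * b))
print(jackknife_products([4, 1, 9, 3], max))
```

The second call uses `max`, an operation with no inverse, which is the situation the abstract targets: dividing the full product by each element is not available.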

Citations: 0

On an enhancement of RNA probing data using information theory.

IF 1, Zone 4 (Biology)

Algorithms for Molecular Biology Pub Date : 2020-08-07 eCollection Date: 2020-01-01 DOI: 10.1186/s13015-020-00176-z

Thomas J X Li, Christian M Reidys

Citations: 2

Algorithms for the quantitative Lock/Key model of cytoplasmic incompatibility.

IF 1, Zone 4 (Biology)

Algorithms for Molecular Biology Pub Date : 2020-07-22 eCollection Date: 2020-01-01 DOI: 10.1186/s13015-020-00174-1

Tiziana Calamoneri, Mattia Gastaldello, Arnaud Mary, Marie-France Sagot, Blerina Sinaimeri

Abstract: Cytoplasmic incompatibility (CI) relates to the manipulation by the parasite Wolbachia of its host reproduction. Despite its widespread occurrence, the molecular basis of CI remains unclear and theoretical models have been proposed to understand the phenomenon. We consider in this paper the quantitative Lock-Key model which currently represents a good hypothesis that is consistent with the data available. CI is in this case modelled as the problem of covering the edges of a bipartite graph with the minimum number of chain subgraphs. This problem is already known to be NP-hard, and we provide an exponential algorithm with a non-trivial complexity. It is frequent that, depending on the dataset, there may be many optimal solutions which can be biologically quite different among them. To rely on a single optimal solution may therefore be problematic. To this purpose, we address the problem of enumerating (listing) all minimal chain subgraph covers of a bipartite graph and show that it can be solved in quasi-polynomial time. Interestingly, in order to solve the above problems, we considered also the problem of enumerating all the maximal chain subgraphs of a bipartite graph and improved on the current results in the literature for the latter. Finally, to demonstrate the usefulness of our methods we show an application on a real dataset.

Volume 15, article 14. Journal Article.
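As a small illustration of the objects in this abstract: a bipartite graph is a chain graph when the neighbourhoods of the vertices on one side are nested (totally ordered by inclusion). The sketch below checks that property and verifies a proposed chain subgraph cover. It does not implement the paper's exponential or enumeration algorithms; the function names and the edge-set representation are assumptions for the example.

```python
from collections import defaultdict


def is_chain_graph(edges):
    """True if the bipartite graph given as (left, right) pairs is a chain graph."""
    nbrs = defaultdict(set)
    for left, right in edges:
        nbrs[left].add(right)
    # Sort neighbourhoods by size; they are nested iff each one is
    # contained in the next.
    ordered = sorted(nbrs.values(), key=len)
    return all(a <= b for a, b in zip(ordered, ordered[1:]))


def is_chain_cover(edges, parts):
    """True if every part is a chain subgraph and the parts cover exactly `edges`."""
    covered = set().union(*parts)
    return covered == set(edges) and all(is_chain_graph(p) for p in parts)


# Two disjoint edges form the smallest non-chain bipartite graph,
# so covering them needs two chain subgraphs.
g = {(1, "a"), (2, "b")}
print(is_chain_graph(g))
print(is_chain_cover(g, [{(1, "a")}, {(2, "b")}]))
```

Checking a cover is easy; the hardness noted in the abstract lies in finding a cover with the minimum number of parts.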

Citations: 1

Fast computation of genome-metagenome interaction effects.

IF 1, Zone 4 (Biology)

Algorithms for Molecular Biology Pub Date : 2020-07-01 eCollection Date: 2020-01-01 DOI: 10.1186/s13015-020-00173-2

Florent Guinot, Marie Szafranski, Julien Chiquet, Anouk Zancarini, Christine Le Signor, Christophe Mougel, Christophe Ambroise

Abstract:

Motivation: Association studies have been widely used to search for associations between common genetic variants observations and a given phenotype. However, it is now generally accepted that genes and environment must be examined jointly when estimating phenotypic variance. In this work we consider two types of biological markers: genotypic markers, which characterize an observation in terms of inherited genetic information, and metagenomic markers, which are related to the environment. Both types of markers are available in their millions and can be used to characterize any observation uniquely.

Objective: Our focus is on detecting interactions between groups of genetic and metagenomic markers in order to gain a better understanding of the complex relationship between environment and genome in the expression of a given phenotype.

Contributions: We propose a novel approach for efficiently detecting interactions between complementary datasets in a high-dimensional setting with a reduced computational cost. The method, named SICOMORE, reduces the dimension of the search space by selecting a subset of supervariables in the two complementary datasets. These supervariables are given by a weighted group structure defined on sets of variables at different scales. A Lasso selection is then applied on each type of supervariable to obtain a subset of potential interactions that will be explored via linear model testing.

Results: We compare SICOMORE with other approaches in simulations, with varying sample sizes, noise, and numbers of true interactions. SICOMORE exhibits convincing results in terms of recall, as well as competitive performances with respect to running time. The method is also used to detect interaction between genomic markers in Medicago truncatula and metagenomic markers in its rhizosphere bacterial community.

Software availability: An R package is available [4], along with its documentation and associated scripts, allowing the reader to reproduce the results presented in the paper.

Volume 15, article 13. Journal Article.

Citations: 2

Evolution through segmental duplications and losses: a Super-Reconciliation approach.

IF 1, Zone 4 (Biology)

Algorithms for Molecular Biology Pub Date : 2020-05-26 eCollection Date: 2020-01-01 DOI: 10.1186/s13015-020-00171-4

Mattéo Delabre, Nadia El-Mabrouk, Katharina T Huber, Manuel Lafond, Vincent Moulton, Emmanuel Noutahi, Miguel Sautie Castellanos

Abstract: The classical gene and species tree reconciliation, used to infer the history of gene gain and loss explaining the evolution of gene families, assumes an independent evolution for each family. While this assumption is reasonable for genes that are far apart in the genome, it is not appropriate for genes grouped into syntenic blocks, which are more plausibly the result of a concerted evolution. Here, we introduce the Super-Reconciliation problem which consists in inferring a history of segmental duplication and loss events (involving a set of neighboring genes) leading to a set of present-day syntenies from a single ancestral one. In other words, we extend the traditional Duplication-Loss reconciliation problem of a single gene tree, to a set of trees, accounting for segmental duplications and losses. Existence of a Super-Reconciliation depends on individual gene tree consistency. In addition, ignoring rearrangements implies that existence also depends on gene order consistency. We first show that the problem of reconstructing a most parsimonious Super-Reconciliation, if any, is NP-hard and give an exact exponential-time algorithm to solve it. Alternatively, we show that accounting for rearrangements in the evolutionary model, but still only minimizing segmental duplication and loss events, leads to an exact polynomial-time algorithm. We finally assess time efficiency of the former exponential time algorithm for the Duplication-Loss model on simulated datasets, and give a proof of concept on the opioid receptor genes.

Volume 15, article 12. Journal Article.

Citations: 8

The distance and median problems in the single-cut-or-join model with single-gene duplications.

IF 1, Zone 4 (Biology)

Algorithms for Molecular Biology Pub Date : 2020-05-04 eCollection Date: 2020-01-01 DOI: 10.1186/s13015-020-00169-y

Aniket C Mane, Manuel Lafond, Pedro C Feijao, Cedric Chauve

Abstract:

Background: In the field of genome rearrangement algorithms, models accounting for gene duplication often lead to hard problems. For example, while computing the pairwise distance is tractable in most duplication-free models, the problem is NP-complete for most extensions of these models accounting for duplicated genes. Moreover, problems involving more than two genomes, such as the genome median and the Small Parsimony problem, are intractable for most duplication-free models, with some exceptions, for example the Single-Cut-or-Join (SCJ) model.

Results: We introduce a variant of the SCJ distance that accounts for duplicated genes, in the context of directed evolution from an ancestral genome to a descendant genome where orthology relations between ancestral genes and their descendant are known. Our model includes two duplication mechanisms: single-gene tandem duplication and the creation of single-gene circular chromosomes. We prove that in this model, computing the directed distance and a parsimonious evolutionary scenario in terms of SCJ and single-gene duplication events can be done in linear time. We also show that the directed median problem is tractable for this distance, while the rooted median problem, where we assume that one of the given genomes is ancestral to the median, is NP-complete. We also describe an Integer Linear Program for solving this problem. We evaluate the directed distance and rooted median algorithms on simulated data.

Conclusion: Our results provide a simple genome rearrangement model, extending the SCJ model to account for single-gene duplications, for which we prove a mix of tractability and hardness results. For the NP-complete rooted median problem, we design a simple Integer Linear Program. Our publicly available implementation of these algorithms for the directed distance and median problems allows these problems to be solved efficiently on large instances.

Volume 15, article 8. Journal Article.
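For context, the baseline that this paper extends can be sketched briefly. In the classic duplication-free SCJ model, a genome is viewed as a set of adjacencies between gene extremities, and the distance between two genomes on the same genes is the number of adjacencies present in one but not the other (each needs one cut or one join). The code below shows only that baseline, not the paper's directed distance with duplications; the genome encoding (signed integers, a `circular` flag) and function names are assumptions for the example.

```python
def adjacencies(genome):
    """Set of adjacencies of a genome.

    `genome` is a list of (genes, circular) pairs, where `genes` is a list of
    signed non-zero integers; gene g read forwards runs tail -> head.
    """
    adj = set()
    for genes, circular in genome:
        pairs = list(zip(genes, genes[1:]))
        if circular and genes:
            pairs.append((genes[-1], genes[0]))  # close the circle
        for x, y in pairs:
            right_end = (abs(x), "h" if x > 0 else "t")  # where x ends
            left_end = (abs(y), "t" if y > 0 else "h")   # where y starts
            adj.add(frozenset((right_end, left_end)))
    return adj


def scj_distance(genome_a, genome_b):
    """Duplication-free SCJ distance: cuts needed plus joins needed."""
    a, b = adjacencies(genome_a), adjacencies(genome_b)
    return len(a - b) + len(b - a)


linear = [([1, 2, 3], False)]
inverted = [([1, -2, 3], False)]   # gene 2 reversed
split = [([1, 2], False), ([3], False)]
print(scj_distance(linear, inverted), scj_distance(linear, split))
```

Reversing gene 2 breaks both of its adjacencies and creates two new ones, while splitting off gene 3 is a single cut.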

Citations: 3

Non-parametric and semi-parametric support estimation using SEquential RESampling random walks on biomolecular sequences.

IF 1, Zone 4 (Biology)

Algorithms for Molecular Biology Pub Date : 2020-04-16 eCollection Date: 2020-01-01 DOI: 10.1186/s13015-020-00167-0

Wei Wang, Jack Smith, Hussein A Hejase, Kevin J Liu

Abstract: Non-parametric and semi-parametric resampling procedures are widely used to perform support estimation in computational biology and bioinformatics. Among the most widely used methods in this class is the standard bootstrap method, which consists of random sampling with replacement. While not requiring assumptions about any particular parametric model for resampling purposes, the bootstrap and related techniques assume that sites are independent and identically distributed (i.i.d.). The i.i.d. assumption can be an over-simplification for many problems in computational biology and bioinformatics. In particular, sequential dependence within biomolecular sequences is often an essential biological feature due to biochemical function, evolutionary processes such as recombination, and other factors. To relax the simplifying i.i.d. assumption, we propose a new non-parametric/semi-parametric sequential resampling technique that generalizes "Heads-or-Tails" mirrored inputs, a simple but clever technique due to Landan and Graur. The generalized procedure takes the form of random walks along either aligned or unaligned biomolecular sequences. We refer to our new method as the SERES (or "SEquential RESampling") method. To demonstrate the performance of the new technique, we apply SERES to estimate support for the multiple sequence alignment problem. Using simulated and empirical data, we show that SERES-based support estimation yields comparable or typically better performance compared to state-of-the-art methods.

Volume 15, article 7. Journal Article.

Citations: 4

Linear-time algorithms for phylogenetic tree completion under Robinson-Foulds distance.

IF 1, Zone 4 (Biology)

Algorithms for Molecular Biology Pub Date : 2020-04-13 eCollection Date: 2020-01-01 DOI: 10.1186/s13015-020-00166-1

Mukul S Bansal

Abstract:

Background: We consider two fundamental computational problems that arise when comparing phylogenetic trees, rooted or unrooted, with non-identical leaf sets. The first problem arises when comparing two trees where the leaf set of one tree is a proper subset of the other. The second problem arises when the two trees to be compared have only partially overlapping leaf sets. The traditional approach to handling these problems is to first restrict the two trees to their common leaf set. An alternative approach that has shown promise is to first complete the trees by adding missing leaves, so that the resulting trees have identical leaf sets. This requires the computation of an optimal completion that minimizes the distance between the two resulting trees over all possible completions.

Results: We provide optimal linear-time algorithms for both completion problems under the widely-used Robinson-Foulds (RF) distance measure. Our algorithm for the first problem improves the time complexity of the current fastest algorithm from quadratic (in the size of the two trees) to linear. No algorithms have yet been proposed for the more general second problem where both trees have missing leaves. We advance the study of this general problem by proposing a useful restricted version of the general problem and providing optimal linear-time algorithms for the restricted version. Our experimental results on biological data sets suggest that completion-based RF distances can be very different compared to traditional RF distances.

Volume 15, article 6. Journal Article.
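The "traditional approach" mentioned in the abstract, restricting both trees to their common leaf set before measuring RF distance, can be sketched for rooted trees as follows. This is not the paper's completion algorithm. Trees are assumed to be nested tuples with string leaf labels, the two trees are assumed to share at least one leaf, and the function names are hypothetical.

```python
def clusters(tree):
    """Return (leaf set, set of clusters) for a rooted tree of nested tuples."""
    if isinstance(tree, str):
        return frozenset([tree]), set()
    leaves, found = frozenset(), set()
    for child in tree:
        child_leaves, child_clusters = clusters(child)
        leaves |= child_leaves
        found |= child_clusters
    found.add(leaves)  # the cluster induced by this internal node
    return leaves, found


def restrict(tree, keep):
    """Prune leaves not in `keep` and suppress nodes left with one child."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kids = [k for k in (restrict(c, keep) for c in tree) if k is not None]
    if not kids:
        return None
    return kids[0] if len(kids) == 1 else tuple(kids)


def rf_distance(t1, t2):
    """Rooted RF distance: clusters found in exactly one of the two trees."""
    return len(clusters(t1)[1] ^ clusters(t2)[1])


def restricted_rf(t1, t2):
    """RF distance after restricting both trees to their shared leaves."""
    common = clusters(t1)[0] & clusters(t2)[0]
    return rf_distance(restrict(t1, common), restrict(t2, common))


big = ((("a", "b"), "c"), ("d", "e"))
small = (("a", "b"), ("c", "d"))
print(restricted_rf(big, small))
```

Here leaf `e` is simply discarded before comparison; a completion-based distance would instead add `e` to the smaller tree in the position that minimizes the distance, which is why the two measures can differ.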

Citations: 3

GrpClassifierEC: a novel classification approach based on the ensemble clustering space.

IF 1, Zone 4 (Biology)

Algorithms for Molecular Biology Pub Date : 2020-02-13 eCollection Date: 2020-01-01 DOI: 10.1186/s13015-020-0162-7

Loai Abdallah, Malik Yousef

Abstract:

Background: Advances in molecular biology have resulted in big and complicated data sets; therefore, a clustering approach that is able to capture the actual structure and the hidden patterns of the data is required. Moreover, the geometric space may not reflect the actual similarity between the different objects. As a result, in this research we use a clustering-based space that converts the geometric space of the molecular data to a categorical space based on clustering results. Then we use this space for developing a new classification algorithm.

Results: In this study, we propose a new classification method named GrpClassifierEC that replaces the given data space with a categorical space based on ensemble clustering (EC). The EC space is defined by tracking the membership of the points over multiple runs of clustering algorithms. Different points that were included in the same clusters will be represented as a single point. Our algorithm classifies all these points as a single class. The similarity between two objects is defined as the number of times that these objects did not belong to the same cluster. In order to evaluate our suggested method, we compare its results to the k nearest neighbors, Decision tree and Random forest classification algorithms on several benchmark datasets. The results confirm that the suggested new algorithm GrpClassifierEC outperforms the other algorithms.

Conclusions: Our algorithm can be integrated with many other algorithms. In this research, we use only the k-means clustering algorithm with different k values. In future research, we propose several directions: (1) checking the effect of the clustering algorithm used to build an ensemble clustering space, (2) finding poor clustering results based on the training data, (3) reducing the volume of the data by combining similar points based on the EC.

Availability and implementation: The KNIME workflow, implementing GrpClassifierEC, is available at https://malikyousef.com.

Volume 15, article 3. Journal Article.
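The ensemble clustering space described in the abstract can be illustrated without running any clustering: given the cluster labels each point received in several runs, a point's categorical representation is its tuple of labels, and points with identical tuples collapse into one group. This sketch is not the KNIME workflow; the label data are made up for illustration, and the function names are hypothetical.

```python
from collections import defaultdict


def ec_signatures(label_runs):
    """One tuple per point: its cluster label in each clustering run."""
    return list(zip(*label_runs))


def group_identical(signatures):
    """Map each distinct signature to the indices of the points sharing it."""
    groups = defaultdict(list)
    for index, signature in enumerate(signatures):
        groups[signature].append(index)
    return dict(groups)


def times_separated(sig_a, sig_b):
    """Number of runs in which two points fell into different clusters."""
    return sum(a != b for a, b in zip(sig_a, sig_b))


# Hypothetical labels for 4 points from 3 clustering runs
# (for example, k-means with different k values).
runs = [
    [0, 0, 1, 1],
    [1, 1, 0, 1],
    [2, 2, 2, 0],
]
sigs = ec_signatures(runs)
print(group_identical(sigs))
print(times_separated(sigs[2], sigs[3]))
```

Points 0 and 1 were never separated, so they share one signature and would be treated as a single point when a class is assigned.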

Citations: 1