{"title":"A novel random forests-based feature selection method for microarray expression data analysis","authors":"Dengju Yao, Jing Yang, Xiaojuan Zhan, Xiaorong Zhan, Zhiqiang Xie","doi":"10.1504/IJDMB.2015.070852","DOIUrl":"https://doi.org/10.1504/IJDMB.2015.070852","url":null,"abstract":"High-dimensional data and a large number of redundant features in bioinformatics research have created an urgent need for feature selection. In this paper, a novel random forests-based feature selection method is proposed that adopts the idea of stratifying feature space and combines generalised sequence backward searching and generalised sequence forward searching strategies. A random forest variable importance score is used to rank features, and different classifiers are used as a feature subset evaluating function. The proposed method is examined on five microarray expression datasets, including leukaemia, prostate, breast, nervous and DLBCL, and the average accuracies of the SVM classifier in these datasets are 100%, 95.24%, 85%, 91.67%, and 91.67%, respectively. The results show that the proposed method could not only improve the classification accuracy but also greatly reduce the computation time of the feature selection process.","PeriodicalId":54964,"journal":{"name":"International Journal of Data Mining and Bioinformatics","volume":null,"pages":null},"PeriodicalIF":0.3,"publicationDate":"2015-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1504/IJDMB.2015.070852","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"66730840","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Assessing protein-protein interactions based on the semantic similarity of interacting proteins","authors":"Guangyu Cui, Byungmin Kim, Saud Alguwaizani, Kyungsook Han","doi":"10.1504/IJDMB.2015.070842","DOIUrl":"https://doi.org/10.1504/IJDMB.2015.070842","url":null,"abstract":"The Gene Ontology (GO) has been used in estimating the semantic similarity of proteins since it has the largest and reliable vocabulary of gene products and characteristics. We developed a new method which can assess Protein-Protein Interactions (PPI) using the branching factor and information content of the common ancestor of interacting proteins in the GO hierarchy. We performed a comparative evaluation of the measure with other GO-based similarity measures and evaluation results showed that our method outperformed others in most GO domains.","PeriodicalId":54964,"journal":{"name":"International Journal of Data Mining and Bioinformatics","volume":null,"pages":null},"PeriodicalIF":0.3,"publicationDate":"2015-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1504/IJDMB.2015.070842","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"66730824","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"TrieAMD: a scalable and efficient apriori motif discovery approach","authors":"Isra M. Al-Turaiki, G. Badr, H. Mathkour","doi":"10.1504/IJDMB.2015.070833","DOIUrl":"https://doi.org/10.1504/IJDMB.2015.070833","url":null,"abstract":"Motif discovery is the problem of finding recurring patterns in biological sequences. It is one of the hardest and long-standing problems in bioinformatics. Apriori is a well-known data-mining algorithm for the discovery of frequent patterns in large datasets. In this paper, we apply the Apriori algorithm and use the Trie data structure to discover motifs. We propose several modifications so that we can adapt the classic Apriori to our problem. Experiments are conducted on Tompa's benchmark to investigate the performance of our proposed algorithm, the Trie-based Apriori Motif Discovery (TrieAMD). Results show that our algorithm outperforms all of the tested tools on real datasets for the average sensitivity measure, which means that our approach is able to discover more motifs. In terms of specificity, the performance of our algorithm is comparable to the other tools. The results also confirm both linear time and linear space scalability of the algorithm.","PeriodicalId":54964,"journal":{"name":"International Journal of Data Mining and Bioinformatics","volume":null,"pages":null},"PeriodicalIF":0.3,"publicationDate":"2015-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1504/IJDMB.2015.070833","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"66730692","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mitigating bias in planning two-colour microarray experiments","authors":"Nilgun Ferhatosmanoglu, T. Allen, Ümit V. Çatalyürek","doi":"10.1504/IJDMB.2015.070838","DOIUrl":"https://doi.org/10.1504/IJDMB.2015.070838","url":null,"abstract":"Two-colour microarrays are used to study differential gene expression on a large scale. Experimental planning can help reduce the chances of wrong inferences about whether genes are differentially expressed. Previous research on this problem has focused on minimising estimation errors (according to variance-based criteria such as A-optimality) on the basis of optimistic assumptions about the system studied. In this paper, we propose a novel planning criterion to evaluate existing plans for microarray experiments. The proposed criterion is 'Generalised-A Optimality' that is based on realistic assumptions that include bias errors. Using Generalised-A Optimality, the reference-design approach is likely to yield greater estimation accuracy in specific situations in which loop designs had previously seemed superior. However, hybrid designs are likely to offer higher estimation accuracy than reference, loop and interwoven designs having the same number of samples and slides. These findings are supported by data from both simulated and real microarray experiments.","PeriodicalId":54964,"journal":{"name":"International Journal of Data Mining and Bioinformatics","volume":null,"pages":null},"PeriodicalIF":0.3,"publicationDate":"2015-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1504/IJDMB.2015.070838","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"66730709","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An integrated strategy for functional analysis of microbial communities based on gene ontology and 16S rRNA gene","authors":"Suping Deng, De-shuang Huang","doi":"10.1504/IJDMB.2015.070841","DOIUrl":"https://doi.org/10.1504/IJDMB.2015.070841","url":null,"abstract":"We present an integrated strategy for analysing the similarity among microbial communities in terms of functional state after assigning 16S rRNA sequences from all microbial communities to species. It is an important addition to the species-level relationship between two compared communities and can quantify their differences in function. We downloaded all functional annotation data of several microbiotas. The strategy was developed to identify the functional distribution and the significantly enriched functional categories of microbial communities. We analysed the similarity between two microbial communities in terms of functional state. The experimental results show that semantic similarity can quantify the difference between two compared species at the functional level. The strategy can analyse the function of microbial communities by gene ontology based on the 16S rRNA gene. Exploration of the functional relationship between two sets of species assemblages will be a key result of microbiome studies and may provide new insights into the assembly of a wide range of ecosystems.","PeriodicalId":54964,"journal":{"name":"International Journal of Data Mining and Bioinformatics","volume":null,"pages":null},"PeriodicalIF":0.3,"publicationDate":"2015-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1504/IJDMB.2015.070841","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"66730763","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Gene function prediction with knowledge from gene ontology","authors":"Ying Shen, Lin Zhang","doi":"10.1504/IJDMB.2015.070840","DOIUrl":"https://doi.org/10.1504/IJDMB.2015.070840","url":null,"abstract":"Gene function prediction is an important problem in bioinformatics. Due to the inherent noise existing in the gene expression data, the attempt to improve the prediction accuracy resorting to new classification techniques is limited. With the emergence of Gene Ontology (GO), extra knowledge about the gene products can be extracted from GO and facilitates solving the gene function prediction problem. In this paper, we propose a new method which utilises GO information to improve the classifiers' performance in gene function prediction. Specifically, our method learns a distance metric under the supervision of the GO knowledge using the distance learning technique. Compared with the traditional distance metrics, the learned one produces a better performance and consequently classification accuracy can be improved. The effectiveness of our proposed method has been corroborated by the extensive experimental results.","PeriodicalId":54964,"journal":{"name":"International Journal of Data Mining and Bioinformatics","volume":null,"pages":null},"PeriodicalIF":0.3,"publicationDate":"2015-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1504/IJDMB.2015.070840","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"66730719","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DNA sequence and structure properties analysis reveals similarities and differences to promoters of stress responsive genes in Arabidopsis thaliana","authors":"P. Zhu, Yanhong Zhou, Libin Zhang, Chuang Ma","doi":"10.1504/IJDMB.2015.070832","DOIUrl":"https://doi.org/10.1504/IJDMB.2015.070832","url":null,"abstract":"Understanding regulatory mechanisms of stress response in plants has important biological and agricultural significance. In this study, we first compiled a set of genes responsive to different stresses in Arabidopsis thaliana and then comparatively analysed their promoters at both the DNA sequence and three-dimensional structure levels. Amazingly, the comparison revealed that the profiles of several sequence and structure properties vary distinctly in different regions of promoters. Moreover, the content of nucleotide T and the profile of B-DNA twist are distinct in promoters from different stress groups, suggesting Arabidopsis genes might exploit different regulatory mechanisms in response to various stresses. Finally, we evaluated the performance of two representative promoter predictors including EP3 and PromPred. The evaluation results revealed their strengths and weaknesses for identifying stress-related promoters, providing valuable guidelines to accelerate the discovery of novel stress-related promoters and genes in plants.","PeriodicalId":54964,"journal":{"name":"International Journal of Data Mining and Bioinformatics","volume":null,"pages":null},"PeriodicalIF":0.3,"publicationDate":"2015-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1504/IJDMB.2015.070832","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"66730639","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A mixture of physicochemical and evolutionary-based feature extraction approaches for protein fold recognition.","authors":"Abdollah Dehzangi, Alok Sharma, James Lyons, Kuldip K Paliwal, Abdul Sattar","doi":"10.1504/ijdmb.2015.066359","DOIUrl":"https://doi.org/10.1504/ijdmb.2015.066359","url":null,"abstract":"Recent advancement in the pattern recognition field stimulates enormous interest in Protein Fold Recognition (PFR). PFR is considered as a crucial step towards protein structure prediction and drug design. Despite all the recent achievements, the PFR still remains as an unsolved issue in biological science and its prediction accuracy still remains unsatisfactory. Furthermore, the impact of using a wide range of physicochemical-based attributes on the PFR has not been adequately explored. In this study, we propose a novel mixture of physicochemical and evolutionary-based feature extraction methods based on the concepts of segmented distribution and density. We also explore the impact of 55 different physicochemical-based attributes on the PFR. Our results show that by providing more local discriminatory information as well as obtaining benefit from both physicochemical and evolutionary-based features simultaneously, we can enhance the protein fold prediction accuracy up to 5% better than previously reported results found in the literature.","PeriodicalId":54964,"journal":{"name":"International Journal of Data Mining and Bioinformatics","volume":null,"pages":null},"PeriodicalIF":0.3,"publicationDate":"2015-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1504/ijdmb.2015.066359","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"33973465","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Concepts of relative sample outlier (RSO) and weighted sample similarity (WSS) for improving performance of clustering genes: co-function and co-regulation.","authors":"Anindya Bhattacharya, Nirmalya Chowdhury, Rajat K De","doi":"10.1504/ijdmb.2015.067322","DOIUrl":"https://doi.org/10.1504/ijdmb.2015.067322","url":null,"abstract":"Performance of clustering algorithms is largely dependent on the selected similarity measure. Efficiency in handling outliers is a major contributor to the success of a similarity measure. The better the ability of a similarity measure to measure similarity between genes in the presence of outliers, the better the performance of the clustering algorithm in forming biologically relevant groups of genes. In the present article, we discuss the problem of handling outliers with different existing similarity measures and introduce the concept of Relative Sample Outlier (RSO). We formulate a new similarity measure, called Weighted Sample Similarity (WSS), incorporated in Euclidean distance and Pearson correlation coefficient, and then use them in various clustering and biclustering algorithms to group different gene expression profiles. Our results suggest that WSS improves performance, in terms of finding biologically relevant groups of genes, of all the considered clustering algorithms.","PeriodicalId":54964,"journal":{"name":"International Journal of Data Mining and Bioinformatics","volume":null,"pages":null},"PeriodicalIF":0.3,"publicationDate":"2015-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1504/ijdmb.2015.067322","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"34039166","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Ensemble of sparse classifiers for high-dimensional biological data.","authors":"Sunghan Kim, Fabien Scalzo, Donatello Telesca, Xiao Hu","doi":"10.1504/ijdmb.2015.069416","DOIUrl":"https://doi.org/10.1504/ijdmb.2015.069416","url":null,"abstract":"Biological data are often high in dimension while the number of samples is small. In such cases, the performance of classification can be improved by reducing the dimension of data, which is referred to as feature selection. Recently, a novel feature selection method has been proposed utilising the sparsity of high-dimensional biological data where a small subset of features accounts for most variance of the dataset. In this study we propose a new classification method for high-dimensional biological data, which performs both feature selection and classification within a single framework. Our proposed method utilises a sparse linear solution technique and the bootstrap aggregating algorithm. We tested its performance on four public mass spectrometry cancer datasets along with two other conventional classification techniques such as Support Vector Machines and Adaptive Boosting. The results demonstrate that our proposed method performs more accurate classification across various cancer datasets than those conventional classification techniques.","PeriodicalId":54964,"journal":{"name":"International Journal of Data Mining and Bioinformatics","volume":null,"pages":null},"PeriodicalIF":0.3,"publicationDate":"2015-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1504/ijdmb.2015.069416","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"34123510","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}