{"title":"Bayesian Collective Markov Random Fields for Subcellular Localization Prediction of Human Proteins","authors":"Lu Zhu, M. Ester","doi":"10.1145/3107411.3107412","DOIUrl":"https://doi.org/10.1145/3107411.3107412","url":null,"abstract":"Advanced biotechnology makes it possible to access a multitude of heterogeneous proteomic, interactomic, genomic, and functional annotation data. One challenge in computational biology is to integrate these data to enable automated prediction of the Subcellular Localizations (SCL) of human proteins. For proteins that have multiple biological roles, their correct in silico assignment to different SCL can be considered as an imbalanced multi-label classification problem. In this study, we developed a Bayesian Collective Markov Random Fields (BCMRFs) model for multi-SCL prediction of human proteins. Given a set of unknown proteins and their corresponding protein-protein interaction (PPI) network, the SCLs of each protein can be inferred by the SCLs of its interacting partners. To do so, we integrate PPIs, the adjacency of SCLs and protein features, and perform transductive learning on the re-balanced dataset. Our experimental results show that the spatial adjacency of the SCLs improves multi-SCL prediction, especially for the SCLs with few annotated instances. 
Our approach outperforms the state-of-the-art PPI-based and feature-based multi-SCL prediction methods for human proteins.","PeriodicalId":246388,"journal":{"name":"Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology,and Health Informatics","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114765455","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GPU-PCC: A GPU Based Technique to Compute Pairwise Pearson's Correlation Coefficients for Big fMRI Data","authors":"Taban Eslami, M. Awan, F. Saeed","doi":"10.1145/3107411.3108173","DOIUrl":"https://doi.org/10.1145/3107411.3108173","url":null,"abstract":"Functional Magnetic Resonance Imaging (fMRI) is a non-invasive brain imaging technique for studying the brain's functional activities. Pearson's Correlation Coefficient is an important measure for capturing dynamic behaviors and functional connectivity between brain components. One bottleneck in computing Correlation Coefficients is the time it takes to process big fMRI data. In this paper, we propose GPU-PCC, a GPU-based algorithm built on the vector dot product, which computes pairwise Pearson's Correlation Coefficients while performing the computation only once for each pair. Our method computes Correlation Coefficients in an ordered fashion, with no need for post-processing to reorder the coefficients. We evaluated GPU-PCC using synthetic and real fMRI data and compared it with a sequential CPU implementation and an existing state-of-the-art GPU method. We show that GPU-PCC runs 94.62x faster than the CPU version and 4.28x faster than the existing GPU-based technique on a real fMRI dataset of 90k voxels. 
The code is available under the GPL license on our lab's GitHub page at https://github.com/pcdslab/GPU-PCC.","PeriodicalId":246388,"journal":{"name":"Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology,and Health Informatics","volume":"220 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115044675","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deep Residual Nets for Improved Alzheimer's Diagnosis","authors":"Aly A. Valliani, Ameet Soni","doi":"10.1145/3107411.3108224","DOIUrl":"https://doi.org/10.1145/3107411.3108224","url":null,"abstract":"We propose a framework that leverages deep residual CNNs pretrained on large, non-biomedical image data sets. These pretrained networks learn cross-domain features that improve low-level interpretation of images. We evaluate our model on brain imaging data and show that pretraining and the use of deep residual networks are crucial to seeing large improvements in Alzheimer's Disease diagnosis from brain MRIs.","PeriodicalId":246388,"journal":{"name":"Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology,and Health Informatics","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123514199","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automated Protein Chain Isolation from 3D Cryo-EM Data and Volume Comparison Tool","authors":"Michael Nissenson, Dong Si","doi":"10.1145/3107411.3107500","DOIUrl":"https://doi.org/10.1145/3107411.3107500","url":null,"abstract":"In electron cryo-microscopy (cryo-EM), manual isolation of volumetric protein density map data surrounding known protein structures is a time-consuming process that requires constant expert attention for multiple hours. This paper presents a tool, Volume Cut, and an algorithm to automatically isolate the volumetric data surrounding individual protein chains from the entire macro-molecular complex that runs in just minutes. This tool can be used in the data collection and data pre-processing steps to generate good training datasets of single chain volume-structure pairs, which can be further used for the study of protein structure prediction from experimental 3D cryo-EM density maps using data mining and machine learning. Additionally, an application of this tool was explored in depth that compares the cut experimental cryo-EM data with simulated data in an attempt to find irregularities of experimental data for the purpose of validation. The source for both tools can be found at https://github.com/nissensonm/VolumeCut/.","PeriodicalId":246388,"journal":{"name":"Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology,and Health Informatics","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125898721","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GOstruct 2.0: Automated Protein Function Prediction for Annotated Proteins","authors":"Indika Kahanda, A. Ben-Hur","doi":"10.1145/3107411.3107417","DOIUrl":"https://doi.org/10.1145/3107411.3107417","url":null,"abstract":"Automated Protein Function Prediction is the task of automatically predicting functional annotations for a protein based on gold-standard annotations derived from experimental assays. These experiment-based annotations accumulate over time: proteins without annotations get annotated, and new functions of already annotated proteins are discovered. Therefore, function prediction can be considered a combination of two sub-tasks: making predictions on annotated proteins and making predictions on previously unannotated proteins. In previous work, we analyzed the performance of several protein function prediction methods in these two scenarios. Our results showed that GOstruct, which is based on the structured output framework, had lower accuracy in the task of predicting annotations for proteins with existing annotations, while its performance on un-annotated proteins was similar to the performance in cross-validation. In this work, we present GOstruct 2.0 which includes improvements that allow the model to make use of information of a protein's current annotations to better handle the task of predicting novel annotations for previously annotated proteins. This is highly important for model organisms where most proteins have some level of annotations. Experimental results on human data show that GOstruct 2.0 outperforms the original GOstruct in this task, demonstrating the effectiveness of the proposed improvements. 
This is the first study that focuses on adapting the structured output framework for applications in which labels are incomplete by nature.","PeriodicalId":246388,"journal":{"name":"Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology,and Health Informatics","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129382818","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bayesian Hyperparameter Optimization for Machine Learning Based eQTL Analysis","authors":"Andrew Quitadamo, James Johnson, Xinghua Shi","doi":"10.1145/3107411.3107434","DOIUrl":"https://doi.org/10.1145/3107411.3107434","url":null,"abstract":"Machine learning methods are being applied to a wide range of problems in biology and bioinformatics. These methods often rely on configuring high level parameters, or hyperparameters, such as regularization hyperparameters in sparse learning models like graph-guided multitask Lasso methods. Different choices for these hyperparameters will lead to different results, which makes finding good hyperparameter combinations an important task when using these hyperparameter dependent methods. There are several different ways to tune hyperparameters including manual tuning, grid search, random search, and Bayesian optimization. In this paper, we apply three hyperparameter tuning strategies to eQTL analysis including grid and random search in addition to Bayesian optimization. Experiments show that the Bayesian optimization strategy outperforms the other strategies in modeling eQTL associations. 
Applying this strategy to assess eQTL associations using the 1000 Genomes structural variation genotypes and RNAseq data in gEUVADIS, we identify a set of new SVs associated with gene expression changes in a human population.","PeriodicalId":246388,"journal":{"name":"Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology,and Health Informatics","volume":"88 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127148742","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Multi-view Deep Learning Method for Epileptic Seizure Detection using Short-time Fourier Transform","authors":"Ye Yuan, Guangxu Xun, Ke-bin Jia, Aidong Zhang","doi":"10.1145/3107411.3107419","DOIUrl":"https://doi.org/10.1145/3107411.3107419","url":null,"abstract":"With the advances in pervasive sensor technologies, physiological signals can be captured continuously to prevent the serious outcomes caused by epilepsy. Detection of epileptic seizure onset on collected multi-channel electroencephalogram (EEG) has attracted considerable attention recently. Deep learning is a promising method to analyze large-scale unlabeled data. In this paper, we propose a multi-view deep learning model to capture brain abnormality from multi-channel epileptic EEG signals for seizure detection. Specifically, we first generate EEG spectrograms using short-time Fourier transform (STFT) to represent the time-frequency information after signal segmentation. Second, we adopt stacked sparse denoising autoencoders (SSDA) to learn multiple features in an unsupervised manner by considering both intra and inter correlation of EEG channels, denoted as intra-channel and cross-channel features, respectively. Third, we add an SSDA-based channel selection procedure using a proposed response rate to reduce the dimension of the intra-channel features. Finally, we concatenate the learned multi-features and apply a fully-connected SSDA model with softmax classifier to jointly learn the cross-patient seizure detector in a supervised fashion. To evaluate the performance of the proposed model, we carry out experiments on a real world benchmark EEG dataset and compare it with six baselines. 
Extensive experimental results demonstrate that the proposed learning model is able to extract latent features with meaningful interpretation, and hence is effective in detecting epileptic seizures.","PeriodicalId":246388,"journal":{"name":"Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology,and Health Informatics","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128949886","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Analysis of 16S Genomic Data using Graphical Databases","authors":"O. Ahern, Rebecca J. Stevick, Li Yuan, Noah M. Daniels","doi":"10.1145/3107411.3108208","DOIUrl":"https://doi.org/10.1145/3107411.3108208","url":null,"abstract":"Since the Human Genome Project was completed in 2003, many data scientists have developed algorithms in order to store and query high volumes of genomic data. The most common data storage techniques employed in these algorithms are flat files or relational databases. While sophisticated indexing techniques can accelerate queries, an alternative is to store biological sequence data directly in a way that supports efficient queries. Here we introduce a new algorithm that aims to compress the redundant information and improve the performance of query speed with the help of graphical databases, which have been commercially available since the mid-to-late 2000s. A graphical database stores information using nodes and relationships (edges). Our approach is to identify subsequences that are common among many sequences, and to store these as \"common nodes\" in the graphical database. This is accomplished for sequencing data as follows: split the whole sequence into k-mers: if a given k-mer is common to enough sequences, then it is labeled as a common segment; if a k-mer is unique (or common to too few sequences), then it is labeled as a single segment. Thus, common nodes and single nodes are formed from common segments and single segments, respectively. These two kinds of nodes are connected by edges in the graphical database, allowing each original sequence to be reconstructed by following edges in the graph. This graphical database model allows for fast taxonomic queries of 16S rDNA. When queried, the database can first attempt to find common nodes that match the query sequence, and subsequently follow edges to single nodes to refine the search. 
This approach is analogous to that of \"compressive genomics\", except that the compression is implicit in the graphical database storage model. Beyond simple sequence queries, this graphical database representation also supports variability analysis, which identifies highly variable vs. conserved regions of 16S sequence. Regions of low variability correspond to common nodes, while regions of high variability correspond to a variety of paths through single nodes. Figure illustrates common and single nodes, and a corresponding plot of variability. Benchmarking of sequence search indicates that query time in graphical databases is significantly faster than in flat files or relational databases. Implementation of graphical databases in genomic data analysis will allow for accelerated search, and may lend itself to other forms of efficient analysis, such as tetramer frequency analysis, which is useful in metagenomic binning.","PeriodicalId":246388,"journal":{"name":"Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology,and Health Informatics","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132154873","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Detection of Differential Abundance Intervals in Longitudinal Metagenomic Data Using Negative Binomial Smoothing Spline ANOVA","authors":"Ahmed A. Metwally, P. Finn, Yang Dai, D. Perkins","doi":"10.1145/3107411.3107429","DOIUrl":"https://doi.org/10.1145/3107411.3107429","url":null,"abstract":"Metagenomic longitudinal studies have become a widely-used study design to investigate the dynamics of the microbial ecological systems and their temporal effects. One of the important questions to be addressed in longitudinal studies is the identification of time intervals when microbial features show changes in their abundance. We propose a statistical method that is based on a semi-parametric Smoothing Spline ANOVA and negative binomial distribution to model the time-course of the features between two phenotypes. We demonstrate the superior performance of our proposed method compared to the two currently existing methods using simulated data. We present the analysis results of our proposed method in an analysis of a longitudinal dataset that investigates the association between the development of type 1 diabetes in infants and the gut microbiome. The identified significant species and their specific time intervals reveal new information that can be used in improving intervention or treatment plans.","PeriodicalId":246388,"journal":{"name":"Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology,and Health Informatics","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130942891","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Analysis of Controls in ChIP-seq","authors":"Aseel Awdeh, T. Perkins","doi":"10.1145/3107411.3108230","DOIUrl":"https://doi.org/10.1145/3107411.3108230","url":null,"abstract":"The chromatin immunoprecipitation followed by high throughput sequencing (ChIP-seq) method, initially introduced a decade ago, is widely used by the scientific community to detect protein/DNA binding and histone modifications across the genome in various cell lines. Every experiment is prone to noise and bias, and ChIP-seq experiments are no exception. To alleviate bias, incorporation of control datasets in ChIP-seq analysis is an essential step. The controls are used to detect background signal, whilst the ChIP-seq experiment captures the true binding or histone modification signal. However, a recurrent issue is the existence of noise and bias in the controls themselves, as well as different types of bias in ChIP-seq experiments. Thus, depending on which controls are used, peak calling can produce different results (i.e., binding site positions) for the same ChIP-seq experiment. Consequently, generating \"smart\" controls, which model the non-signal effect for a specific ChIP-seq experiment, could enhance contrast and thus increase the reliability and reproducibility of the results. Our analysis aims to improve our understanding of ChIP-seq controls and their biases. We use unsupervised clustering and dimensionality reduction techniques to compare 160 controls for the K562 cell line in the ENCODE project, finding distinct groupings of controls which correlate to experimental characteristics. To customize a control for each ChIP-seq experiment, we use LASSO regression to fit a sparse set of controls to each of 500 ChIP-seq experiments (again, from ENCODE data for the K562 cell line). We look at how many controls are selected, which controls are used per ChIP-seq experiment, and how they are related to the different ChIP-seq experiment characteristics. 
Perhaps most surprisingly, we find that the LASSO models are not particularly sparse, often including half of the possible controls to model any given ChIP-seq. Cross-validation as well as testing with smaller sets of candidate controls proves that such large numbers of controls are beneficial for modeling ChIP-seq background distributions. We also observe clusters of ChIP-seq experiments that tend to rely on clusters of controls, and we look at the experimental characteristics that tend to cause a given control to be useful in modeling the background of a given ChIP-seq experiment. Through these analyses, we attempt to answer largely-unstudied questions regarding how much control data and of what types are useful in ChIP-seq analysis, and how suitable controls can be matched to ChIP-seq datasets.","PeriodicalId":246388,"journal":{"name":"Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology,and Health Informatics","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129288684","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}