{"title":"HMMeta","authors":"Sola Gbenro, Kyle Hippe, Renzhi Cao","doi":"10.1145/3388440.3414702","DOIUrl":"https://doi.org/10.1145/3388440.3414702","url":null,"abstract":"As the body of genomic product data grows much faster than it can be annotated, computational analysis of protein function has never been more important. In this research, we introduce HMMeta, a novel protein function prediction method based on Hidden Markov Models (HMMs), a prominent natural language prediction technique. With a new representation of protein sequence as a language, we trained a unique HMM for each Gene Ontology (GO) term taken from the UniProt database, which has 27,451 unique GO IDs in total, leading to the creation of 27,451 Hidden Markov Models. We employed data augmentation to artificially inflate the number of protein sequences associated with GO terms that have few sequences in the database, which helped to balance the number of protein sequences associated with each GO term. Predictions are made by running the sequence against each model created. The models within eighty percent of the top scoring model, or the 75 models with the highest scores, whichever is fewer, represent the functions most associated with the given sequence. We benchmarked our method in the latest Critical Assessment of protein Function Annotation (CAFA 4) experiment as CaoLab2, and we also evaluated HMMeta against several other protein function prediction methods on a subset of the UniProt database. HMMeta achieved favorable results as a sequence-based method and outperformed a few notable methods in some categories in our evaluation, which shows great potential for automated protein function prediction. 
The tool is available at https://github.com/KPHippe/HMM-For-Protein-Prediction.","PeriodicalId":411338,"journal":{"name":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics","volume":"86 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115556614","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Refinement of G protein-coupled receptor structure models: Improving the prediction of loop conformations and the virtual ligand screening performances","authors":"Bhumika Arora","doi":"10.1145/3388440.3414920","DOIUrl":"https://doi.org/10.1145/3388440.3414920","url":null,"abstract":"G protein-coupled receptors (GPCRs) constitute the largest superfamily of membrane proteins. They mediate most of the physiological processes of the human body and form the largest group of potential drug targets. Therefore, knowledge of their three-dimensional structure is important for structure-based drug design. Due to the limited availability of the experimental structures of GPCRs, computational methods are often used for deriving the structural information. GPCRs have a common structural topology that is comprised of seven transmembrane helices interconnected by intra- and extracellular loops. Homology modeling is the computational approach that is commonly used for modeling the transmembrane helical domains of GPCRs. Depending upon the quality of template used, these homology models exhibit varying degrees of inaccuracies. We have previously explored the extent to which inaccuracies present in homology models of the transmembrane helical domains of GPCRs can affect loop prediction [1]. We have also investigated the effect of presence and absence of other extracellular loops on individual loop modeling. We found that loop prediction in GPCR models is much more difficult than loop reconstruction in crystal structures because of the imprecise positioning of loop anchors in the models, although modeling an extracellular loop in the presence of other extracellular loops helps in improving the accuracy of its prediction. Therefore, reducing the errors in loop anchors is crucial for GPCR structure prediction. 
To address this and to improve the usability of GPCR homology models for structure-based drug design, we have developed a Ligand Directed Modeling (LDM) method that involves geometric protein sampling and ligand docking. The method was evaluated for its capacity to refine GPCR models built across a range of templates with varying degrees of sequence similarity to the target. LDM reduced the errors in loop anchor positions and improved the performance of these models in virtual ligand screenings. Thus, this Ligand Directed Modeling method is efficient in improving the quality of GPCR structure models.","PeriodicalId":411338,"journal":{"name":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114776452","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Using Patient Information for the Prediction of Caregiver Burden in Amyotrophic Lateral Sclerosis","authors":"A. Antoniadi, M. Galvin, M. Heverin, O. Hardiman, C. Mooney","doi":"10.1145/3388440.3414908","DOIUrl":"https://doi.org/10.1145/3388440.3414908","url":null,"abstract":"The aim of this study is to create a Clinical Decision Support System (CDSS) to assist in the early identification and support of caregivers at risk of experiencing burden while caring for a person with Amyotrophic Lateral Sclerosis. We work towards a system that uses a minimum amount of data that could be routinely collected. We investigated if the impairment of patients alone provides sufficient information for the prediction of caregiver burden. Results reveal a better performance of our system in identifying those at risk of high burden, but more information is needed for an accurate CDSS.","PeriodicalId":411338,"journal":{"name":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics","volume":"62 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114718659","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ProLanGO2","authors":"Kyle Hippe, Sola Gbenro, Renzhi Cao","doi":"10.1145/3388440.3414701","DOIUrl":"https://doi.org/10.1145/3388440.3414701","url":null,"abstract":"Predicting protein function from protein sequence is a major challenge in computational biology. Traditional methods that search protein sequences against existing databases may not work well in practice, particularly when little or no homology exists in the database. We introduce the ProLanGO2 method, which uses natural language processing and machine learning techniques to tackle the protein function prediction problem with protein sequence as input. Our method has been benchmarked blindly in the latest Critical Assessment of protein Function Annotation algorithms (CAFA 4) experiment. There are a few changes compared to the old version of ProLanGO. First, the latest version of the UniProt database is used. Second, the UniProt database is filtered by the newly created fragment sequence database (FSD) to prepare for the protein sequence language. Third, the Encoder-Decoder network, a model consisting of two RNNs (encoder and decoder), is used to train models on the dataset. Fourth, if no k-mers of a protein sequence exist in the FSD, we select the ten GO terms with the highest probability across all sequences from the UniProt database that did not contain any k-mers in the FSD, and use those ten GO terms as a backup prediction for the new protein sequence. Finally, we selected the 100 best-performing models and explored all combinations of those models to select the best-performing ensemble model. We benchmarked those combinations of models on the CAFA 3 dataset and selected the three top-performing ensemble models for prediction in the latest CAFA 4 experiment as CaoLab. 
We have also evaluated the performance of our ProLanGO2 method on 253 unseen sequences taken from the UniProt database and compared it with several other protein function prediction methods; the results show that our method achieves great performance among sequence-based protein function prediction methods. Our method is available on GitHub: https://github.com/caorenzhi/ProLanGO2.git.","PeriodicalId":411338,"journal":{"name":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics","volume":"84 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117261141","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Divide and Conquer Algorithm for Electron Microscopy Segmentation","authors":"Ruba Jebril, Yingde Zhu, Wei Chen, K. Al Nasr","doi":"10.1145/3388440.3414700","DOIUrl":"https://doi.org/10.1145/3388440.3414700","url":null,"abstract":"Cryo-electron microscopy is a biophysical technique that visualizes macromolecular complexes by producing 3-dimensional images. It has become the second most popular technique for constructing protein molecules in terms of the number of structures released annually. Its main advantage is its ability to visualize large molecules and molecules that are hard to crystallize in their native environment. One critical step in constructing the structure of a molecule from cryo-electron microscopy is to divide the image into regions for the chains/subunits that make up the molecule/complex. If the image is accurately segmented into the correct regions, modelling with existing tools becomes easier and faster. In this paper, we developed a divide-and-conquer algorithm to segment a given cryo-electron microscopy image efficiently. Our approach is based on the popular watershed algorithm. We tested our method on 10 authentic images and compared it with Segger. 
Although it is difficult to conduct an accurate comparison, the results show that the performance of our algorithm is competitive with that of Segger.","PeriodicalId":411338,"journal":{"name":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117220596","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"How to build regulatory networks from single-cell gene expression data","authors":"Aditya Pratapa, A. Jalihal, Jeffrey N. Law, Aditya Bharadwaj, T. Murali","doi":"10.1145/3388440.3414213","DOIUrl":"https://doi.org/10.1145/3388440.3414213","url":null,"abstract":"Over a dozen methods have been developed to infer gene regulatory networks (GRNs) from single-cell RNA-seq data. An experimentalist seeking to analyze a new dataset faces a daunting task in selecting an appropriate inference method since there are no widely accepted ground-truth datasets for assessing algorithm accuracy and the criteria for evaluation and comparison of methods are varied. We have developed BEELINE, a comprehensive evaluation of state-of-the-art algorithms for inferring GRNs from single-cell transcriptomic data [1]. BEELINE incorporates 12 diverse algorithms for GRN inference. It provides an easy-to-use and uniform interface to each method in the form of a Docker image. BEELINE implements several measures for estimating and comparing the accuracy, stability, and efficiency of these algorithms. Thus, BEELINE facilitates reproducible, rigorous, and extensible evaluations of GRN inference methods. We selected (a) synthetic networks with predictable cellular trajectories, (b) literature-curated Boolean models, and (c) diverse transcriptional regulatory and functional interaction networks to serve as the ground truth for evaluating the accuracy of GRN inference algorithms. We developed a strategy to simulate single-cell gene expression data from the first two types of networks. We used multiple experimental single-cell RNA-seq datasets in conjunction with the third type of network. Our evaluations suggest that the area under the precision-recall curve and early precision of these algorithms are moderate. Techniques that do not require pseudotime-ordered cells are generally more accurate. Based on these results, we present recommendations to end users of GRN inference methods. 
Finally, we discuss the potential of supervised algorithms for GRN inference.","PeriodicalId":411338,"journal":{"name":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125880074","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"TLSurv","authors":"Yixing Jiang, Kristen Alford, Frank Ketchum, L. Tong, May D. Wang","doi":"10.1145/3388440.3412422","DOIUrl":"https://doi.org/10.1145/3388440.3412422","url":null,"abstract":"Lung cancer is one of the leading cancers, but its survival models have not been explored to the same extent as those of other cancers such as breast cancer. In this study, we develop a super-hybrid network called TLSurv to integrate Copy Number Variation, DNA methylation, mRNA expression, and miRNA expression data for TCGA-LUAD datasets. The modularity of this super-hybrid network allows the integration of multiple -omics modalities with tremendous dimensional differences. Additionally, a novel training scheme called multi-stage transfer learning is used to train this super-hybrid network incrementally. This allows a large network with many subnetworks to be trained using a relatively small dataset. At each stage, a shallow subnetwork is trained, and these networks are combined to form a powerful prediction network. The results show that combining DNA methylation data with either mRNA or miRNA expression data produces promising performance, with C-indexes of around 0.7. This performance is better than that reported in previous studies. Interpretability analysis confirms the clinical significance of some of the biomarkers identified. In addition, some novel biomarkers are suggested for future medical research. 
These findings reveal the potential of the super-hybrid network for integrating multiple data modalities and the potential of multi-stage transfer learning for addressing the \"curse of dimensionality.\"","PeriodicalId":411338,"journal":{"name":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130103765","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"EXAM: An Explainable Attention-based Model for COVID-19 Automatic Diagnosis","authors":"Wenqi Shi, L. Tong, Yuchen Zhuang, Yuanda Zhu, May D. Wang","doi":"10.1145/3388440.3412455","DOIUrl":"https://doi.org/10.1145/3388440.3412455","url":null,"abstract":"The ongoing coronavirus disease 2019 (COVID-19) pandemic is still spreading rapidly and has caused over 7,000,000 infection cases and 400,000 deaths around the world. To build a fast and reliable COVID-19 diagnosis system, researchers have turned to machine learning to establish computer-aided diagnosis systems based on radiological imaging techniques such as X-ray imaging and computed tomography imaging. Although artificial intelligence based architectures have achieved great improvements in performance, most of the models still appear as a black box to researchers. In this paper, we propose an Explainable Attention-based Model (EXAM) for COVID-19 automatic diagnosis with convincing visual interpretation. We transform the diagnosis process with radiological images into an image classification problem that differentiates COVID-19, normal, and community-acquired pneumonia (CAP) cases. By combining channel-wise and spatial-wise attention mechanisms, the proposed approach can effectively extract key features and suppress irrelevant information. 
Experimental results and visualizations indicate that EXAM outperforms recent state-of-the-art models and demonstrate its interpretability.","PeriodicalId":411338,"journal":{"name":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127919170","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hadoop-CNV-RF: A Scalable Copy Number Variation Detection Tool for Next-Generation Sequencing Data","authors":"Getiria Onsongo, H. Lam, Matthew Bower, B. Thyagarajan","doi":"10.1145/3388440.3414861","DOIUrl":"https://doi.org/10.1145/3388440.3414861","url":null,"abstract":"Detection of small copy number variations (CNVs) in clinically relevant genes is routinely being used to aid diagnosis. We recently developed a tool, CNV-RF, capable of detecting clinically relevant CNVs with a high degree of sensitivity. CNV-RF implementation was designed for small gene panels and did not scale to large gene panels. Analyzing large gene panels with several hundred genes routinely failed due to memory limitations on a single computer, and, when successful, analysis took on average over 24 hours, making it impractical for routine use in the clinic. We need a reliable tool capable of accurately identifying clinically relevant CNVs on large gene panels within a more practical time frame. We have developed Hadoop-CNV-RF, a freely available, scalable, and more user-friendly implementation of CNV-RF capable of rapidly analyzing large datasets. Hadoop-CNV-RF takes advantage of Hadoop, a framework developed to analyze large amounts of data. In its implementation, we demonstrate the feasibility of developing scalable pipelines on Hadoop that integrate popular bioinformatics software developed for usage on traditional single-user computers without the need for special-purpose routines developed for Hadoop. Results show that Hadoop-CNV-RF reduces analysis time on large gene panels from over 24 hours to about 4 hours on a 20 node Hadoop cluster. Additionally, we demonstrate its ability to scale by analyzing a whole-exome dataset with close to a billion reads. Hadoop-CNV-RF has been clinically validated for large gene panels (up to 4800 genes) and is currently being used in the clinic. 
It is publicly available at: https://github.com/getiria-onsongo/hadoopcnvrf-public.","PeriodicalId":411338,"journal":{"name":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics","volume":"277 19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126708844","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CanMod","authors":"Duc Do, S. Bozdag","doi":"10.1145/3388440.3415586","DOIUrl":"https://doi.org/10.1145/3388440.3415586","url":null,"abstract":"Transcription factors (TFs) and microRNAs (miRNAs) are two important classes of gene regulators that govern many critical biological processes. Dysregulation of TF-gene and miRNA-gene interactions can lead to the development of multiple diseases, including cancer. Many studies have aimed to identify interactions between target genes and their regulators in both normal and disease settings. However, few studies have attempted to elucidate the collaborative relationship between TFs and miRNAs in regulating genes involved in cancer-associated biological processes. Identifying the co-regulatory functions of those regulators in cancer would provide a better understanding of gene regulation at different layers and may also suggest better approaches for targeted therapy. This study proposes a computational pipeline called CanMod to identify cancer-associated gene regulatory modules. CanMod was designed to infer gene regulatory modules that meet three criteria. First, within a module, target genes should be involved in similar biological processes; thus, the modules are distinguishable based on their biological functions. Second, the expression of target genes in a module should be collectively dependent on the expression of their regulators. Third, a regulator and a target should be allowed to appear in multiple modules to reflect the diverse biological roles that the genes and the regulators may play. CanMod also incorporates other regulatory factors such as copy number alteration and DNA methylation data to infer regulator-target gene interactions with higher accuracy. We applied CanMod to the breast cancer dataset (BRCA) from The Cancer Genome Atlas (TCGA). 
We found that modules identified by CanMod were associated with distinguishable biological functions and that the expression of target genes in the modules was significantly correlated. In addition, many hub regulators in CanMod were known cancer genes, and CanMod was able to find experimentally validated regulator-target interactions.","PeriodicalId":411338,"journal":{"name":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114201974","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}