M. Othman, Sujay Ratna, Anant Tewari, Anthony M. Kang, I. Vaisman
{"title":"Machine Learning Classification of Antimicrobial Peptides Using Reduced Alphabets","authors":"M. Othman, Sujay Ratna, Anant Tewari, Anthony M. Kang, I. Vaisman","doi":"10.1145/3233547.3233657","DOIUrl":"https://doi.org/10.1145/3233547.3233657","url":null,"abstract":"Antimicrobial peptides (AMPs) are being considered as a promising replacement for antibiotics. They take action in the body's adaptive immune system. While their effects inside the body are largely known, correctly identifying AMPs based on their sequence features remains a subject of active investigation. Here we optimize the use of reduced alphabets, which simplify the 20-letter amino acid alphabet to 2-4 letters, and of N-grams, short strings of amino acids, to find a correlation between a profile of N-gram frequencies. The calculations were carried out using Java programs written for this study and WEKA machine learning software. Classification using machine learning methods was then conducted for AMP subclasses, including antibacterial, antifungal, and antiviral peptides. The results show that reduced alphabets with N-gram frequency analysis are a promising alternative in the area of AMP classification and prediction. All AMP sequences were retrieved from different sources. The AMP set consists of 7984 sequences, not necessarily of any specific class. We also used class-specific AMP sets (antibacterial, antiviral, and antifungal). A raw negative set of 20258 non-AMPs was built using sequence fragments from annotated protein sequence databases. The classification of AMPs against non-AMPs was successful. Models achieved a maximum accuracy of 87.71% using frequency N-gram analysis, alphabet reduction option 47, and the RF model with 10 trees cross-validation. Classification using more specific classes of AMPs was conducted next. 
First, classification of ABPs against non-ABP AMPs achieved a maximum accuracy of 86.83% using frequency N-gram analysis, alphabet reduction option 47, and the RF model, while the bagging algorithm achieved 84.35%. Second, classification of AVPs against non-AVP AMPs achieved accuracies of 92.75% and 92.30% using frequency N-gram analysis with alphabet reduction options 47 and 29, respectively, and the RF model. This experiment also consisted of many other successful trials. RF significantly outperforms each of the other six learning algorithms. Alphabet reduction 47 most often yielded the highest classification accuracies. This finding implies that a 4-cluster alphabet is optimal for N-gram frequency analysis and machine learning. Our results suggest that the classifiers produced possess great predictive power and can be of significant use in various biological and medical applications, potentially saving tens or hundreds of thousands of lives.","PeriodicalId":131906,"journal":{"name":"Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122509419","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
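The reduced-alphabet N-gram profile described in the record above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' Java/WEKA pipeline: the four-group mapping is a hypothetical grouping chosen for the example (it is not the paper's alphabet reduction option 47), and the names `GROUPS`, `reduce_sequence`, and `ngram_profile` are invented here.

```python
from collections import Counter
from itertools import product

# Hypothetical 4-letter reduction of the 20 amino acids (assumption for
# illustration only; NOT the paper's "option 47").
GROUPS = {
    "h": "AVLIMFWC",  # hydrophobic
    "p": "STNQGPY",   # polar / small
    "b": "KRH",       # basic
    "a": "DE",        # acidic
}
# Invert the grouping: amino acid letter -> reduced symbol.
REDUCE = {aa: sym for sym, members in GROUPS.items() for aa in members}


def reduce_sequence(seq):
    """Rewrite a peptide in the reduced alphabet.

    Raises KeyError on non-standard residues such as 'X'.
    """
    return "".join(REDUCE[aa] for aa in seq.upper())


def ngram_profile(seq, n=2):
    """Return relative frequencies of all reduced-alphabet N-grams.

    Every possible N-gram gets an entry, so profiles of different peptides
    have the same length and can be used as feature vectors.
    """
    reduced = reduce_sequence(seq)
    counts = Counter(reduced[i:i + n] for i in range(len(reduced) - n + 1))
    total = sum(counts.values())
    keys = ("".join(g) for g in product(sorted(GROUPS), repeat=n))
    return {key: (counts[key] / total if total else 0.0) for key in keys}
```

With a 4-letter alphabet and n = 2 the profile has 16 entries, which could then be passed to any classifier; the choice of classifier is outside this sketch.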
S. Behera, Sutanu Gayen, J. Deogun, N. V. Vinodchandran
{"title":"KmerEstimate: A Streaming Algorithm for Estimating k-mer Counts with Optimal Space Usage","authors":"S. Behera, Sutanu Gayen, J. Deogun, N. V. Vinodchandran","doi":"10.1145/3233547.3233587","DOIUrl":"https://doi.org/10.1145/3233547.3233587","url":null,"abstract":"The frequency distribution of k-mers (substrings of length k in a DNA/RNA sequence) is very useful for many bioinformatics applications that use next-generation sequencing (NGS) data. Some examples of these include de Bruijn graph based assembly, read error correction, genome size prediction, and digital normalization. In developing tools for such applications, counting (or estimating) k-mers with low frequency is a pre-processing phase. However, computing the k-mer frequency histogram becomes computationally challenging for large-scale genomic data. We present KmerEstimate, a streaming algorithm that approximates the count of k-mers with a given frequency in a genomic data set. Our algorithm is based on a well-known adaptive sampling based streaming algorithm due to Bar-Yossef et al. for approximating distinct elements in a data stream. We implemented and tested our algorithm on several data sets. The results of our algorithm are better than those of other streaming approaches used so far for this problem (notably ntCard, the state-of-the-art streaming approach) and are within a 0.6% error rate. It uses less memory than ntCard as the sample size is almost 85% less than that of ntCard. In addition, our algorithm has provable approximation and space usage guarantees. We also show certain space complexity lower bounds. The source code of our algorithm is available at https://github.com/srbehera11/KmerEstimate.","PeriodicalId":131906,"journal":{"name":"Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122982487","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Identification of Non-invasive Cytokine Biomarkers for Polycystic Ovary Syndrome Using Supervised Machine Learning","authors":"D. S. Perry, J. Gunawardena, N. Orsi","doi":"10.1145/3233547.3233611","DOIUrl":"https://doi.org/10.1145/3233547.3233611","url":null,"abstract":"Polycystic ovary syndrome (PCOS) is a common endocrine disorder that affects up to 20% of women; however, diagnosis is commonly unreliable and non-quantitative. Here we use supervised machine learning and measurements of 51 cytokines from a large cohort of patients to identify a low-dimensional set of potential biomarkers for diagnosis of PCOS. Both whole blood and individual follicular fluid (FF) aspirates were collected from women during oocyte retrieval prior to intracytoplasmic sperm injection with in vitro fertilization (ICSI/IVF) and linked with patients' PCOS status as diagnosed by the Rotterdam criteria (n = 69 PCOS, n = 222 non-PCOS). We trained a binary support vector machine (SVM) using a random subset of patient data to determine the cytokine profile associated with PCOS. Our resultant model includes 3 variables and is 76% accurate. This provides insight into the immunological basis of PCOS and may define a potential non-invasive quantitative strategy for diagnosis.","PeriodicalId":131906,"journal":{"name":"Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics","volume":"114 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125157505","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
David Chen, S. Rust, Enju Lin, Simon M. Lin, Leslie Nelson, L. Alfano, L. Lowes
{"title":"Prediction of Clinical Outcomes of Spinal Muscular Atrophy Using Motion Tracking Data and Elastic Net Regression","authors":"David Chen, S. Rust, Enju Lin, Simon M. Lin, Leslie Nelson, L. Alfano, L. Lowes","doi":"10.1145/3233547.3233572","DOIUrl":"https://doi.org/10.1145/3233547.3233572","url":null,"abstract":"Spinal muscular atrophy (SMA) is a common muscle disease that can lead to a high rate of infant mortality. It is important to be able to quickly and accurately diagnose SMA as well as track disease progression throughout the treatment process. This study introduced a framework for deriving movement features from motion tracking data, and applied a regularized regression method to predict the gold standard clinical measures for SMA, the CHOP INTEND Extremities Scores (CIES). Our results showed that the CIES could be predicted with good accuracy using derived motion features and Elastic Net regression. An RMSE of 8.5 points on CIES was achieved in both cross-validation and prediction on the held-out set. A high ROC-AUC of 0.91 was achieved for discriminating SMA infants from controls on both session and subject levels. It was concluded that motion tracking devices could potentially be used as a low-cost yet effective method to assess and monitor infants with SMA.","PeriodicalId":131906,"journal":{"name":"Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics","volume":"73 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115215407","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Detecting Chromosomal Inversions from Dense SNPs by Combining PCA and Association Tests","authors":"R. J. Nowling, S. Emrich","doi":"10.1145/3233547.3233571","DOIUrl":"https://doi.org/10.1145/3233547.3233571","url":null,"abstract":"Principal Component Analysis (PCA) of dense single nucleotide polymorphism (SNP) data has wide-ranging applications in population genetics, including detection of chromosomal inversions. SNPs associated with each PC can be identified through single-SNP association tests performed between SNP genotypes and PC coordinates; this approach has several advantages over thresholding loading factors or sparse PCA methods. Insect vector SNP data often have a high proportion of unknown (uncalled) genotypes, however, that cannot be reliably imputed and prevent the direct usage of association tests. Building on our previous work, we propose a novel method for adjusting the association tests to handle these unknown genotypes. We demonstrate the utility of the method through two applications: detecting chromosomal inversions and characterizing differentiation processes captured by PCA. When applied to SNP data from the 2L and 2R chromosome arms of 34 karyotyped Anopheles gambiae and Anopheles coluzzii mosquitoes, our method clearly identifies the 2La, 2Rb, 2Rc, 2Rj, and 2Ru inversions. 
Using our method to identify SNPs associated with 2L-PC3, we observed one of the two insecticide-resistance variants in the Rdl gene; our results suggest that the PC is capturing differentiation driven by insecticide usage.","PeriodicalId":131906,"journal":{"name":"Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121410571","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"clustQ","authors":"R. Alapati, Debswapna Bhattacharya","doi":"10.1145/3233547.3233570","DOIUrl":"https://doi.org/10.1145/3233547.3233570","url":null,"abstract":"The structure of a protein largely determines its functional properties. Hence, knowledge of a protein's 3D structure is an important aspect in determining solutions to fundamental biological problems. Structure prediction algorithms generally employ a clustering algorithm to select the optimal model for a target from a large number of predicted conformations (a.k.a. decoys). Despite significant advancement in clustering-based optimal decoy selection methods, these approaches often cannot deliver high performance in terms of the time taken to cluster a large number of protein structures owing to the computational cost associated with pairwise structural superpositions. Here, we propose a superposition-free approach to protein decoy clustering, called clustQ, based on weighted internal distance comparisons. Experimental results suggest that the novel weighting scheme is helpful in both reproducing the decoy-native similarity score and estimating the pairwise clustering based predicted quality score in a computationally efficient manner. clustQ attains performance comparable to the state-of-the-art multi-model decoy quality estimation methods participating in the latest Critical Assessment of protein Structure Prediction (CASP) experiments irrespective of target difficulty. Moreover, the clustQ predicted score offers a unique way to reliably estimate target difficulty without knowledge of the experimental structure. 
clustQ is freely available at http://watson.cse.eng.auburn.edu/clustQ/.","PeriodicalId":131906,"journal":{"name":"Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics","volume":"229 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115888453","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
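Why internal distance comparison needs no superposition, as the record above relies on, can be shown with a small Python sketch. This is a generic, unweighted illustration and not clustQ's scoring function: the paper's weighting scheme is not reproduced, and `distance_similarity` and its `d0` parameter are hypothetical names introduced here.

```python
import math


def internal_distances(coords):
    """Pairwise distances between all points of one conformation."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)]
            for i in range(n)]


def distance_similarity(coords_a, coords_b, d0=1.0):
    """Similarity in (0, 1] between two conformations of equal length.

    Each residue pair (i, j) contributes 1 / (1 + (delta / d0) ** 2),
    where delta is the difference between the pair's internal distance in
    the two conformations. Internal distances do not change under rotation
    or translation, so no structural alignment step is required.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("conformations must have the same length")
    da = internal_distances(coords_a)
    db = internal_distances(coords_b)
    n = len(coords_a)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    if not pairs:
        return 1.0
    total = sum(1.0 / (1.0 + ((da[i][j] - db[i][j]) / d0) ** 2)
                for i, j in pairs)
    return total / len(pairs)
```

A translated or rotated copy of a structure scores 1.0 against the original, while a deformed copy scores lower; pairwise scores of this kind could then feed a clustering step.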
{"title":"Use of the Informatics for Integrating Biology and the Bedside (i2b2) Population to Test Serum Bilirubin Levels and Risk for Inflammatory Bowl Diseases and the Involvement of Uridine Glucuronosyltransferase Genes","authors":"C. Gallagher","doi":"10.1145/3233547.3233638","DOIUrl":"https://doi.org/10.1145/3233547.3233638","url":null,"abstract":"Chronic inflammation associated with inflammatory bowel disease (IBD) results in increased oxidative stress that damages the colonic microenvironment. A low level of serum bilirubin, an endogenous antioxidant, has been associated with increased risk for Crohn's disease (CD), but no study has tested another common IBD, ulcerative colitis (UC). Bilirubin is metabolized in the liver exclusively by uridine glucuronosyltransferase 1A1 (UGT1A1). Genetic variants cause functional changes in UGT1A1 which result in hyperbilirubinemia, which can be toxic to tissues if untreated and results in a characteristic jaundiced appearance. Approximately 10% of the Caucasian population is homozygous for the microsatellite polymorphism UGT1A1*28, which results in increased total serum bilirubin levels due to reduced transcriptional efficiency of UGT1A1 and an overall 70% reduction in UGT1A1 enzymatic activity. The aim of this study was to examine whether bilirubin levels are associated with the risk for ulcerative colitis (UC). Using the Informatics for Integrating Biology and the Bedside (i2b2), a large case-control population was identified from a single tertiary care center, Penn State Hershey Medical Center (PSU). Similarly, a validation cohort was identified at Virginia Commonwealth University Medical Center. Logistic regression analysis was performed to determine the risk of developing UC with lower concentrations of serum bilirubin. From the PSU cohort, a subset of terminal ileum tissue was obtained at the time of surgical resection to analyze UGT1A1 gene expression (which encodes the enzyme responsible for bilirubin metabolism). 
Similar to CD patients, UC patients also demonstrated reduced levels of total serum bilirubin. Upon segregating serum bilirubin levels into quartiles, risk of UC increased with reduced concentrations of serum bilirubin. These results were confirmed in our validation cohort. UGT1A1 gene expression was up-regulated in the terminal ileum of a subset of UC patients. Lower levels of the antioxidant bilirubin may reduce the capability of UC patients to remove reactive oxygen species leading to an increase in intestinal injury. One potential explanation for these lower bilirubin levels may be up-regulation of UGT1A1 gene expression, which encodes the only enzyme involved in conjugating bilirubin. Therapeutics that reduce oxidative stress may be beneficial for these patients.","PeriodicalId":131906,"journal":{"name":"Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116129189","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ACM Notice of Article Removal: Deep Learning Based Medical Diagnosis System Using Multiple Data Sources - originally published in the ACM Digital Library on 29-Aug-2018","authors":"Qinghan Xue, M. Chuah","doi":"10.1145/3233547.3233730","DOIUrl":"https://doi.org/10.1145/3233547.3233730","url":null,"abstract":"Recently, many researchers have conducted data mining over medical data to uncover hidden patterns and use them to learn prediction models for clinical decision making and personalized medicine. While such healthcare learning models can achieve encouraging results, they seldom incorporate existing expert knowledge into their frameworks and hence prediction accuracy for individual patients can still be improved. However, expert knowledge spans across various websites and multiple databases with heterogeneous representations and hence is difficult to harness for improving learning models. In addition, patients' queries at medical consult websites are often ambiguous in their specified terms and hence the returned responses may not contain the information they seek. To tackle these problems, we first design a knowledge extraction framework that can generate an aggregated dataset to characterize diseases by integrating heterogeneous medical data sources. Then, based on the integrated dataset, we propose an end-to-end deep learning based medical diagnosis system (DL-MDS) to provide disease diagnosis for authorized users. 
Evaluations on real-world data demonstrate that our proposed system achieves good performance on disease diagnosis with a diverse set of patients' queries.","PeriodicalId":131906,"journal":{"name":"Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128810092","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Knowledge Extraction of Long-Term Complications from Clinical Narratives of Blood Cancer Patients with HCT Treatments","authors":"Weizhong Zhu, J. B. Teh, Haiqing Li, S. Armenian","doi":"10.1145/3233547.3233635","DOIUrl":"https://doi.org/10.1145/3233547.3233635","url":null,"abstract":"Interactive information extraction (IE) systems supported by biomedical ontologies are intelligent natural language processing (NLP) tools to understand literature and clinical narratives and discover meaningful domain knowledge from unstructured text. This study developed integrated IE systems to detect treatment complications of blood cancer patients from Electronic Medical Records (EMR) in the Long-Term Follow-Up (LTFU) protocol following Hematopoietic Cell Transplantation (HCT). The performance of the proposed approach was very encouraging compared to the gold-standard datasets manually reviewed by domain experts. In addition, the NLP system identified a significant number of cases not caught by experts.","PeriodicalId":131906,"journal":{"name":"Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128510142","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MAPS","authors":"Jinbu Wang, B. Chen","doi":"10.1145/3233547.3233710","DOIUrl":"https://doi.org/10.1145/3233547.3233710","url":null,"abstract":"The adaptive immune system is a defense system against repeated infection. In order to trigger the immune response, antigen peptides from the infecting agent must first be recognized by the Major Histocompatibility Complex (MHC) proteins. Identifying peptides that bind to MHC class II is thus a critical step in vaccine development. We hypothesize that comparing individual subsites of the peptide binding groove could predict the individual amino acids of possible antigens. This modularized approach to individual subsites could reduce the amount of training data needed for accurate classification while also reducing computing times associated with molecular simulation and docking. To test this hypothesis, we evaluated the capability of two classification techniques and multiple modular representations of the MHC subsites to correctly classify the binding preference categories of P1 subsites of MHC class II structures. Our results show that the average accuracies are 0.87 for K-means and 0.95 for SVM with all feature vector configurations. 
Our results demonstrate that accurate predictions on individual binding subsites are possible, pointing to larger scale applications predicting whole-peptide preferences.","PeriodicalId":131906,"journal":{"name":"Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics","volume":"176 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121616889","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}