Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics: Latest Articles

Machine Learning Classification of Antimicrobial Peptides Using Reduced Alphabets

Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics Pub Date : 2018-08-15 DOI: 10.1145/3233547.3233657

M. Othman, Sujay Ratna, Anant Tewari, Anthony M. Kang, I. Vaisman

Abstract: Antimicrobial peptides (AMPs) are being considered as a promising replacement for antibiotics. They take action in the body's adaptive immune system. While their effect inside the body is largely known, the problem of correctly identifying AMPs from their sequence features remains a subject of active investigation. Here we optimize the use of the reduced alphabet, which simplifies the 20-letter amino acid alphabet to 2-4 letters, and the use of N-grams, short strings of amino acids, to find a correlation between a profile of N-gram frequencies. The calculations were carried out using Java programs written for this study and the WEKA machine learning software. Classification using machine learning methods was then conducted for AMP subclasses, including antibacterial, antifungal, and antiviral peptides. The results show that reduced alphabets with N-gram frequency analysis are a promising alternative in the area of AMP classification and prediction. All AMP sequences were retrieved from different sources. The AMP set consists of 7984 sequences, not necessarily of any specific class. We also used class-specific AMP sets (antibacterial, antiviral, and antifungal). A raw negative set of 20258 non-AMPs was built from sequence fragments taken from annotated protein sequence databases. The classification of AMPs against non-AMPs was successful. Models achieved a maximum accuracy of 87.71% using frequency N-gram analysis, alphabet reduction option 47, and the random forest (RF) model with 10 trees and cross-validation. Classification using more specific classes of AMPs was conducted next. First, classification of antibacterial peptides (ABPs) against non-ABP AMPs achieved a maximum accuracy of 86.83% using frequency N-gram analysis, alphabet reduction option 47, and the RF model, while the bagging algorithm reached 84.35%. Second, classification of antiviral peptides (AVPs) against non-AVP AMPs achieved accuracies of 92.75% and 92.30% using frequency N-gram analysis with alphabet reduction options 47 and 29, respectively, and the RF model. This experiment also included many other successful trials. RF significantly outperforms each of the other six learning algorithms. Alphabet reduction 47 most often yielded the highest classification accuracies. This finding implies that a 4-cluster alphabet is optimal for N-gram frequency analysis and machine learning. Our results suggest that the classifiers produced possess great predictive power and can be of significant use in various biological and medical applications, potentially saving tens or hundreds of thousands of lives.
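
The reduced-alphabet and N-gram steps described in the abstract above can be sketched as follows. This is a minimal illustration and not the authors' Java code: the four-group mapping in `GROUPS` is a hypothetical reduction chosen for the example (it is not the paper's "option 47", which the abstract does not define), and the function names are assumptions.

```python
from collections import Counter
from itertools import product

# Hypothetical 4-cluster reduction (an assumption, NOT the paper's option 47):
# a = hydrophobic/small, b = polar, c = positively charged, d = negatively charged.
GROUPS = {"a": "AVLIMFWPGC", "b": "STYNQ", "c": "KRH", "d": "DE"}
LETTER_TO_GROUP = {aa: g for g, members in GROUPS.items() for aa in members}


def reduce_sequence(seq):
    """Rewrite a peptide in the reduced alphabet, skipping unknown residues."""
    return "".join(LETTER_TO_GROUP[aa] for aa in seq.upper() if aa in LETTER_TO_GROUP)


def ngram_profile(seq, n=2):
    """Relative frequency of every possible reduced-alphabet N-gram in seq."""
    reduced = reduce_sequence(seq)
    total = max(len(reduced) - n + 1, 0)  # number of N-gram windows
    counts = Counter(reduced[i:i + n] for i in range(total))
    alphabet = sorted(GROUPS)
    return {
        "".join(p): (counts["".join(p)] / total if total else 0.0)
        for p in product(alphabet, repeat=n)
    }
```

Each profile is a fixed-length vector (16 values for bigrams over four letters), which is the kind of input a classifier such as a random forest can consume directly.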

Citations: 1

KmerEstimate: A Streaming Algorithm for Estimating k-mer Counts with Optimal Space Usage

Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics Pub Date : 2018-08-15 DOI: 10.1145/3233547.3233587

S. Behera, Sutanu Gayen, J. Deogun, N. V. Vinodchandran

Abstract: The frequency distribution of k-mers (substrings of length k in a DNA/RNA sequence) is very useful for many bioinformatics applications that use next-generation sequencing (NGS) data. Some examples of these include de Bruijn graph based assembly, read error correction, genome size prediction, and digital normalization. In developing tools for such applications, counting (or estimating) k-mers with low frequency is a pre-processing phase. However, computing the k-mer frequency histogram becomes computationally challenging for large-scale genomic data. We present KmerEstimate, a streaming algorithm that approximates the count of k-mers with a given frequency in a genomic data set. Our algorithm is based on a well-known adaptive-sampling-based streaming algorithm due to Bar-Yossef et al. for approximating distinct elements in a data stream. We implemented and tested our algorithm on several data sets. The results of our algorithm are better than those of other streaming approaches used so far for this problem (notably ntCard, the state-of-the-art streaming approach) and are within a 0.6% error rate. It uses less memory than ntCard, as the sample size is almost 85% smaller than that of ntCard. In addition, our algorithm has provable approximation and space usage guarantees. We also show certain space complexity lower bounds. The source code of our algorithm is available at https://github.com/srbehera11/KmerEstimate.
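
The adaptive-sampling idea mentioned in the abstract above can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the KmerEstimate implementation: the hash function (BLAKE2b), the `capacity` parameter, and the function names are choices made for this example, and reverse-complement (canonical) k-mers are ignored. A k-mer stays in the sample only while its hash ends in `level` zero bits; when the sample overflows, `level` is raised and the sample is thinned, so each surviving k-mer stands for about `2**level` distinct k-mers.

```python
import hashlib
from collections import Counter


def hash64(kmer):
    """Map a k-mer to a 64-bit integer (BLAKE2b is a stand-in hash choice)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def estimate_histogram(reads, k, capacity=1024):
    """Return (histogram estimate, distinct-count estimate) for the k-mers in reads."""
    level = 0    # a k-mer is sampled only if its hash ends in `level` zero bits
    sample = {}  # sampled k-mer -> [hash, occurrences seen so far]
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            h = hash64(kmer)
            if h & ((1 << level) - 1):
                continue  # not part of the current sample
            if kmer in sample:
                sample[kmer][1] += 1
            else:
                sample[kmer] = [h, 1]
            # Sample too large: halve the sampling rate and evict non-matching k-mers.
            while len(sample) > capacity:
                level += 1
                mask = (1 << level) - 1
                sample = {km: e for km, e in sample.items() if (e[0] & mask) == 0}
    scale = 1 << level  # each sampled k-mer stands for about 2**level distinct k-mers
    by_freq = Counter(count for _, count in sample.values())
    histogram = {freq: n * scale for freq, n in by_freq.items()}
    return histogram, len(sample) * scale
```

A k-mer that survives to the end satisfied the hash condition at every earlier level, so it was in the sample from its first occurrence and its stored count is exact; only the number of k-mers per frequency is estimated. When `capacity` exceeds the number of distinct k-mers, no thinning happens and the histogram is exact.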

Citations: 9

Identification of Non-invasive Cytokine Biomarkers for Polycystic Ovary Syndrome Using Supervised Machine Learning

Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics Pub Date : 2018-08-15 DOI: 10.1145/3233547.3233611

D. S. Perry, J. Gunawardena, N. Orsi

Citations: 1

Prediction of Clinical Outcomes of Spinal Muscular Atrophy Using Motion Tracking Data and Elastic Net Regression

Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics Pub Date : 2018-08-15 DOI: 10.1145/3233547.3233572

David Chen, S. Rust, Enju Lin, Simon M. Lin, Leslie Nelson, L. Alfano, L. Lowes

Citations: 1

Detecting Chromosomal Inversions from Dense SNPs by Combining PCA and Association Tests

Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics Pub Date : 2018-08-15 DOI: 10.1145/3233547.3233571

R. J. Nowling, S. Emrich

Abstract: Principal Component Analysis (PCA) of dense single nucleotide polymorphism (SNP) data has wide-ranging applications in population genetics, including detection of chromosomal inversions. SNPs associated with each PC can be identified through single-SNP association tests performed between SNP genotypes and PC coordinates; this approach has several advantages over thresholding loading factors or sparse PCA methods. However, insect vector SNP data often have a high proportion of unknown (uncalled) genotypes that cannot be reliably imputed and that prevent the direct use of association tests. Building on our previous work, we propose a novel method for adjusting the association tests to handle these unknown genotypes. We demonstrate the utility of the method through two applications: detecting chromosomal inversions and characterizing differentiation processes captured by PCA. When applied to SNP data from the 2L and 2R chromosome arms of 34 karyotyped Anopheles gambiae and Anopheles coluzzii mosquitoes, our method clearly identifies the 2La, 2Rb, 2Rc, 2Rj, and 2Ru inversions. Using our method to identify SNPs associated with 2L-PC3, we observed one of the two insecticide-resistance variants in the Rdl gene; our results suggest that the PC is capturing differentiation driven by insecticide usage.
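
A single-SNP association test between genotypes and PC coordinates, as described in the abstract above, might look like the sketch below. It is only an illustration: the abstract does not say which test statistic the authors use or how their adjustment for unknown genotypes works, so the one-way ANOVA F statistic and the simple policy of skipping uncalled genotypes (encoded here as `None`) are both assumptions, as is the function name.

```python
def anova_f(genotypes, pc_coords):
    """One-way ANOVA F statistic of PC coordinates grouped by SNP genotype.

    genotypes: per-sample genotype codes (e.g. 0, 1, 2), or None if uncalled.
    pc_coords: per-sample coordinate along one principal component.
    Returns None when fewer than two genotype classes are observed.
    """
    groups = {}
    for g, x in zip(genotypes, pc_coords):
        if g is None:
            continue  # assumption: uncalled genotypes are simply dropped
        groups.setdefault(g, []).append(x)
    n = sum(len(v) for v in groups.values())
    k = len(groups)
    if k < 2 or n <= k:
        return None
    grand_mean = sum(sum(v) for v in groups.values()) / n
    means = {g: sum(v) / len(v) for g, v in groups.items()}
    ss_between = sum(len(v) * (means[g] - grand_mean) ** 2 for g, v in groups.items())
    ss_within = sum((x - means[g]) ** 2 for g, v in groups.items() for x in v)
    if ss_within == 0:
        return float("inf")  # classes separate perfectly along this PC
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

Ranking SNPs by this statistic for a given PC highlights the variants whose genotype classes separate most clearly along that component; turning the statistic into a p-value would additionally require the F distribution, which is left out to keep the sketch dependency-free.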

Citations: 3

clustQ

Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics Pub Date : 2018-08-15 DOI: 10.1145/3233547.3233570

R. Alapati, Debswapna Bhattacharya

Abstract: The structure of a protein largely determines its functional properties. Hence, knowledge of a protein's 3D structure is an important aspect in determining solutions to fundamental biological problems. Structure prediction algorithms generally employ a clustering algorithm to select the optimal model for a target from a large number of predicted conformations (a.k.a. decoys). Despite significant advancement in clustering-based optimal decoy selection methods, these approaches often cannot deliver high performance in terms of the time taken to cluster a large number of protein structures, owing to the computational cost associated with pairwise structural superpositions. Here, we propose a superposition-free approach to protein decoy clustering, called clustQ, based on weighted internal distance comparisons. Experimental results suggest that the novel weighting scheme is helpful both in reproducing the decoy-native similarity score and in estimating the pairwise-clustering-based predicted quality score in a computationally efficient manner. clustQ attains performance comparable to the state-of-the-art multi-model decoy quality estimation methods participating in the latest Critical Assessment of protein Structure Prediction (CASP) experiments, irrespective of target difficulty. Moreover, the clustQ predicted score offers a unique way to reliably estimate target difficulty without knowledge of the experimental structure. clustQ is freely available at http://watson.cse.eng.auburn.edu/clustQ/.
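
The general idea of comparing decoys through internal distances, without superposition, can be illustrated as below. This is a sketch under stated assumptions and not the clustQ scoring function: the abstract does not give the weighting scheme, so the sequence-separation weight, the `1 / (1 + difference**2)` agreement term, and all function names are hypothetical. Each structure is a list of 3D coordinates, one per residue (for example C-alpha atoms), and the two structures must have the same length of at least two residues.

```python
import math


def internal_distances(coords):
    """All pairwise distances within one structure, keyed by residue index pair."""
    n = len(coords)
    return {(i, j): math.dist(coords[i], coords[j])
            for i in range(n) for j in range(i + 1, n)}


def similarity(coords_a, coords_b):
    """Weighted agreement of internal distances, in (0, 1]; 1 means identical."""
    if len(coords_a) != len(coords_b):
        raise ValueError("decoys must have the same number of residues")
    da, db = internal_distances(coords_a), internal_distances(coords_b)
    total = weight_sum = 0.0
    for (i, j), d in da.items():
        weight = j - i  # hypothetical weight: sequence separation
        total += weight / (1.0 + (d - db[(i, j)]) ** 2)
        weight_sum += weight
    return total / weight_sum


def consensus_scores(decoys):
    """Score each decoy by its mean similarity to all other decoys."""
    return [sum(similarity(d, other) for k, other in enumerate(decoys) if k != idx)
            / (len(decoys) - 1)
            for idx, d in enumerate(decoys)]
```

Because internal distances do not change under rotation or translation, two copies of the same fold placed anywhere in space score 1 against each other, and no pairwise structural alignment is needed.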

Citations: 6

Use of the Informatics for Integrating Biology and the Bedside (i2b2) Population to Test Serum Bilirubin Levels and Risk for Inflammatory Bowl Diseases and the Involvement of Uridine Glucuronosyltransferase Genes

Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics Pub Date : 2018-08-15 DOI: 10.1145/3233547.3233638

C. Gallagher

Abstract: Chronic inflammation associated with inflammatory bowel disease (IBD) results in increased oxidative stress that damages the colonic microenvironment. A low level of serum bilirubin, an endogenous antioxidant, has been associated with increased risk for Crohn's disease (CD), but no study has tested another common IBD, ulcerative colitis (UC). Bilirubin is metabolized in the liver exclusively by uridine glucuronosyltransferase 1A1 (UGT1A1). Genetic variants cause functional changes in UGT1A1 that result in hyperbilirubinemia, which can be toxic to tissues if untreated and produces a characteristic jaundiced appearance. Approximately 10% of the Caucasian population is homozygous for the microsatellite polymorphism UGT1A1*28, which results in increased total serum bilirubin levels due to reduced transcriptional efficiency of UGT1A1 and an overall 70% reduction in UGT1A1 enzymatic activity. The aim of this study was to examine whether bilirubin levels are associated with the risk for UC. Using the Informatics for Integrating Biology and the Bedside (i2b2), a large case-control population was identified from a single tertiary care center, Penn State Hershey Medical Center (PSU). Similarly, a validation cohort was identified at Virginia Commonwealth University Medical Center. Logistic regression analysis was performed to determine the risk of developing UC with lower concentrations of serum bilirubin. From the PSU cohort, a subset of terminal ileum tissue was obtained at the time of surgical resection to analyze expression of the UGT1A1 gene (which encodes the enzyme responsible for bilirubin metabolism). Similar to CD patients, UC patients also demonstrated reduced levels of total serum bilirubin. Upon segregating serum bilirubin levels into quartiles, risk of UC increased with reduced concentrations of serum bilirubin. These results were confirmed in our validation cohort. UGT1A1 gene expression was up-regulated in the terminal ileum of a subset of UC patients. Lower levels of the antioxidant bilirubin may reduce the capability of UC patients to remove reactive oxygen species, leading to an increase in intestinal injury. One potential explanation for these lower bilirubin levels may be up-regulation of UGT1A1 gene expression, which encodes the only enzyme involved in conjugating bilirubin. Therapeutics that reduce oxidative stress may be beneficial for these patients.

Citations: 0

ACM Notice of Article Removal: Deep Learning Based Medical Diagnosis System Using Multiple Data Sources - originally published in the ACM Digital Library on 29-Aug-2018

Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics Pub Date : 2018-08-15 DOI: 10.1145/3233547.3233730

Qinghan Xue, M. Chuah

Abstract: Recently, many researchers have conducted data mining over medical data to uncover hidden patterns and use them to learn prediction models for clinical decision making and personalized medicine. While such healthcare learning models can achieve encouraging results, they seldom incorporate existing expert knowledge into their frameworks, and hence prediction accuracy for individual patients can still be improved. However, expert knowledge is spread across various websites and multiple databases with heterogeneous representations and hence is difficult to harness for improving learning models. In addition, patients' queries at medical consultation websites are often ambiguous in their specified terms, and hence the returned responses may not contain the information they seek. To tackle these problems, we first design a knowledge extraction framework that can generate an aggregated dataset to characterize diseases by integrating heterogeneous medical data sources. Then, based on the integrated dataset, we propose an end-to-end deep learning based medical diagnosis system (DL-MDS) to provide disease diagnosis for authorized users. Evaluations on real-world data demonstrate that our proposed system achieves good performance on disease diagnosis with a diverse set of patients' queries.

Citations: 0

Knowledge Extraction of Long-Term Complications from Clinical Narratives of Blood Cancer Patients with HCT Treatments

Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics Pub Date : 2018-08-15 DOI: 10.1145/3233547.3233635

Weizhong Zhu, J. B. Teh, Haiqing Li, S. Armenian

Citations: 1

MAPS

Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics Pub Date : 2018-08-15 DOI: 10.1145/3233547.3233710

Jinbu Wang, B. Chen

Abstract: The adaptive immune system is a defense system against repeated infection. In order to trigger the immune response, antigen peptides from the infecting agent must first be recognized by the Major Histocompatibility Complex (MHC) proteins. Identifying peptides that bind to MHC class II is thus a critical step in vaccine development. We hypothesize that comparing individual subsites of the peptide binding groove could predict the individual amino acids of possible antigens. This modularized approach to individual subsites could reduce the amount of training data needed for accurate classification while also reducing computing times associated with molecular simulation and docking. To test this hypothesis, we evaluated the capability of two classification techniques and multiple modular representations of the MHC subsites to correctly classify the binding preference categories of P1 subsites of MHC class II structures. Our results show that the average accuracies are 0.87 for K-means and 0.95 for SVM with all feature vector configurations. Our results demonstrate that accurate predictions on individual binding subsites are possible, pointing to larger-scale applications predicting whole-peptide preferences.

Citations: 0