{"title":"Multi-omics characterization of Microtubule-actin cross linking factor 1 (MACF1) using the ISB-Cancer Genomics Cloud","authors":"Dondra Bailey, Kawther Abdilleh, Boris Aguilar, Alexis McClary","doi":"10.1145/3388440.3414918","DOIUrl":"https://doi.org/10.1145/3388440.3414918","url":null,"abstract":"Establishment of cell polarity across cell types and organisms involves distinct mechanisms that follow a common pattern: first a polarity cue arises, followed by asymmetric organization executed by polarity proteins. Loss of cell polarity has a key role in cancer development. Microtubule-actin cross-linking factor 1 (MACF1), a cytoskeletal protein, is involved in oocyte development, cell proliferation, and cell migration. In addition to these roles, MACF1 is linked to metastatic invasion leading to tumor progression in numerous human cancers, including the gynecological cancers endometrial and ovarian cancer. Given the functional importance of cell polarity, here we provide computational evidence of MACF1 involvement in gynecological cancers. The comparison of multi-omic data for patient tumor and normal cells facilitates the understanding of the molecular mechanisms that contribute to tumor cell proliferation, abnormal cell adhesion and cell migration. Leveraging the rich datasets hosted by the NCI-funded ISB-Cancer Genomics Cloud, we performed a cloud-based patient cohort analysis across diverse multi-omics datasets. We quantified differential gene expression profiles from patients in the cohort and identified somatic mutation differences. The most common genomic alteration for MACF1 was the in-frame mutation. Genomic alterations and mutations were aligned to functional domains of the MACF1 protein to determine both frequency and spatial distribution. Gene-gene expression correlation analyses identified statistically significant correlations between MACF1 and other well-known cancer driver genes.
Together, using a data-driven cloud-computing approach, we gain novel insights into the role of MACF1 regulation of cell polarity in the progression of cancer.","PeriodicalId":411338,"journal":{"name":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115202106","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HMSC","authors":"Subrata Saha, Zigeng Wang, S. Rajasekaran","doi":"10.1145/3388440.3412468","DOIUrl":"https://doi.org/10.1145/3388440.3412468","url":null,"abstract":"Widespread availability of next-generation sequencing (NGS) technologies has prompted a recent surge in interest in the microbiome. As a consequence, metagenomics is a fast growing field in bioinformatics and computational biology. An important problem in analyzing metagenomic sequenced data is to identify the microbes present in the sample and figure out their relative abundances. Genome databases such as RefSeq and GenBank provide a growing resource to characterize metagenomic sequenced datasets. However, both the size of these databases and the high degree of sequence homology that can exist between related genomes mean that accurate analysis of metagenomic reads is computationally challenging. In this article we propose a highly efficient algorithm dubbed as \"Hybrid Metagenomic Sequence Classifier\" (HMSC) to accurately detect microbes and their relative abundances in a metagenomic sample. The algorithmic approach is fundamentally different from other state-of-the-art algorithms currently existing in this domain. HMSC judiciously exploits both alignment-free and alignment-based approaches to accurately characterize metagenomic sequenced data. 
Rigorous experimental evaluations on both real and synthetic datasets show that HMSC is indeed an effective, scalable, and efficient algorithm compared to the other state-of-the-art methods in terms of accuracy, memory, and runtime.","PeriodicalId":411338,"journal":{"name":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115443069","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The impact of sample size and tissue type on the reproducibility of gene co-expression networks","authors":"K. Ovens, B. Eames, Ian McQuillan","doi":"10.1145/3388440.3412481","DOIUrl":"https://doi.org/10.1145/3388440.3412481","url":null,"abstract":"Identifying relationships between genes facilitates the comparison of different cell types at the transcriptomic level. Gene expression data such as RNA-seq can be used to construct co-expression networks, which is one means in systems biology to describe the coordinated expression patterns among genes across samples. Currently, there is no consensus as to the number of samples required to construct a reproducible gene co-expression network. Indeed, irreproducibility of gene expression experiments is a major challenge, and small sample sizes tend to be one of the major causes. However, recommending a single sample size that applies to all scenarios may not be practical. As such, we utilize a systematic, quantitative approach to study the effect of sample size on the reproducibility of constructing large, fully-connected gene co-expression networks using several correlation-based measures or mutual information. This approach does not require synthetic datasets that are constructed based on oversimplified assumptions nor is it dependent on known functional annotations. Further, we describe two similarity measures to measure consistency and use them to determine if the biological variance present within samples impacts the rate at which the networks will stabilize and compare to networks with randomly reassigned nodes. 
Our results show that the required number of samples to construct consistent co-expression networks could be influenced by the tissue type used to construct the networks as well as the similarity measure used to measure consistency.","PeriodicalId":411338,"journal":{"name":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131458023","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A data-centric pipeline using convolutional neural network to select better multiple sequence alignment method","authors":"Mengmeng Kuang, H. Ting","doi":"10.1145/3388440.3414909","DOIUrl":"https://doi.org/10.1145/3388440.3414909","url":null,"abstract":"Multiple sequence alignment (MSA) is widely used to find out the evolutionary relationship of every input sequence as well as the functional or structural roles of every aligned residue. Traditionally, the MSA problem was tackled by algorithm-centric approaches, which applied many classical computer algorithms (such as dynamic programming, divide-and-conquer and so on) and proven strategies (such as progressive, non-progressive, consistency-based, and iterative refinement strategies). Different single-algorithm MSA methods have different accuracies on protein families of differing similarity. Therefore, to integrate the advantages of different MSA methods, we present a brand-new data-centric pipeline using the convolutional neural network (CNN) [3] to choose the better MSA method for protein families of differing similarity. An MSA is very similar to a 2D picture, which has a good hierarchical structure. The conserved regions and the corresponding conserved columns in an MSA could be seen as boxes and lines in a picture. CNNs are known to be very good at recognizing imperfect, noisy pictures, which suggests they may perform well at recognizing draft MSAs. Briefly, the method first uses a quick MSA method to construct large-scale draft MSAs from the simulated protein families produced by the protein simulation tool INDELible [1]. The main step is training a CNN classifier that takes the draft MSAs as input and gives the better MSA method as output. In our research, we simulated more than 640,000 protein families with the number of sequences ranging from 3 to 64.
The fastest (but not accurate) mode of a famous MSA tool, Mafft(FFT-NS-1) [2], was used with default parameters to construct draft MSAs from those families. We regard these MSAs as two-color images, one color for the aligned residues, and the other color for the gaps. Two CNN layers and a fully connected layer with a dropout rate of 0.5 were used for training the decision model. The preliminary results suggest more than 85% accuracy in choosing the better alignment solution between the newest versions of Mafft(L-INS-i) and Mafft(G-INS-i). Currently, we are improving the performance of this pipeline by selecting better categories for protein families and fine-tuning the decision model.","PeriodicalId":411338,"journal":{"name":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129446504","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MeSH Indexing Using the Biomedical Citation Network","authors":"William Gasper, P. Chundi, D. Ghersi","doi":"10.1145/3388440.3412466","DOIUrl":"https://doi.org/10.1145/3388440.3412466","url":null,"abstract":"PubMed contains over 30 million biomedical literature citations and is an invaluable resource for researchers, medical professionals, students, and curious individuals. The search and retrieval process is significantly enhanced by PubMed's Medical Subject Heading (MeSH) indexing process, which requires a significant manual component. It is difficult to effectively apply traditional machine learning methods to large scale semantic indexing problems, and this difficulty has impeded complete automation of the MeSH indexing process. PubMed citations are particularly challenging to index: documents are often indexed with a dozen or more terms, and most terms occur extremely infrequently in the document set. This work examines the biomedical literature citation network and MeSH vocabulary for viable signal that might benefit the indexing process. Simple predictive models utilizing features generated from the biomedical literature citation network proved useful and effective in recommending MeSH terms for document indexing. 
A neural network proved similarly effective to the simple model in terms of raw performance but produced qualitatively different term recommendations.","PeriodicalId":411338,"journal":{"name":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121484335","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deep Learning Approach for Breast Cancer InClust 5 Prediction based on Multiomics Data Integration","authors":"A. Alkhateeb, Li Zhou, A. Tabl, L. Rueda","doi":"10.1145/3388440.3415992","DOIUrl":"https://doi.org/10.1145/3388440.3415992","url":null,"abstract":"Breast cancer is the most common cancer among North American women and worldwide. In this paper, we present a deep learning model based on multiomics data integration to predict the five-year interval survival of breast cancer InClust 5. The data was selected from the METABRIC dataset, which contains three omic datasets: gene expression, copy number alteration (CNA), and clinical feature datasets. The model utilizes a self-organizing map (SOM), an unsupervised method, to create an RGB feature map for each omic, which serves as the basis for the convolution layer in a convolutional neural network (CNN). In total, the model creates three CNNs, one for each omic dataset. This method is an extension of the iSOM-GSN model, where we create a feature map for each omic dataset instead of only one. The model incorporates the predictions of the three CNNs using an integration layer. The integration layer outputs the majority vote of the three predictions. The main contributions are 1) a multiomics data integration module, in which the models learn from all the omic datasets, and 2) a model to classify a one-dimensional sample vector using a CNN.
The results show high performance, with accuracy around 94 percent.","PeriodicalId":411338,"journal":{"name":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133954035","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficiently mining rich subgraphs from vertex-attributed graphs","authors":"Riyad Hakim, Saeed Salem","doi":"10.1145/3388440.3412423","DOIUrl":"https://doi.org/10.1145/3388440.3412423","url":null,"abstract":"With the rapid collection of large network data such as biological networks and social networks, it has become very important to develop efficient techniques for network analysis. In many domains, additional attribute data can be associated with entities and relationships in the network, where the network data represents relationships among entities in the network and the attribute data represents various characteristics of the corresponding entities and relationships in the network. Simultaneous analysis of both network and attribute data results in detection of subnetworks that are contextually meaningful. We propose an efficient algorithm for enumerating all connected vertex sets in an undirected graph. Extending this enumeration approach, an algorithm for enumerating all maximal cohesive connected vertex sets in a vertex-attributed graph is proposed. To prune search branches that will not yield maximal patterns, we also present three pruning techniques for efficient enumeration of the maximal cohesive connected vertex sets. Our comparative runtime analyses show the efficiency and effectiveness of our proposed approaches. Gene set enrichment analysis shows that protein-protein interaction subnetworks with similar cancer dysregulation attributes are biologically significant. 
Availability: The implementation of the algorithm is available at http://www.cs.ndsu.nodak.edu/~ssalem/richsubgraphs.html","PeriodicalId":411338,"journal":{"name":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130396180","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Diagnosing COVID-19 in X-ray Images Using HOG Image Feature and Artificial Intelligence Classifiers","authors":"Faten F. Kharbat, Tarik Elamsy, Nuha Hamada","doi":"10.1145/3388440.3415987","DOIUrl":"https://doi.org/10.1145/3388440.3415987","url":null,"abstract":"The novel coronavirus (COVID-19) pandemic is spreading across the globe at an alarming rate, causing more infections and deaths in comparison to SARS or MERS. In the absence of specific vaccines for COVID-19, early diagnosis of the disease is crucial for treatment and control. Recent research has shown that medical radiology imaging may be a more reliable, practical, and rapid method to diagnose and assess COVID-19 in comparison to the official laboratory RT-PCR tests, especially with the lack of medical professionals. In this article, we investigate the use of artificial intelligence and data mining techniques to automate the task of diagnosing COVID-19 from chest X-ray images. The results obtained are promising and improve on previously published results.","PeriodicalId":411338,"journal":{"name":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115562898","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CTDPathSim","authors":"Banabithi Bose, S. Bozdag","doi":"10.1145/3388440.3412456","DOIUrl":"https://doi.org/10.1145/3388440.3412456","url":null,"abstract":"In cancer research and drug development, human tumor-derived cell lines are a popular model for cancer patients, used to evaluate the biological functions of genes, drug efficacy, side-effects, and drug metabolism. Using these cell lines, the functional relationship between genes and drug response and the prediction of drug response based on genomic and chemical features have been studied. Knowing the drug response in real patients, however, is a more important and challenging task. To tackle this challenge, some studies integrate data from primary tumors and cancer cell lines to find associations between cell lines and tumors. These studies, however, do not integrate multi-omics datasets to their full extent. Also, several studies rely on a genome-wide correlation-based approach between cell lines and bulk tumor samples without considering the heterogeneous cell population in bulk tumors. To address these gaps, we developed a computational pipeline, CTDPathSim, a pathway activity-based approach to compute similarity between primary tumor samples and cell lines at genetic, genomic, and epigenetic levels, integrating multi-omics datasets. We utilized a deconvolution method to get cell type-specific DNA methylation and gene expression profiles and computed deconvoluted methylation and expression profiles of tumor samples. We assessed CTDPathSim by applying it to breast and ovarian cancer data in The Cancer Genome Atlas (TCGA) and cancer cell line data in the Cancer Cell Line Encyclopedia (CCLE) databases. Our results showed that highly similar sample-cell line pairs have more similar drug responses than less similar pairs for several FDA-approved cancer drugs, such as Paclitaxel, Vinorelbine and Mitomycin-c.
CTDPathSim outperformed state-of-the-art methods in recapitulating the known drug responses between samples and cell lines. Also, CTDPathSim selected a higher number of significant cell lines belonging to the same cancer types than other methods. Furthermore, the cell lines that aligned to samples were found to be clinical biomarkers for patients' survival, whereas unaligned cell lines were not. Our method could guide the selection of appropriate cell lines to serve as closer proxies for patient tumors and could direct the pre-clinical translation of drug testing into clinical platforms for personalized therapies. Furthermore, this study could guide new uses for old drugs and benefit the development of new drugs in cancer treatments.","PeriodicalId":411338,"journal":{"name":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics","volume":"282 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116085359","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Translocator","authors":"Ye Wu, Ruibang Luo, T. Lam, H. Ting, Junwen Wang","doi":"10.1145/3388440.3412457","DOIUrl":"https://doi.org/10.1145/3388440.3412457","url":null,"abstract":"Translocation is an important class of structural variants known to be associated with cancer formation and treatment. The recent development in single-molecule sequencing technologies that produce long reads has promised an advance in detecting translocations accurately. However, existing tools struggle with the high base error rate of the long reads. Figuring out the correct translocation breakpoints is especially challenging due to suboptimally aligned reads. To address the problem, we developed Translocator, a robust and accurate translocation detection method that implements an effective realignment algorithm to recover the correct alignments. For benchmarking, we analyzed NA12878 long reads against a modified GRCh38 reference genome embedded with translocations at known locations. Our results show that Translocator significantly outperformed other state-of-the-art methods, including Sniffles and PBSV. On Oxford Nanopore data, the recall improved from 48.2% to 87.5% and the precision from 88.7% to 92.7%. Translocator is available open-source at https://github.com/HKU-BAL/Translocator.","PeriodicalId":411338,"journal":{"name":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128783705","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}