{"title":"Efficient Distance Calculations Between Genomes Using Mathematical Approximation","authors":"Y. Hrytsenko, Noah M. Daniels, R. Schwartz","doi":"10.1145/3233547.3233654","DOIUrl":"https://doi.org/10.1145/3233547.3233654","url":null,"abstract":"Clustering biological samples allows us to define populations within groups (for example of species or cells), which permits us to answer questions about the processes occurring in those groups. Distance calculations between DNA sequences have been used to build clusters of samples. However, distance calculations for genome-scale data are limited to a small number of samples due to the size of genomic data. For example, a human genome sequenced at 10X coverage is approximately 30Gb in size. Thus, to understand biological samples it is necessary to develop efficient, accurate methods to calculate distances among many genomes. This will allow us to see similarities and differences between DNA sequences, examine their mutational patterns, and better understand evolution. In this project, we calculated cosine distances among human genome samples based on k-mer frequencies. We used publicly available Illumina reads from human genome samples from five populations. We calculated k-mer frequencies for multiple values of k in each genome sample using Jellyfish, a tool for fast, memory-efficient counting of k-mers in DNA. We calculated cosine distances between human genome k-mer profiles based on the frequency of each k-mer. We used these distances to build dendrograms of samples and infer clustering. For k-mers where k<=12, distance calculations were fast, but these distances did not capture expected population structure (i.e. known ancestry of samples). In contrast, population structure should be captured accurately for large k, but distance calculations are computationally intractable. Thus, we need an efficient way to compute genomic distance using large k. 
We hypothesized that the majority of k-mers are infrequent and would not contribute to the dot product in the cosine distance calculation. Thus, removing these k-mers from the calculation will possibly reduce computation time without impacting the cosine distance. In order to better understand the distribution of k-mer frequencies we built histograms. Frequencies were normalized by the level of genome coverage. We filtered out k-mers with frequencies below 10^0, 10^1, 10^2, 10^3, 10^4, 10^5, 10^6, and 10^7. We recalculated cosine distance for each set of filtered k-mers. We examined the closest neighboring sample to each sample. Each sample's nearest neighbor remained the same after filtering out frequencies below 10^5. We rebuilt the dendrograms and confirmed that clustering of samples was not affected by filtering out frequencies below 10^5, which we determined to be the optimal filter value. Calculating cosine distances on filtered frequencies was 25 times faster than calculating on unfiltered frequencies. Calculating cosine distance between a pair of 12-mer human genome profiles takes 48 seconds using TensorFlow with a GPU-based framework; thus, distance calculations for 25 samples would take four hours. However, calculating distance on a pair of 12-mer profiles that does not include frequencies be","PeriodicalId":131906,"journal":{"name":"Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114233223","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Art of Connectivity Mapping","authors":"Avi Ma’ayan","doi":"10.1145/3233547.3233610","DOIUrl":"https://doi.org/10.1145/3233547.3233610","url":null,"abstract":"Motivation: The powerful idea of Connectivity Mapping proposes the creation of a library of drug-induced gene expression signatures. Such a resource can facilitate finding small molecules to mimic or reverse disease signatures, identifying drug targets, discovering the mechanisms of action for novel small molecules, elucidating off-target effect mechanisms, and directing cellular differentiation and reprogramming. A related concept is Gene Set Enrichment Analysis. Problem statement: In my presentation I will discuss how these two transformative ideas can be expanded in various creative ways to unify knowledge representation in systems biology. Approach: I will demonstrate how expanded Connectivity Mapping and Gene Set Enrichment Analyses combined with Machine Learning can enable imputing and illuminating new biological and pharmacological knowledge.","PeriodicalId":131906,"journal":{"name":"Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics","volume":"122 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116603876","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cloud-based Semantic Integration and Knowledge Discovery Systems in Precision Medicine","authors":"Chuming Chen, Julie E. Cowart, Jia Ren, Sachin Gavali, Yuqi Wang, Hongzhan Huang, Cathy H. Wu, L. Arminski, Jian Zhang, P. McGarvey","doi":"10.1145/3233547.3233650","DOIUrl":"https://doi.org/10.1145/3233547.3233650","url":null,"abstract":"We have developed a cloud-based (AWS and IBM SoftLayer) knowledge environment for scalable semantic mining of scientific literature and PTM integrative knowledge discovery in precision medicine, building upon our novel natural language processing (NLP) technologies and bioinformatics infrastructure. We provided semantic integration of full-scale PubMed mining results from disparate text mining tools, along with kinase-substrate data from iPTMnet, and PTM proteoforms and their relations from Protein Ontology (PRO). We shared the digital objects of those applications in multiple interoperable formats and have registered them in bioCADDIE using CEDAR. We experimented with multiple system setups using operating system, programming language, web server, or database server that best fits each application. We evaluated the cost effectiveness of cloud computing by only paying for what we use and readily experimenting with additional services. 
A web portal is available for accessing our cloud-based knowledge environment at https://proteininformationresource.org/cloud/.","PeriodicalId":131906,"journal":{"name":"Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125107783","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Practical and Efficient Algorithm for the k-mismatch Shortest Unique Substring Finding Problem","authors":"Daniel R. Allen, Sharma V. Thankachan, Bojian Xu","doi":"10.1145/3233547.3233564","DOIUrl":"https://doi.org/10.1145/3233547.3233564","url":null,"abstract":"This paper revisits the k-mismatch shortest unique substring finding problem and demonstrates that a technique recently presented in the context of solving the k-mismatch average common substring problem can be adapted and combined with parts of the existing solution, resulting in a new algorithm which has expected time complexity of $O(n\\log^k n)$, while maintaining a practical space complexity at $O(kn)$, where n is the string length. When $k>0$, which is the hard case, our new proposal significantly improves the any-case $O(n^2)$ time complexity of the prior best method for k-mismatch shortest unique substring finding. Experimental study shows that our new algorithm is practical to implement and demonstrates significant improvements in processing time compared to the prior best solution's implementation when k is small relative to n. For example, our method processes a 200KB sample DNA sequence with $k=1$ in just 0.18 seconds compared to 174.37 seconds with the prior best solution. Further, it is observed that significant portions of the adapted technique can be executed in parallel, using two different simple concurrency models, resulting in further significant practical performance improvement. As an example, when using 8 cores, the parallel implementations both achieved processing times that are less than $1/4$ that of the serial implementation, when processing a 10MB sample DNA sequence with $k=2$. In an age where instances with thousands of gigabytes of RAM are readily available for use through Cloud infrastructure providers, it is likely that the trade-off of additional memory usage for significantly improved processing times will be desirable and needed by many users. 
For example, the best prior solution may spend years to finish a DNA sample of 200MB for any $k>0$, while this new proposal, using 24 cores, can finish processing a sample of this size with $k=1$ in $206.376$ seconds with a peak memory usage of 46GB, which is both easily available and affordable on Cloud for many users. It is expected that this new efficient and practical algorithm for k-mismatch shortest unique substring finding will prove useful to those using the measure on long sequences in fields such as computational biology.","PeriodicalId":131906,"journal":{"name":"Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130353220","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Single-cell Clustering Based on Word Embedding and Nonparametric Methods","authors":"Tianyu Wang, S. Nabavi","doi":"10.1145/3233547.3233590","DOIUrl":"https://doi.org/10.1145/3233547.3233590","url":null,"abstract":"Identifying cell types is one of the significant applications of single cell RNA sequencing (scRNAseq) technology, which provides insights into cellular level mechanisms and variations. Most existing methods for identifying cell types only utilize the expression matrix for clustering the cells; however, a few studies show the benefits of incorporating relationships between genes into the cell clustering procedure. In this study, we proposed a new method, Gene Mover's Distance (GMD), that is based on a nonparametric Earth Mover's Distance (EMD) and leverages a novel word embedding approach to cluster cells. In this method both intrinsic distances between genes and their expression values are used to compute a novel distance metric for clustering. We employed the word embedding word2vec model, which was trained on a biological corpus, to capture the relationships between genes and employed EMD to compute the distance between cells by considering a cell as a group of weighted points (genes). We used three single cell datasets to validate the proposed method and to evaluate its performance in comparison with three state-of-the-art clustering methods. 
Results indicate that GMD outperformed these methods in clustering single cells in terms of Adjusted Rand Index and Fowlkes-Mallows Index.","PeriodicalId":131906,"journal":{"name":"Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics","volume":"71 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127163922","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Recursive Model for Dose-time Responses in Pharmacological Studies","authors":"Aminur Rahman, S. R. Dhruba, Souparno Ghosh, R. Pal","doi":"10.1145/3233547.3233681","DOIUrl":"https://doi.org/10.1145/3233547.3233681","url":null,"abstract":"Clinical studies often track dose-response curves of subjects over time. One can easily model the dose-response curve at each time point with the Hill equation, but such a model fails to capture the temporal evolution of the curves. On the other hand, one can use the Gompertz equation to model the dose-time curves at each time point without capturing the evolution of time curves across dosage. In this article, we propose a parametric model for dose-time responses that follows the Gompertz law in time and approximately follows the Hill equation across dose. We derive a recursion relation for dose-response curves over time capturing the temporal evolution. We then specify a regression model connecting the parameters controlling the dose-time responses with individual-level proteomic data. The resultant joint model allows us to predict the dose-response curves over time for new individuals. We illustrate the superior performance of our proposed model as compared to the individual models using data from the HMS-LINCS database.","PeriodicalId":131906,"journal":{"name":"Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127684900","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Phenotyping Immune Cells in Tumor and Healthy Tissue Using Flow Cytometry Data","authors":"Ye Chen, R. D. Calvert, A. Azad, Bartek Rajwa, J. Fleet, Timothy Ratliff, A. Pothen","doi":"10.1145/3233547.3233583","DOIUrl":"https://doi.org/10.1145/3233547.3233583","url":null,"abstract":"We present an automated pipeline capable of distinguishing the phenotypes of myeloid-derived suppressor cells (MDSC) in healthy and tumor-bearing tissues in mice using flow cytometry data. In contrast to earlier work where samples are analyzed individually, we analyze all samples from each tissue collectively using a representative template for it. We demonstrate with 43 flow cytometry samples collected from three tissues, naive bone-marrow, spleens of tumor-bearing mice, and intra-peritoneal tumor, that a set of templates serves as a better classifier than popular machine learning approaches including support vector machines and neural networks. Our \"interpretable machine learning\" approach goes beyond classification and identifies distinctive phenotypes associated with each tissue, information that is clinically useful. Hence the pipeline presented here leads to better understanding of the maturation and differentiation of MDSCs using high-throughput data.","PeriodicalId":131906,"journal":{"name":"Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics","volume":"10074 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121038096","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Cooperative Vehicle Routing Platform for Logistic Management in Healthcare","authors":"Valentina Falvo, M. Scalise, Francesco Lupia, Pierfrancesco Casella, M. Cannataro","doi":"10.1145/3233547.3233728","DOIUrl":"https://doi.org/10.1145/3233547.3233728","url":null,"abstract":"In order to reduce the cost of healthcare processes, optimization systems are used to optimize logistics in healthcare. Algorithms for solving the so-called Vehicle Routing Problem (VRP) are increasingly applied in healthcare systems requiring the movement of nursing/medical staff or patients. In this paper, we introduce a novel software platform that uses a cooperative vehicle routing algorithm and is able to reduce transportation costs in healthcare applications requiring the movement of nursing/medical staff. The COOP_VR platform adapts an already existing VRP algorithm to the healthcare context and allows the cooperation between two independent healthcare organizations (shippers) that manage their own vehicle fleets in a given geographic area. Preliminary simulation results show the possibility to reduce costs for both healthcare organizations in a range between 1% and 21% of the initial transportation costs.","PeriodicalId":131906,"journal":{"name":"Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121644373","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On the Minimum Copy Number Generation Problem in Cancer Genomics","authors":"L. Qingge, Xiaozhou He, Zhihui Liu, B. Zhu","doi":"10.1145/3233547.3233586","DOIUrl":"https://doi.org/10.1145/3233547.3233586","url":null,"abstract":"In cancer genomics, due to the fast somatic mutations (mainly random segment duplications and deletions), copy number profiles (CNPs) (i.e., a file containing each of the gene numbers) are used more often than genomes themselves. On the other hand, algorithms with performance analysis for processing CNPs are lacking. In a recent CPM'16 paper, Shamir et al. studied the copy number transformation problem, which is to use the minimum number of duplications and deletions (on the CNPs) to convert one CNP to another, and gave a linear time algorithm. In this paper, we consider a slightly different problem which is called Minimum Copy Number Generation (MCNG), namely, given a genome G and a specific CNP C, use the minimum number of duplications and deletions on G to obtain some genome H which has a CNP C. We show that the problem is NP-hard if G is generic (i.e., contains duplicated genes) and when the duplications are tandem. On the other hand, when only tandem duplications are allowed, if G is exemplar (or is a permutation) and all components in C are powers of two, then the problem can be solved in time linear in the length of the input (or $|C|$) plus $O(|G|\\log |G|)$ (the cost for sorting $|G|$ elements). That naturally extends to a practical heuristic algorithm for the problem (when G is exemplar and the components in C are arbitrary). We also show that two variations of the MCNG problem are at least as hard as Set Cover in terms of approximability and FPT tractability. 
For the general Minimum Copy Number Generation problem, i.e., when both (arbitrary) segment duplications and deletions are allowed, we also design a practical greedy algorithm, present some non-trivial cases and discuss the directions for future research.","PeriodicalId":131906,"journal":{"name":"Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics","volume":"83 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124117102","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"EMNets","authors":"Jingjing Yang, Renzhi Cao, Dong Si","doi":"10.1145/3233547.3233707","DOIUrl":"https://doi.org/10.1145/3233547.3233707","url":null,"abstract":"Protein surface shape plays an essential role in various functions of proteins. In order to efficiently investigate protein function and evolutionary history, we introduce a global protein surface shape representation called EMNets. EMNets provides an effective and accurate way of protein surface representation and similarity search, and thus contributes to biomedical research. The method uses a Convolutional Autoencoder (CAE) neural network to learn the geometric information of three-dimensional (3D) density maps in a data-driven manner. Our method effectively represents a 3D cryo-electron microscopy density map by using a descriptor consisting of only 256 numeric variables, which is called the EMNets descriptor. Based on the EMNets descriptor, we are able to retrieve similar protein surfaces using the k-nearest-neighbor algorithm in real time. The search results for protein surfaces represented with the EMNets descriptor have shown high agreement with the existing Combinatorial Extension (CE) algorithm of sequence and structure similarity search. Overall, EMNets is a powerful tool for comparing 3D protein structures obtained by cryo-electron microscopy.","PeriodicalId":131906,"journal":{"name":"Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115369942","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}