Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics: Latest Papers (Page 7)

Efficient Distance Calculations Between Genomes Using Mathematical Approximation

Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics Pub Date : 2018-08-15 DOI: 10.1145/3233547.3233654

Y. Hrytsenko, Noah M. Daniels, R. Schwartz

Abstract: Clustering biological samples allows us to define populations within groups (for example, of species or cells), which permits us to answer questions about the processes occurring in those groups. Distance calculations between DNA sequences have been used to build clusters of samples. However, distance calculations for genome-scale data are limited to a small number of samples due to the size of genomic data. For example, a human genome sequenced at 10X coverage is approximately 30Gb in size. Thus, to understand biological samples it is necessary to develop efficient, accurate methods to calculate distances among many genomes. This will allow us to see similarities and differences between DNA sequences, examine their mutational patterns, and better understand evolution. In this project, we calculated cosine distances among human genome samples based on k-mer frequencies. We used publicly available Illumina reads from human genome samples from five populations. We calculated k-mer frequencies for multiple values of k in each genome sample using Jellyfish, a tool for fast, memory-efficient counting of k-mers in DNA. We calculated cosine distances between human genome k-mer profiles based on the frequency of each k-mer. We used these distances to build dendrograms of samples and infer clustering. For k-mers where k<=12, distance calculations were fast, but these distances did not capture expected population structure (i.e., known ancestry of samples). In contrast, population structure should be captured accurately for large k, but distance calculations are computationally intractable. Thus, we need an efficient way to compute genomic distance using large k.
We hypothesized that the majority of k-mers are infrequent and would not contribute to the dot product in the cosine distance calculation. Thus, removing these k-mers from the calculation could reduce computation time without impacting the cosine distance. To better understand the distribution of k-mer frequencies, we built histograms. Frequencies were normalized by the level of genome coverage. We filtered out k-mers with frequencies below 10^0, 10^1, 10^2, 10^3, 10^4, 10^5, 10^6, and 10^7. We recalculated cosine distance for each set of filtered k-mers. We examined the closest neighboring sample to each sample. Each sample's nearest neighbor remained the same after filtering out frequencies below 10^5. We rebuilt the dendrograms and confirmed that clustering of samples was not affected by filtering out frequencies below 10^5, which we determined to be the optimal filter value. Calculating cosine distances on filtered frequencies was 25 times faster than calculating on unfiltered frequencies. Calculating cosine distance between a pair of 12-mer human genome profiles takes 48 seconds using TensorFlow with a GPU-based framework; thus, distance calculations for 25 samples would take four hours. However, calculating distance on a pair of 12-mer profiles that does not include frequencies be…
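The filtering idea in this abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the toy `count_kmers` function stands in for Jellyfish, the threshold is applied to raw counts rather than coverage-normalized frequencies, and all function names are hypothetical. The last helper reproduces the abstract's arithmetic: 25 samples give 300 pairs, and 300 pairs at 48 seconds each is 14,400 seconds, i.e. four hours.

```python
import math


def count_kmers(seq, k):
    """Toy k-mer counter (a stand-in for Jellyfish, for illustration only)."""
    counts = {}
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        counts[kmer] = counts.get(kmer, 0) + 1
    return counts


def filter_profile(profile, min_count):
    """Drop k-mers whose count is below min_count (assumed filter rule)."""
    return {kmer: c for kmer, c in profile.items() if c >= min_count}


def cosine_distance(a, b):
    """Cosine distance between two sparse k-mer profiles (dict: k-mer -> count)."""
    # Only k-mers present in both profiles contribute to the dot product.
    dot = sum(c * b[kmer] for kmer, c in a.items() if kmer in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for an empty profile")
    return 1.0 - dot / (norm_a * norm_b)


def all_pairs_seconds(n_samples, seconds_per_pair):
    """Total time to compute every pairwise distance among n_samples."""
    return math.comb(n_samples, 2) * seconds_per_pair


if __name__ == "__main__":
    p = filter_profile(count_kmers("ACGTACGTAC", 2), 2)
    q = filter_profile(count_kmers("ACGTTCGTAC", 2), 2)
    print(cosine_distance(p, q))
    print(all_pairs_seconds(25, 48) / 3600, "hours")
```

Because a filtered profile has fewer entries, both the dot product and the norms touch fewer k-mers, which is where a speed-up of the kind reported would come from.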

Citations: 2

The Art of Connectivity Mapping

Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics Pub Date : 2018-08-15 DOI: 10.1145/3233547.3233610

Avi Ma’ayan

Citations: 0

Cloud-based Semantic Integration and Knowledge Discovery Systems in Precision Medicine

Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics Pub Date : 2018-08-15 DOI: 10.1145/3233547.3233650

Chuming Chen, Julie E. Cowart, Jia Ren, Sachin Gavali, Yuqi Wang, Hongzhan Huang, Cathy H. Wu, L. Arminski, Jian Zhang, P. McGarvey

Abstract: We have developed a cloud-based (AWS and IBM SoftLayer) knowledge environment for scalable semantic mining of scientific literature and PTM integrative knowledge discovery in precision medicine, building upon our novel natural language processing (NLP) technologies and bioinformatics infrastructure. We provided semantic integration of full-scale PubMed mining results from disparate text mining tools, along with kinase-substrate data from iPTMnet, and PTM proteoforms and their relations from Protein Ontology (PRO). We shared the digital objects of those applications in multiple interoperable formats and have registered them in bioCADDIE using CEDAR. We experimented with multiple system setups, using the operating system, programming language, web server, or database server that best fits each application. We evaluated the cost effectiveness of cloud computing by paying only for what we use and readily experimenting with additional services.
A web portal is available for accessing our cloud-based knowledge environment at https://proteininformationresource.org/cloud/.

Citations: 0

A Practical and Efficient Algorithm for the k-mismatch Shortest Unique Substring Finding Problem

Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics Pub Date : 2018-08-15 DOI: 10.1145/3233547.3233564

Daniel R. Allen, Sharma V. Thankachan, Bojian Xu

Abstract: This paper revisits the k-mismatch shortest unique substring finding problem and demonstrates that a technique recently presented in the context of solving the k-mismatch average common substring problem can be adapted and combined with parts of the existing solution, resulting in a new algorithm which has expected time complexity of $O(n \log^k n)$, while maintaining a practical space complexity at $O(kn)$, where n is the string length. When $k>0$, which is the hard case, our new proposal significantly improves the any-case $O(n^2)$ time complexity of the prior best method for k-mismatch shortest unique substring finding. Experimental study shows that our new algorithm is practical to implement and demonstrates significant improvements in processing time compared to the prior best solution's implementation when k is small relative to n. For example, our method processes a 200KB sample DNA sequence with $k=1$ in just 0.18 seconds compared to 174.37 seconds with the prior best solution. Further, it is observed that significant portions of the adapted technique can be executed in parallel, using two different simple concurrency models, resulting in further significant practical performance improvement. As an example, when using 8 cores, the parallel implementations both achieved processing times that are less than $1/4$ that of the serial implementation, when processing a 10MB sample DNA sequence with $k=2$. In an age where instances with thousands of gigabytes of RAM are readily available for use through Cloud infrastructure providers, it is likely that the trade-off of additional memory usage for significantly improved processing times will be desirable and needed by many users.
For example, the best prior solution may take years to process a DNA sample of 200MB for any $k>0$, while this new proposal, using 24 cores, can finish processing a sample of this size with $k=1$ in $206.376$ seconds with a peak memory usage of 46GB, which is both easily available and affordable on Cloud for many users. It is expected that this new efficient and practical algorithm for k-mismatch shortest unique substring finding will prove useful to those using the measure on long sequences in fields such as computational biology.
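To make the problem statement concrete, here is a brute-force Python sketch. It is not the algorithm from the paper and makes no attempt at the stated complexity. It assumes the usual definition: a substring covering position p is k-mismatch unique if no other substring of the same length, starting at a different position, is within Hamming distance k of it. All function names are hypothetical.

```python
def within_k_mismatches(s, i, j, length, k):
    """True if s[i:i+length] and s[j:j+length] differ in at most k positions."""
    mismatches = 0
    for t in range(length):
        if s[i + t] != s[j + t]:
            mismatches += 1
            if mismatches > k:
                return False
    return True


def is_k_mismatch_unique(s, start, length, k):
    """True if no other same-length substring is within k mismatches."""
    for other in range(len(s) - length + 1):
        if other != start and within_k_mismatches(s, start, other, length, k):
            return False
    return True


def shortest_unique_substring(s, p, k):
    """Return (start, length) of a shortest k-mismatch unique substring covering p.

    Brute force: tries lengths in increasing order and, for each length,
    every start position whose window contains index p (0-based).
    """
    n = len(s)
    if not 0 <= p < n:
        raise IndexError("p must be a valid index into s")
    for length in range(1, n + 1):
        first = max(0, p - length + 1)
        last = min(p, n - length)
        for start in range(first, last + 1):
            if is_k_mismatch_unique(s, start, length, k):
                return start, length
    return None  # unreachable: the whole string is always unique


if __name__ == "__main__":
    print(shortest_unique_substring("ACGA", 1, 0))  # exact matching
    print(shortest_unique_substring("ACGA", 1, 1))  # one mismatch allowed
```

Allowing mismatches can only make uniqueness harder to achieve, so the answer for k=1 is never shorter than the answer for k=0 at the same position.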

Citations: 4

Single-cell Clustering Based on Word Embedding and Nonparametric Methods

Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics Pub Date : 2018-08-15 DOI: 10.1145/3233547.3233590

Tianyu Wang, S. Nabavi

Abstract: Identifying cell types is one of the significant applications of single cell RNA sequencing (scRNAseq) technology, which provides insights into cellular level mechanisms and variations. Most existing methods for identifying cell types only utilize the expression matrix for clustering the cells; however, a few studies show the benefits of incorporating relationships between genes into the cell clustering procedure. In this study, we proposed a new method, Gene Mover's Distance (GMD), which is based on the nonparametric Earth Mover's Distance (EMD) and leverages a novel word embedding approach to cluster cells. In this method, both intrinsic distances between genes and their expression values are used to compute a novel distance metric for clustering. We employed the word2vec word embedding model, trained on a biological corpus, to capture the relationship between genes, and employed EMD to compute the distance between cells by considering a cell as a group of weighted points (genes). We used three single cell datasets to validate the proposed method and to evaluate its performance in comparison with three state-of-the-art clustering methods.
Results indicate that GMD outperformed these methods in clustering single cells in terms of Adjusted Rand Index and Fowlkes-Mallows Index.
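The two evaluation scores named in this abstract can be computed from pair counts over a contingency table. The sketch below uses the standard textbook formulas in pure Python. It is illustrative only and is not the authors' evaluation code; the function names are hypothetical.

```python
from collections import Counter
from math import comb, sqrt


def _pair_counts(labels_true, labels_pred):
    """Count sample pairs grouped together in both, in truth, and in prediction."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have the same length")
    contingency = Counter(zip(labels_true, labels_pred))
    same_both = sum(comb(c, 2) for c in contingency.values())
    same_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    same_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    return same_both, same_true, same_pred, comb(len(labels_true), 2)


def adjusted_rand_index(labels_true, labels_pred):
    """Rand index corrected for chance; 1.0 means identical partitions."""
    same_both, same_true, same_pred, total = _pair_counts(labels_true, labels_pred)
    if total == 0:
        return 1.0  # fewer than two samples: partitions trivially agree
    expected = same_true * same_pred / total
    max_index = (same_true + same_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. everything in a single cluster
    return (same_both - expected) / (max_index - expected)


def fowlkes_mallows_index(labels_true, labels_pred):
    """Geometric mean of pairwise precision and recall."""
    same_both, same_true, same_pred, _ = _pair_counts(labels_true, labels_pred)
    if same_true == 0 or same_pred == 0:
        return 0.0
    return same_both / sqrt(same_true * same_pred)


if __name__ == "__main__":
    truth = [0, 0, 1, 1]
    print(adjusted_rand_index(truth, [1, 1, 0, 0]))
    print(fowlkes_mallows_index(truth, [0, 1, 0, 1]))
```

Both scores depend only on which samples are grouped together, so relabeling the clusters (for example swapping 0 and 1) leaves them unchanged.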

Citations: 2

Recursive Model for Dose-time Responses in Pharmacological Studies

Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics Pub Date : 2018-08-15 DOI: 10.1145/3233547.3233681

Aminur Rahman, S. R. Dhruba, Souparno Ghosh, R. Pal

Citations: 0

Phenotyping Immune Cells in Tumor and Healthy Tissue Using Flow Cytometry Data

Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics Pub Date : 2018-08-15 DOI: 10.1145/3233547.3233583

Ye Chen, R. D. Calvert, A. Azad, Bartek Rajwa, J. Fleet, Timothy Ratliff, A. Pothen

Citations: 1

A Cooperative Vehicle Routing Platform for Logistic Management in Healthcare

Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics Pub Date : 2018-08-15 DOI: 10.1145/3233547.3233728

Valentina Falvo, M. Scalise, Francesco Lupia, Pierfrancesco Casella, M. Cannataro

Citations: 0

On the Minimum Copy Number Generation Problem in Cancer Genomics

Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics Pub Date : 2018-08-15 DOI: 10.1145/3233547.3233586

L. Qingge, Xiaozhou He, Zhihui Liu, B. Zhu

Abstract: In cancer genomics, due to fast somatic mutations (mainly random segment duplications and deletions), copy number profiles (CNPs), i.e., files containing the number of copies of each gene, are used more often than the genomes themselves. On the other hand, algorithms with performance analysis for processing CNPs are lacking. In a recent CPM'16 paper, Shamir et al. studied the copy number transformation problem, which is to use the minimum number of duplications and deletions (on the CNPs) to convert one CNP to another, and gave a linear time algorithm. In this paper, we consider a slightly different problem, called Minimum Copy Number Generation (MCNG): given a genome G and a specific CNP C, use the minimum number of duplications and deletions on G to obtain some genome H which has CNP C. We show that the problem is NP-hard if G is generic (i.e., contains duplicated genes) and when the duplications are tandem. On the other hand, when only tandem duplications are allowed, if G is exemplar (or is a permutation) and all components in C are powers of two, then the problem can be solved in time linear in the length of the input (or $|C|$) plus $O(|G| \log |G|)$ (the cost for sorting $|G|$ elements). That naturally extends to a practical heuristic algorithm for the problem (when G is exemplar and the components in C are arbitrary). We also show that two variations of the MCNG problem are at least as hard as Set Cover in terms of approximability and FPT tractability.
For the general Minimum Copy Number Generation problem, i.e., when both (arbitrary) segment duplications and deletions are allowed, we also design a practical greedy algorithm, present some non-trivial cases, and discuss directions for future research.
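The objects in this problem can be sketched in a few lines of Python. The sketch only models the basic operations (a tandem duplication, a segment deletion, and reading off a copy number profile) and does not implement any of the paper's algorithms. Representing a genome as a list of gene names, and the function names themselves, are assumptions made for illustration.

```python
from collections import Counter


def copy_number_profile(genome):
    """Return the CNP of a genome: how many copies of each gene it contains."""
    return Counter(genome)


def tandem_duplicate(genome, start, end):
    """Insert a copy of genome[start:end] immediately after the segment."""
    return genome[:end] + genome[start:end] + genome[end:]


def delete_segment(genome, start, end):
    """Remove the contiguous segment genome[start:end]."""
    return genome[:start] + genome[end:]


if __name__ == "__main__":
    g = ["a", "b", "c"]                # exemplar genome: each gene appears once
    h = tandem_duplicate(g, 0, 2)      # duplicate the segment "a b" in place
    print(h, dict(copy_number_profile(h)))
    print(delete_segment(h, 1, 3))     # removing "b a" restores the original order
```

In this toy run a single tandem duplication takes the exemplar genome to one whose CNP has two copies of `a` and `b` and one copy of `c`; MCNG asks for the fewest such operations that reach a given target CNP.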

Citations: 6

EMNets

Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics Pub Date : 2018-08-15 DOI: 10.1145/3233547.3233707

Jingjing Yang, Renzhi Cao, Dong Si

Citations: 4