Single-cell Clustering Based on Word Embedding and Nonparametric Methods
Authors: Tianyu Wang, S. Nabavi
Published in: Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, 2018-08-15
DOI: https://doi.org/10.1145/3233547.3233590
Identifying cell types is one of the major applications of single-cell RNA sequencing (scRNAseq) technology, which provides insight into cellular-level mechanisms and variation. Most existing methods for identifying cell types use only the expression matrix to cluster cells; however, a few studies have shown the benefit of incorporating relationships between genes into the clustering procedure. In this study, we proposed a new method, Gene Mover's Distance (GMD), which is based on the nonparametric Earth Mover's Distance (EMD) and uses a word embedding approach to cluster cells. The method uses both the intrinsic distances between genes and their expression values to compute a new distance metric for clustering. We employed a word2vec word embedding model trained on a biological corpus to capture the relationships between genes, and we used EMD to compute the distance between cells by treating each cell as a set of weighted points (genes). We used three single-cell datasets to validate the proposed method and compared its performance with three state-of-the-art clustering methods. The results indicate that GMD outperformed these three methods in clustering single cells in terms of the Adjusted Rand Index and the Fowlkes-Mallows Index.
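The idea of treating a cell as a set of weighted gene points and comparing two cells with EMD can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy 2-D vectors in `gene_vecs` stand in for word2vec gene embeddings, expression values are normalized so that each cell has total mass 1, the ground distance between genes is Euclidean, and the transport problem is solved as a linear program with `scipy.optimize.linprog`. The function name `gene_movers_distance` and all data are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog


def gene_movers_distance(expr_a, expr_b, gene_vecs):
    """EMD between two cells, each treated as a weighted set of gene embeddings."""
    expr_a = np.asarray(expr_a, dtype=float)
    expr_b = np.asarray(expr_b, dtype=float)
    gene_vecs = np.asarray(gene_vecs, dtype=float)
    ia = np.flatnonzero(expr_a)  # indices of genes expressed in cell A
    ib = np.flatnonzero(expr_b)  # indices of genes expressed in cell B
    # Assumption: normalize expression so each cell carries total mass 1.
    wa = expr_a[ia] / expr_a[ia].sum()
    wb = expr_b[ib] / expr_b[ib].sum()
    # Assumption: ground cost is the Euclidean distance between embeddings.
    diff = gene_vecs[ia][:, None, :] - gene_vecs[ib][None, :, :]
    cost = np.linalg.norm(diff, axis=2)
    n, m = cost.shape
    # Flow variables f[i, j] are flattened row-major.
    # Row sums must equal wa (mass leaving A); column sums must equal wb.
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0
    for j in range(m):
        A_eq[n + j, j::m] = 1.0
    b_eq = np.concatenate([wa, wb])
    res = linprog(cost.ravel(), A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return float(res.fun)


# Hypothetical embeddings for four genes: two tight pairs far apart.
gene_vecs = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])

# Hypothetical expression counts per cell (one entry per gene).
cell_a = [3, 1, 0, 0]
cell_b = [1, 3, 0, 0]
cell_c = [0, 0, 2, 2]

# Pairwise distance matrix between cells.
cells = np.array([cell_a, cell_b, cell_c])
dist = np.zeros((len(cells), len(cells)))
for p in range(len(cells)):
    for q in range(p + 1, len(cells)):
        dist[p, q] = dist[q, p] = gene_movers_distance(cells[p], cells[q], gene_vecs)

print(dist)
```

In this toy example, cells A and B express the same nearby genes in different proportions, so their distance is small, while cell C expresses genes that lie far away in the embedding space and ends up far from both. The resulting matrix could be passed to any clustering algorithm that accepts precomputed distances; the abstract does not say which one the authors used.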