Single-cell Clustering Based on Word Embedding and Nonparametric Methods

Tianyu Wang, S. Nabavi
Published in: Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics
Publication date: 2018-08-15
DOI: 10.1145/3233547.3233590 (https://doi.org/10.1145/3233547.3233590)
Citations: 2

Abstract

Identifying cell types is one of the significant applications of single-cell RNA sequencing (scRNAseq) technology, which provides insights into cellular-level mechanisms and variations. Most existing methods for identifying cell types use only the expression matrix to cluster cells; however, a few studies show the benefits of incorporating relationships between genes into the cell clustering procedure. In this study, we proposed a new method, Gene Mover's Distance (GMD), which is based on the nonparametric Earth Mover's Distance (EMD) and leverages a novel word embedding approach to cluster cells. In this method, both the intrinsic distances between genes and their expression values are used to compute a novel distance metric for clustering. We employed a word2vec word embedding model trained on a biological corpus to capture the relationships between genes, and employed EMD to compute the distance between cells by treating each cell as a group of weighted points (genes). We used three single-cell datasets to validate the proposed method and to compare its performance with three state-of-the-art clustering methods. The results indicate that GMD outperformed these methods in clustering single cells in terms of the Adjusted Rand Index and the Fowlkes-Mallows Index.
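
The idea of treating a cell as a group of weighted points (genes) and comparing two cells with EMD can be illustrated with a small sketch. This is not the authors' implementation: the function name `gene_movers_distance`, the toy embedding vectors, the Euclidean ground distance, the normalisation of expression values to unit mass, and the use of `scipy.optimize.linprog` to solve the transport problem are all assumptions made for illustration. In the paper the gene vectors come from a word2vec model trained on a biological corpus; here they are hand-picked 2-D points.

```python
import numpy as np
from scipy.optimize import linprog


def gene_movers_distance(expr_a, expr_b, gene_vectors):
    """EMD between two cells, each viewed as a set of weighted points (genes).

    expr_a, expr_b : non-negative expression values, one per gene, same gene order.
    gene_vectors   : (n_genes, dim) array of gene embeddings (toy values below).
    """
    expr_a = np.asarray(expr_a, dtype=float)
    expr_b = np.asarray(expr_b, dtype=float)
    gene_vectors = np.asarray(gene_vectors, dtype=float)

    # Assumption: normalise expression so each cell carries a total mass of 1.
    w_a = expr_a / expr_a.sum()
    w_b = expr_b / expr_b.sum()

    # Assumption: ground distance is the Euclidean distance between embeddings.
    diff = gene_vectors[:, None, :] - gene_vectors[None, :, :]
    cost = np.sqrt((diff ** 2).sum(axis=-1))

    n = len(w_a)
    # Flow variables f[i, j] are flattened row-major, so f[i, j] is at i * n + j.
    a_eq = np.zeros((2 * n, n * n))
    for i in range(n):
        a_eq[i, i * n:(i + 1) * n] = 1.0  # mass leaving gene i of cell A
        a_eq[n + i, i::n] = 1.0           # mass arriving at gene i of cell B
    b_eq = np.concatenate([w_a, w_b])

    # Minimise total transport cost subject to the mass-balance constraints.
    res = linprog(cost.ravel(), A_eq=a_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.fun


# Toy data: three hypothetical genes embedded in 2-D, two hypothetical cells.
toy_gene_vectors = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]])
cell_a = [2.0, 0.0, 0.0]  # expresses only the first gene
cell_b = [0.0, 1.0, 1.0]  # expresses the other two genes equally

print(gene_movers_distance(cell_a, cell_b, toy_gene_vectors))
```

A matrix of such pairwise distances over all cells could then be handed to any clustering routine that accepts precomputed distances; the abstract does not say which clustering algorithm the authors used.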