Simple and effective embedding model for single-cell biology built from ChatGPT

Impact factor 26.8; Zone 1, Medicine; JCR Q1, Engineering, Biomedical
Yiqun Chen, James Zou
Nature Biomedical Engineering. Published 2024-12-06. DOI: 10.1038/s41551-024-01284-6 (https://doi.org/10.1038/s41551-024-01284-6)
Citations: 0

Abstract

Large-scale gene-expression data are being leveraged to pretrain models that implicitly learn gene and cellular functions. However, such models require extensive data curation and training. Here we explore a much simpler alternative: leveraging ChatGPT embeddings of genes based on the literature. We used GPT-3.5 to generate gene embeddings from text descriptions of individual genes and to then generate single-cell embeddings by averaging the gene embeddings weighted by each gene’s expression level. We also created a sentence embedding for each cell by using only the gene names ordered by their expression level. On many downstream tasks used to evaluate pretrained single-cell embedding models—particularly, tasks of gene-property and cell-type classifications—our model, which we named GenePT, achieved comparable or better performance than models pretrained from gene-expression profiles of millions of cells. GenePT shows that large-language-model embeddings of the literature provide a simple and effective path to encoding single-cell biological knowledge.
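The two cell representations described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy cell, and the two-dimensional gene embeddings are hypothetical. In practice the gene embeddings would come from a text-embedding model applied to gene descriptions, and the cell sentence would itself be passed to such a model; both of those steps are omitted here.

```python
from typing import Dict, List, Optional


def weighted_cell_embedding(
    expression: Dict[str, float],
    gene_embeddings: Dict[str, List[float]],
) -> List[float]:
    """Average gene embeddings, weighting each gene by its expression level."""
    # Only genes that are expressed and have an embedding contribute.
    genes = [g for g, x in expression.items() if x > 0 and g in gene_embeddings]
    total = sum(expression[g] for g in genes)
    if total == 0:
        raise ValueError("no expressed gene has an embedding")
    dim = len(gene_embeddings[genes[0]])
    cell = [0.0] * dim
    for g in genes:
        weight = expression[g] / total
        for i, value in enumerate(gene_embeddings[g]):
            cell[i] += weight * value
    return cell


def cell_sentence(expression: Dict[str, float], top_k: Optional[int] = None) -> str:
    """List expressed gene names from highest to lowest expression."""
    expressed = [(g, x) for g, x in expression.items() if x > 0]
    # Sort by descending expression; break ties alphabetically for determinism.
    expressed.sort(key=lambda item: (-item[1], item[0]))
    names = [g for g, _ in expressed]
    return " ".join(names[:top_k])


# Hypothetical toy data: real embeddings would be high-dimensional vectors
# produced by a language model, not hand-written 2D values.
toy_embeddings = {
    "CD3E": [1.0, 0.0],
    "CD19": [0.0, 1.0],
    "ACTB": [1.0, 1.0],
}
toy_cell = {"CD3E": 3.0, "CD19": 0.0, "ACTB": 1.0}

print(weighted_cell_embedding(toy_cell, toy_embeddings))
print(cell_sentence(toy_cell))
```

Dropping zero-expression genes from the sentence and breaking ties alphabetically are choices made for this sketch; the abstract specifies only that gene names are ordered by expression level.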

Source journal: Nature Biomedical Engineering
Subject area: Medicine, Medicine (miscellaneous)
CiteScore: 45.30
Self-citation rate: 1.10%
Articles published: 138
About the journal: Nature Biomedical Engineering is an online-only monthly journal launched in January 2017. It publishes original research, reviews, and commentary on applied biomedicine and health technology. Its audience includes life scientists who develop experimental or computational systems and methods to improve our understanding of human physiology; biomedical researchers and engineers who design or optimize therapies, assays, devices, or procedures for diagnosing or treating disease; and clinicians who use research outputs to evaluate patient health or administer therapy across clinical settings and healthcare contexts.