Suyuan Zhao, Jiahuan Zhang, Yizhen Luo, Yushuai Wu, Zaiqing Nie
{"title":"LangCell:理解细胞特性的语言-细胞预培训","authors":"Suyuan Zhao, Jiahuan Zhang, Yizhen Luo, Yushuai Wu, Zaiqing Nie","doi":"arxiv-2405.06708","DOIUrl":null,"url":null,"abstract":"Cell identity encompasses various semantic aspects of a cell, including cell\ntype, pathway information, disease information, and more, which are essential\nfor biologists to gain insights into its biological characteristics.\nUnderstanding cell identity from the transcriptomic data, such as annotating\ncell types, have become an important task in bioinformatics. As these semantic\naspects are determined by human experts, it is impossible for AI models to\neffectively carry out cell identity understanding tasks without the supervision\nsignals provided by single-cell and label pairs. The single-cell pre-trained\nlanguage models (PLMs) currently used for this task are trained only on a\nsingle modality, transcriptomics data, lack an understanding of cell identity\nknowledge. As a result, they have to be fine-tuned for downstream tasks and\nstruggle when lacking labeled data with the desired semantic labels. To address\nthis issue, we propose an innovative solution by constructing a unified\nrepresentation of single-cell data and natural language during the pre-training\nphase, allowing the model to directly incorporate insights related to cell\nidentity. More specifically, we introduce \\textbf{LangCell}, the first\n\\textbf{Lang}uage-\\textbf{Cell} pre-training framework. LangCell utilizes texts\nenriched with cell identity information to gain a profound comprehension of\ncross-modal knowledge. 
Results from experiments conducted on different\nbenchmarks show that LangCell is the only single-cell PLM that can work\neffectively in zero-shot cell identity understanding scenarios, and also\nsignificantly outperforms existing models in few-shot and fine-tuning cell\nidentity understanding scenarios.","PeriodicalId":501070,"journal":{"name":"arXiv - QuanBio - Genomics","volume":"189 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-05-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"LangCell: Language-Cell Pre-training for Cell Identity Understanding\",\"authors\":\"Suyuan Zhao, Jiahuan Zhang, Yizhen Luo, Yushuai Wu, Zaiqing Nie\",\"doi\":\"arxiv-2405.06708\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Cell identity encompasses various semantic aspects of a cell, including cell\\ntype, pathway information, disease information, and more, which are essential\\nfor biologists to gain insights into its biological characteristics.\\nUnderstanding cell identity from the transcriptomic data, such as annotating\\ncell types, have become an important task in bioinformatics. As these semantic\\naspects are determined by human experts, it is impossible for AI models to\\neffectively carry out cell identity understanding tasks without the supervision\\nsignals provided by single-cell and label pairs. The single-cell pre-trained\\nlanguage models (PLMs) currently used for this task are trained only on a\\nsingle modality, transcriptomics data, lack an understanding of cell identity\\nknowledge. As a result, they have to be fine-tuned for downstream tasks and\\nstruggle when lacking labeled data with the desired semantic labels. 
To address\\nthis issue, we propose an innovative solution by constructing a unified\\nrepresentation of single-cell data and natural language during the pre-training\\nphase, allowing the model to directly incorporate insights related to cell\\nidentity. More specifically, we introduce \\\\textbf{LangCell}, the first\\n\\\\textbf{Lang}uage-\\\\textbf{Cell} pre-training framework. LangCell utilizes texts\\nenriched with cell identity information to gain a profound comprehension of\\ncross-modal knowledge. Results from experiments conducted on different\\nbenchmarks show that LangCell is the only single-cell PLM that can work\\neffectively in zero-shot cell identity understanding scenarios, and also\\nsignificantly outperforms existing models in few-shot and fine-tuning cell\\nidentity understanding scenarios.\",\"PeriodicalId\":501070,\"journal\":{\"name\":\"arXiv - QuanBio - Genomics\",\"volume\":\"189 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-05-09\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - QuanBio - Genomics\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2405.06708\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - QuanBio - Genomics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2405.06708","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
LangCell: Language-Cell Pre-training for Cell Identity Understanding
Cell identity encompasses various semantic aspects of a cell, including cell
type, pathway information, disease information, and more, which are essential
for biologists to gain insights into its biological characteristics.
Understanding cell identity from transcriptomic data, such as by annotating
cell types, has become an important task in bioinformatics. Because these semantic
aspects are defined by human experts, AI models cannot effectively carry out
cell identity understanding tasks without the supervision signals provided by
paired single-cell data and labels. The single-cell pre-trained
language models (PLMs) currently used for this task are trained on only a
single modality, transcriptomic data, and therefore lack knowledge of cell
identity. As a result, they must be fine-tuned for downstream tasks and
struggle when labeled data with the desired semantic labels are scarce. To address
this issue, we propose an innovative solution by constructing a unified
representation of single-cell data and natural language during the pre-training
phase, allowing the model to directly incorporate insights related to cell
identity. More specifically, we introduce \textbf{LangCell}, the first
\textbf{Lang}uage-\textbf{Cell} pre-training framework. LangCell utilizes texts
enriched with cell identity information to gain a profound comprehension of
cross-modal knowledge. Results from experiments conducted on different
benchmarks show that LangCell is the only single-cell PLM that can work
effectively in zero-shot cell identity understanding scenarios, and also
significantly outperforms existing models in few-shot and fine-tuning cell
identity understanding scenarios.
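The zero-shot scenario described above can be illustrated with a minimal sketch: once cell and text encoders share an embedding space, a cell is classified by comparing its embedding against the embeddings of candidate cell-type descriptions, with no task-specific fine-tuning. The encoders and embeddings below are hypothetical stand-ins (toy fixed vectors), not the actual LangCell model or its API.

```python
import math

# Toy stand-ins for the outputs of pre-trained, cross-modally aligned
# cell and text encoders. In the real framework these embeddings would
# come from transformer encoders trained on single-cell/text pairs.

def l2_normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity of two already-normalized vectors."""
    return sum(x * y for x, y in zip(a, b))

def zero_shot_classify(cell_emb, label_embs):
    """Assign the label whose text embedding lies closest to the cell embedding."""
    cell = l2_normalize(cell_emb)
    scores = {name: cosine(cell, l2_normalize(e))
              for name, e in label_embs.items()}
    return max(scores, key=scores.get), scores

# Hypothetical embeddings: imagine the aligned encoders mapped a T-cell
# expression profile near the "T cell" description and far from the rest.
label_embs = {
    "T cell":  [1.0, 0.2, 0.0, 0.1, 0.0, 0.0],
    "B cell":  [0.0, 1.0, 0.3, 0.0, 0.0, 0.0],
    "NK cell": [0.0, 0.0, 0.0, 0.0, 1.0, 0.2],
}
cell_emb = [0.9, 0.25, 0.05, 0.1, 0.05, 0.0]

pred, scores = zero_shot_classify(cell_emb, label_embs)
print(pred)  # → T cell
```

Adding a new cell type at inference time only requires embedding its textual description, which is what lets a cross-modal PLM operate without labeled examples of that type.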