Suyuan Zhao, Jiahuan Zhang, Yizhen Luo, Yushuai Wu, Zaiqing Nie
LangCell: Language-Cell Pre-training for Cell Identity Understanding
arXiv - QuanBio - Genomics, published 2024-05-09. DOI: arxiv-2405.06708
Citations: 0
Abstract
Cell identity encompasses various semantic aspects of a cell, including cell type, pathway information, disease information, and more, which are essential for biologists to gain insights into its biological characteristics. Understanding cell identity from transcriptomic data, such as annotating cell types, has become an important task in bioinformatics. Because these semantic aspects are determined by human experts, AI models cannot effectively carry out cell identity understanding tasks without the supervision signals provided by paired single-cell data and labels. The single-cell pre-trained language models (PLMs) currently used for this task are trained on only a single modality, transcriptomic data, and therefore lack knowledge of cell identity. As a result, they must be fine-tuned for downstream tasks and struggle when labeled data with the desired semantic annotations are scarce. To address this issue, we propose an innovative solution that constructs a unified representation of single-cell data and natural language during the pre-training phase, allowing the model to directly incorporate insights related to cell identity. Specifically, we introduce \textbf{LangCell}, the first \textbf{Lang}uage-\textbf{Cell} pre-training framework. LangCell utilizes texts enriched with cell identity information to gain a profound comprehension of cross-modal knowledge. Experiments on a range of benchmarks show that LangCell is the only single-cell PLM that works effectively in zero-shot cell identity understanding scenarios, and it also significantly outperforms existing models in few-shot and fine-tuning settings.
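Cross-modal pre-training of the kind the abstract describes is commonly built on a contrastive objective that pulls paired cell and text embeddings together and pushes unpaired ones apart. The sketch below shows one such objective, a symmetric InfoNCE loss over a batch of paired embeddings. This is a generic illustration of the technique, not LangCell's actual training code: the function names, embedding dimensions, and temperature value are all assumptions, and the paper may combine this with other objectives.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Project embeddings onto the unit sphere so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def cell_text_contrastive_loss(cell_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss aligning paired cell and text embeddings.

    Row i of `cell_emb` and row i of `text_emb` are treated as a positive
    pair; all other rows in the batch serve as in-batch negatives.
    (Illustrative sketch; temperature 0.07 is a conventional default.)
    """
    c = l2_normalize(np.asarray(cell_emb, dtype=float))
    t = l2_normalize(np.asarray(text_emb, dtype=float))
    logits = c @ t.T / temperature        # (batch, batch) similarity matrix
    labels = np.arange(logits.shape[0])   # positives lie on the diagonal

    def xent(lg):
        # Numerically stable cross-entropy with the diagonal as the target class.
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average the cell->text and text->cell directions.
    return 0.5 * (xent(logits) + xent(logits.T))

# Toy usage: 4 cells with 8-dim embeddings and their paired text embeddings.
rng = np.random.default_rng(0)
cells = rng.normal(size=(4, 8))
loss_aligned = cell_text_contrastive_loss(cells, cells)  # perfectly aligned pairs
loss_random = cell_text_contrastive_loss(cells, rng.normal(size=(4, 8)))
```

With perfectly aligned pairs the diagonal similarities dominate and the loss approaches zero, while randomly paired embeddings give a loss near log(batch size); in a real pipeline the two embedding batches would come from a cell encoder over transcriptomic profiles and a text encoder over identity descriptions.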