arXiv - QuanBio - Genomics: Latest Papers

CAVACHON: a hierarchical variational autoencoder to integrate multi-modal single-cell data
arXiv - QuanBio - Genomics Pub Date : 2024-05-28 DOI: arxiv-2405.18655
Ping-Han Hsieh, Ru-Xiu Hsiao, Katalin Ferenc, Anthony Mathelier, Rebekka Burkholz, Chien-Yu Chen, Geir Kjetil Sandve, Tatiana Belova, Marieke Lydia Kuijjer
Abstract: Paired single-cell sequencing technologies enable the simultaneous measurement of complementary modalities of molecular data at single-cell resolution. Along with the advances in these technologies, many methods based on variational autoencoders have been developed to integrate these data. However, these methods do not explicitly incorporate prior biological relationships between the data modalities, which could significantly enhance modeling and interpretation. We propose a novel probabilistic learning framework that explicitly incorporates conditional independence relationships between multi-modal data as a directed acyclic graph using a generalized hierarchical variational autoencoder. We demonstrate the versatility of our framework across various applications pertinent to single-cell multi-omics data integration. These include the isolation of common and distinct information from different modalities, modality-specific differential analysis, and integrated cell clustering. We anticipate that the proposed framework can facilitate the construction of highly flexible graphical models that can capture the complexities of biological hypotheses and unravel the connections between different biological data types, such as different modalities of paired single-cell multi-omics data. The implementation of the proposed framework can be found in the repository https://github.com/kuijjerlab/CAVACHON.
Citations: 0
Range-Limited Heaps' Law for Functional DNA Words in the Human Genome
arXiv - QuanBio - Genomics Pub Date : 2024-05-22 DOI: arxiv-2405.13825
Wentian Li, Yannis Almirantis, Astero Provata
Abstract: Heaps' or Herdan's law is a linguistic law describing the relationship between the vocabulary/dictionary size (type) and word counts (token) to be a power-law function. Its existence in genomes with certain definition of DNA words is unclear partly because the dictionary size in genome could be much smaller than that in a human language. We define a DNA word in a genome as a DNA coding region that codes for a protein domain. Using human chromosomes and chromosome arms as individual samples, we establish the existence of Heaps' law in the human genome within limited range. Our definition of words in a genomic or proteomic context is different from that in large language models for DNA or protein sequences where words are usually short. Although an approximate power-law distribution of protein domain sizes due to gene duplication and the related Zipf's law is well known, their translation to the Heaps' law in DNA words is not automatic. Several other animal genomes are shown herein also to exhibit range-limited Heaps' law with our definition of DNA words, though with various exponents, partially depending on their level of complexity. Investigation of Heaps' law and its exponent value could provide an alternative narrative of reusage and redundancy of protein domains as well as creation of new protein domains from a linguistic perspective.
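Editorial illustration (not from the paper): Heaps' law is commonly written as V(N) = K * N^beta, where N is the token count and V the number of distinct types. The sketch below estimates K and beta by ordinary least squares in log-log space. The function name `fit_heaps_law` and the data are hypothetical; the numbers are synthetic and are not taken from any genome.

```python
import math

def fit_heaps_law(token_counts, type_counts):
    """Fit log V = log K + beta * log N by ordinary least squares.

    Returns (K, beta). Inputs are equal-length sequences of positive numbers.
    """
    xs = [math.log(n) for n in token_counts]
    ys = [math.log(v) for v in type_counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx                       # slope = Heaps exponent
    K = math.exp(mean_y - beta * mean_x)   # intercept = log K
    return K, beta

# Synthetic example: data generated exactly from V = 3 * N^0.6 (made-up values).
tokens = [100, 200, 400, 800, 1600]
types = [3.0 * n ** 0.6 for n in tokens]
K, beta = fit_heaps_law(tokens, types)
print(f"K = {K:.3f}, beta = {beta:.3f}")
```

For a range-limited law like the one the abstract describes, such a fit would presumably be restricted to the interval of N where the log-log relationship is approximately linear; how that interval is chosen is not stated in the abstract.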
Citations: 0
Accurate and efficient protein embedding using multi-teacher distillation learning
arXiv - QuanBio - Genomics Pub Date : 2024-05-20 DOI: arxiv-2405.11735
Jiayu Shang, Cheng Peng, Yongxin Ji, Jiaojiao Guan, Dehan Cai, Xubo Tang, Yanni Sun
Abstract: Motivation: Protein embedding, which represents proteins as numerical vectors, is a crucial step in various learning-based protein annotation/classification problems, including gene ontology prediction, protein-protein interaction prediction, and protein structure prediction. However, existing protein embedding methods are often computationally expensive due to their large number of parameters, which can reach millions or even billions. The growing availability of large-scale protein datasets and the need for efficient analysis tools have created a pressing demand for efficient protein embedding methods. Results: We propose a novel protein embedding approach based on multi-teacher distillation learning, which leverages the knowledge of multiple pre-trained protein embedding models to learn a compact and informative representation of proteins. Our method achieves comparable performance to state-of-the-art methods while significantly reducing computational costs and resource requirements. Specifically, our approach reduces computational time by ~70% and maintains almost the same accuracy as the original large models. This makes our method well-suited for large-scale protein analysis and enables the bioinformatics community to perform protein embedding tasks more efficiently.
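Editorial illustration (not from the paper): the abstract does not specify the distillation objective, so the sketch below shows only a generic multi-teacher formulation, in which a student embedding is pulled toward several teacher embeddings through a weighted mean-squared error. The names `mse` and `multi_teacher_loss`, the uniform default weights, and the assumption that all embeddings already share one dimension are hypothetical.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def multi_teacher_loss(student, teachers, weights=None):
    """Weighted sum of per-teacher MSE terms.

    student:  list of floats (student embedding)
    teachers: list of embeddings, each the same length as `student`
              (assumed already projected to a common dimension)
    weights:  optional per-teacher weights; defaults to uniform
    """
    if weights is None:
        weights = [1.0 / len(teachers)] * len(teachers)
    return sum(w * mse(student, t) for w, t in zip(weights, teachers))

# Toy example with made-up 2-dimensional embeddings.
student = [0.0, 1.0]
teachers = [[0.0, 1.0], [2.0, 1.0]]
loss = multi_teacher_loss(student, teachers)
print(loss)
```

In practice the teachers would be large pre-trained protein embedding models and the student a much smaller network trained by gradient descent on a loss of this general kind; those details are outside what the abstract states.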
Citations: 0
An Autoencoder and Generative Adversarial Networks Approach for Multi-Omics Data Imbalanced Class Handling and Classification
arXiv - QuanBio - Genomics Pub Date : 2024-05-16 DOI: arxiv-2405.09756
Ibrahim Al-Hurani, Abedalrhman Alkhateeb, Salama Ikki
Abstract: In the relentless efforts in enhancing medical diagnostics, the integration of state-of-the-art machine learning methodologies has emerged as a promising research area. In molecular biology, there has been an explosion of data generated from multi-omics sequencing. The advent sequencing equipment can provide large number of complicated measurements per one experiment. Therefore, traditional statistical methods face challenging tasks when dealing with such high dimensional data. However, most of the information contained in these datasets is redundant or unrelated and can be effectively reduced to significantly fewer variables without losing much information. Dimensionality reduction techniques are mathematical procedures that allow for this reduction; they have largely been developed through statistics and machine learning disciplines. The other challenge in medical datasets is having an imbalanced number of samples in the classes, which leads to biased results in machine learning models. This study, focused on tackling these challenges in a neural network that incorporates autoencoder to extract latent space of the features, and Generative Adversarial Networks (GAN) to generate synthetic samples. Latent space is the reduced dimensional space that captures the meaningful features of the original data. Our model starts with feature selection to select the discriminative features before feeding them to the neural network. Then, the model predicts the outcome of cancer for different datasets. The proposed model outperformed other existing models by scoring accuracy of 95.09% for bladder cancer dataset and 88.82% for the breast cancer dataset.
Citations: 0
Fine-tuning Protein Language Models with Deep Mutational Scanning improves Variant Effect Prediction
arXiv - QuanBio - Genomics Pub Date : 2024-05-10 DOI: arxiv-2405.06729
Aleix Lafita, Ferran Gonzalez, Mahmoud Hossam, Paul Smyth, Jacob Deasy, Ari Allyn-Feuer, Daniel Seaton, Stephen Young
Abstract: Protein Language Models (PLMs) have emerged as performant and scalable tools for predicting the functional impact and clinical significance of protein-coding variants, but they still lag experimental accuracy. Here, we present a novel fine-tuning approach to improve the performance of PLMs with experimental maps of variant effects from Deep Mutational Scanning (DMS) assays using a Normalised Log-odds Ratio (NLR) head. We find consistent improvements in a held-out protein test set, and on independent DMS and clinical variant annotation benchmarks from ProteinGym and ClinVar. These findings demonstrate that DMS is a promising source of sequence diversity and supervised training data for improving the performance of PLMs for variant effect prediction.
Citations: 0
LangCell: Language-Cell Pre-training for Cell Identity Understanding
arXiv - QuanBio - Genomics Pub Date : 2024-05-09 DOI: arxiv-2405.06708
Suyuan Zhao, Jiahuan Zhang, Yizhen Luo, Yushuai Wu, Zaiqing Nie
Abstract: Cell identity encompasses various semantic aspects of a cell, including cell type, pathway information, disease information, and more, which are essential for biologists to gain insights into its biological characteristics. Understanding cell identity from the transcriptomic data, such as annotating cell types, have become an important task in bioinformatics. As these semantic aspects are determined by human experts, it is impossible for AI models to effectively carry out cell identity understanding tasks without the supervision signals provided by single-cell and label pairs. The single-cell pre-trained language models (PLMs) currently used for this task are trained only on a single modality, transcriptomics data, lack an understanding of cell identity knowledge. As a result, they have to be fine-tuned for downstream tasks and struggle when lacking labeled data with the desired semantic labels. To address this issue, we propose an innovative solution by constructing a unified representation of single-cell data and natural language during the pre-training phase, allowing the model to directly incorporate insights related to cell identity. More specifically, we introduce LangCell, the first Language-Cell pre-training framework. LangCell utilizes texts enriched with cell identity information to gain a profound comprehension of cross-modal knowledge. Results from experiments conducted on different benchmarks show that LangCell is the only single-cell PLM that can work effectively in zero-shot cell identity understanding scenarios, and also significantly outperforms existing models in few-shot and fine-tuning cell identity understanding scenarios.
Citations: 0
Whole Genome Transformer for Gene Interaction Effects in Microbiome Habitat Specificity
arXiv - QuanBio - Genomics Pub Date : 2024-05-09 DOI: arxiv-2405.05998
Zhufeng Li, Sandeep S Cranganore, Nicholas Youngblut, Niki Kilbertus
Abstract: Leveraging the vast genetic diversity within microbiomes offers unparalleled insights into complex phenotypes, yet the task of accurately predicting and understanding such traits from genomic data remains challenging. We propose a framework taking advantage of existing large models for gene vectorization to predict habitat specificity from entire microbial genome sequences. Based on our model, we develop attribution techniques to elucidate gene interaction effects that drive microbial adaptation to diverse environments. We train and validate our approach on a large dataset of high quality microbiome genomes from different habitats. We not only demonstrate solid predictive performance, but also how sequence-level information of entire genomes allows us to identify gene associations underlying complex phenotypes. Our attribution recovers known important interaction networks and proposes new candidates for experimental follow up.
Citations: 0
The Canadian VirusSeq Data Portal & Duotang: open resources for SARS-CoV-2 viral sequences and genomic epidemiology
arXiv - QuanBio - Genomics Pub Date : 2024-05-08 DOI: arxiv-2405.04734
Erin E. Gill, Baofeng Jia, Carmen Lia Murall, Raphaël Poujol, Muhammad Zohaib Anwar, Nithu Sara John, Justin Richardsson, Ashley Hobb, Abayomi S. Olabode, Alexandru Lepsa, Ana T. Duggan, Andrea D. Tyler, Arnaud N'Guessan, Atul Kachru, Brandon Chan, Catherine Yoshida, Christina K. Yung, David Bujold, Dusan Andric, Edmund Su, Emma J. Griffiths, Gary Van Domselaar, Gordon W. Jolly, Heather K. E. Ward, Henrich Feher, Jared Baker, Jared T. Simpson, Jaser Uddin, Jiannis Ragoussis, Jon Eubank, Jörg H. Fritz, José Héctor Gálvez, Karen Fang, Kim Cullion, Leonardo Rivera, Linda Xiang, Matthew A. Croxen, Mitchell Shiell, Natalie Prystajecky, Pierre-Olivier Quirion, Rosita Bajari, Samantha Rich, Samira Mubareka, Sandrine Moreira, Scott Cain, Steven G. Sutcliffe, Susanne A. Kraemer, Yann Joly, Yelizar Alturmessov, CPHLN consortium, CanCOGeN consortium, VirusSeq Data Portal Academic, Health network, Marc Fiume, Terrance P. Snutch, Cindy Bell, Catalina Lopez-Correa, Julie G. Hussin, Jeffrey B. Joy, Caroline Colijn, Paul M. K. Gordon, William W. L. Hsiao, Art F. Y. Poon, Natalie C. Knox, Mélanie Courtot, Lincoln Stein, Sarah P. Otto, Guillaume Bourque, B. Jesse Shapiro, Fiona S. L. Brinkman
Abstract: The COVID-19 pandemic led to a large global effort to sequence SARS-CoV-2 genomes from patient samples to track viral evolution and inform public health response. Millions of SARS-CoV-2 genome sequences have been deposited in global public repositories. The Canadian COVID-19 Genomics Network (CanCOGeN - VirusSeq), a consortium tasked with coordinating expanded sequencing of SARS-CoV-2 genomes across Canada early in the pandemic, created the Canadian VirusSeq Data Portal, with associated data pipelines and procedures, to support these efforts. The goal of VirusSeq was to allow open access to Canadian SARS-CoV-2 genomic sequences and enhanced, standardized contextual data that were unavailable in other repositories and that meet FAIR standards (Findable, Accessible, Interoperable and Reusable). The Portal data submission pipeline contains data quality checking procedures and appropriate acknowledgement of data generators that encourages collaboration. Here we also highlight Duotang, a web platform that presents genomic epidemiology and modeling analyses on circulating and emerging SARS-CoV-2 variants in Canada. Duotang presents dynamic changes in variant composition of SARS-CoV-2 in Canada and by province, estimates variant growth, and displays complementary interactive visualizations, with a text overview of the current situation. The VirusSeq Data Portal and Duotang resources, alongside additional analyses and resources computed from the Portal (COVID-MVP, CoVizu), are all open-source and freely available. Together, they provide an updated picture of SARS-CoV-2 evolution to spur scientific discussions, inform public discourse, and support communication with and within public health authorities. They also serve as a framework for other jurisdictions interested in open, collaborative sequence data sharing and analyses.
Citations: 0
sc-OTGM: Single-Cell Perturbation Modeling by Solving Optimal Mass Transport on the Manifold of Gaussian Mixtures
arXiv - QuanBio - Genomics Pub Date : 2024-05-06 DOI: arxiv-2405.03726
Andac Demir, Elizaveta Solovyeva, James Boylan, Mei Xiao, Fabrizio Serluca, Sebastian Hoersch, Jeremy Jenkins, Murthy Devarakonda, Bulent Kiziltan
Abstract: Influenced by breakthroughs in LLMs, single-cell foundation models are emerging. While these models show successful performance in cell type clustering, phenotype classification, and gene perturbation response prediction, it remains to be seen if a simpler model could achieve comparable or better results, especially with limited data. This is important, as the quantity and quality of single-cell data typically fall short of the standards in textual data used for training LLMs. Single-cell sequencing often suffers from technical artifacts, dropout events, and batch effects. These challenges are compounded in a weakly supervised setting, where the labels of cell states can be noisy, further complicating the analysis. To tackle these challenges, we present sc-OTGM, streamlined with less than 500K parameters, making it approximately 100x more compact than the foundation models, offering an efficient alternative. sc-OTGM is an unsupervised model grounded in the inductive bias that the scRNAseq data can be generated from a combination of the finite multivariate Gaussian distributions. The core function of sc-OTGM is to create a probabilistic latent space utilizing a GMM as its prior distribution and distinguish between distinct cell populations by learning their respective marginal PDFs. It uses a Hit-and-Run Markov chain sampler to determine the OT plan across these PDFs within the GMM framework. We evaluated our model against a CRISPR-mediated perturbation dataset, called CROP-seq, consisting of 57 one-gene perturbations. Our results demonstrate that sc-OTGM is effective in cell state classification, aids in the analysis of differential gene expression, and ranks genes for target identification through a recommender system. It also predicts the effects of single-gene perturbations on downstream gene regulation and generates synthetic scRNA-seq data conditioned on specific cell states.
Citations: 0
A Multi-Domain Multi-Task Approach for Feature Selection from Bulk RNA Datasets
arXiv - QuanBio - Genomics Pub Date : 2024-05-04 DOI: arxiv-2405.02534
Karim Salta, Tomojit Ghosh, Michael Kirby
Abstract: In this paper a multi-domain multi-task algorithm for feature selection in bulk RNAseq data is proposed. Two datasets are investigated arising from mouse host immune response to Salmonella infection. Data is collected from several strains of collaborative cross mice. Samples from the spleen and liver serve as the two domains. Several machine learning experiments are conducted and the small subset of discriminative across domains features have been extracted in each case. The algorithm proves viable and underlines the benefits of across domain feature selection by extracting new subset of discriminative features which couldn't be extracted only by one-domain approach.
Citations: 0