Metadata-guided Feature Disentanglement for Functional Genomics

arXiv - QuanBio - Genomics | Pub Date: 2024-05-29 | DOI: arxiv-2405.19057

Alexander Rakowski, Remo Monti, Viktoriia Huryn, Marta Lemanczyk, Uwe Ohler, Christoph Lippert



Abstract

With the development of high-throughput technologies, genomics datasets, including functional genomics data, are growing rapidly in size. This has allowed the training of large Deep Learning (DL) models to predict epigenetic readouts, such as protein binding or histone modifications, from genome sequences. However, large dataset sizes come at the price of data consistency, since such datasets often aggregate results from many studies conducted under varying experimental conditions. While data from large-scale consortia are useful because they allow studying the effects of different biological conditions, they can also contain unwanted biases from confounding experimental factors. Here, we introduce Metadata-guided Feature Disentanglement (MFD), an approach that allows disentangling biologically relevant features from potential technical biases. MFD incorporates target metadata into model training by conditioning the weights of the model output layer on different experimental factors. It then separates the factors into disjoint groups and enforces independence of the corresponding feature subspaces with an adversarially learned penalty. We show that the metadata-driven disentanglement approach allows for better model introspection by connecting latent features to experimental factors, without compromising, and in some cases even improving, performance in downstream tasks such as enhancer prediction or genetic variant discovery. The code for our implementation is available at https://github.com/HealthML/MFD
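To make the idea of conditioning output-layer weights on experimental factors more concrete, the following is a minimal sketch and not the authors' implementation (see the linked repository for that). It assumes two hypothetical factor groups, "biological" and "technical", each owning a disjoint slice of the feature vector, and it builds the output weights for a target from per-level embeddings of that target's metadata. The function names, dimensions, and factor levels are illustrative assumptions, and the adversarial independence penalty described in the abstract is deliberately not shown.

```python
import random

# Hypothetical sketch: output weights for each prediction target are looked up
# from that target's metadata, and each factor group only sees its own slice
# of the feature vector. The adversarial penalty is NOT implemented here.


def make_embeddings(levels, dim, rng):
    """Create one weight vector of length `dim` per factor level (assumed init)."""
    return {level: [rng.gauss(0.0, 0.1) for _ in range(dim)] for level in levels}


def predict_logit(features, target_meta, embeddings, slices):
    """Sum, over factor groups, the dot product of the group's feature slice
    with the embedding of the target's level for that group."""
    logit = 0.0
    for group, level in target_meta.items():
        lo, hi = slices[group]
        weights = embeddings[group][level]
        logit += sum(f * w for f, w in zip(features[lo:hi], weights))
    return logit


rng = random.Random(0)

# Disjoint feature subspaces per factor group (assumed sizes: 4 and 2).
slices = {"biological": (0, 4), "technical": (4, 6)}

# Illustrative factor levels; real metadata would come from the dataset.
embeddings = {
    "biological": make_embeddings(["K562", "HepG2"], 4, rng),
    "technical": make_embeddings(["lab_A", "lab_B"], 2, rng),
}

# Stand-in for the latent features a sequence model would produce.
features = [rng.gauss(0.0, 1.0) for _ in range(6)]

# Two targets that share biology but differ in a technical factor.
target_a = {"biological": "K562", "technical": "lab_A"}
target_b = {"biological": "K562", "technical": "lab_B"}

logit_a = predict_logit(features, target_a, embeddings, slices)
logit_b = predict_logit(features, target_b, embeddings, slices)
print(logit_a, logit_b)
```

In this toy setup, the two targets differ only through the technical slice of the features, which is the kind of attribution between latent features and experimental factors that the abstract refers to as model introspection.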


