Metadata-guided Feature Disentanglement for Functional Genomics

Alexander Rakowski, Remo Monti, Viktoriia Huryn, Marta Lemanczyk, Uwe Ohler, Christoph Lippert
arXiv - QuanBio - Genomics, published 2024-05-29. DOI: https://doi.org/arxiv-2405.19057

Abstract

With the development of high-throughput technologies, genomics datasets, including functional genomics data, are growing rapidly in size. This has allowed the training of large Deep Learning (DL) models to predict epigenetic readouts, such as protein binding or histone modifications, from genome sequences. However, large dataset sizes come at the price of data consistency, since such datasets often aggregate results from many studies conducted under varying experimental conditions. While data from large-scale consortia are useful because they allow studying the effects of different biological conditions, they can also contain unwanted biases from confounding experimental factors. Here, we introduce Metadata-guided Feature Disentanglement (MFD), an approach that allows disentangling biologically relevant features from potential technical biases. MFD incorporates target metadata into model training by conditioning the weights of the model output layer on different experimental factors. It then separates the factors into disjoint groups and enforces independence of the corresponding feature subspaces with an adversarially learned penalty. We show that the metadata-driven disentanglement approach allows for better model introspection by connecting latent features to experimental factors, without compromising, and in some cases even improving, performance in downstream tasks such as enhancer prediction or genetic variant discovery. The code for our implementation is available at https://github.com/HealthML/MFD
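The idea of conditioning output-layer weights on metadata and splitting features into per-group subspaces can be illustrated with a small sketch. This is not the authors' implementation (see the repository linked above for that). All names, sizes, and metadata values below are hypothetical. The dependence measure is a simple cross-covariance stand-in, not the adversarially learned penalty the abstract describes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 4 prediction targets (experiments) and two factor groups.
n_targets, d_bio, d_tech = 4, 8, 3       # feature-subspace widths (assumed)
n_bio_levels, n_tech_levels = 3, 2       # e.g. cell types / labs (assumed)

# Made-up metadata: one categorical level per target and factor group.
bio_level = np.array([0, 1, 2, 1])
tech_level = np.array([0, 0, 1, 1])

# Tables that would be learned: each metadata level owns a weight vector
# over the feature subspace assigned to its factor group.
W_bio = rng.normal(size=(n_bio_levels, d_bio))
W_tech = rng.normal(size=(n_tech_levels, d_tech))
bias = np.zeros(n_targets)


def conditioned_output(z):
    """Map features z of shape (batch, d_bio + d_tech) to (batch, n_targets)."""
    z_bio, z_tech = z[:, :d_bio], z[:, d_bio:]
    # The output weights of target t are looked up from its metadata
    # instead of being free per-target parameters.
    w_bio = W_bio[bio_level]      # (n_targets, d_bio)
    w_tech = W_tech[tech_level]   # (n_targets, d_tech)
    return z_bio @ w_bio.T + z_tech @ w_tech.T + bias


def cross_covariance_penalty(z):
    """Stand-in dependence measure: sum of squared cross-covariances
    between the two subspaces (NOT an adversarial penalty)."""
    zc = z - z.mean(axis=0, keepdims=True)
    cov = zc[:, :d_bio].T @ zc[:, d_bio:] / (len(z) - 1)
    return float((cov ** 2).sum())


z = rng.normal(size=(16, d_bio + d_tech))  # stand-in for sequence-model features
logits = conditioned_output(z)
penalty = cross_covariance_penalty(z)
print(logits.shape, penalty)
```

In this sketch, zeroing the technical subspace of `z` leaves only the contribution tied to the biological factors, so two targets that share a biological level receive identical outputs. That is one way such a layout could connect latent features to experimental factors.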