Metadata-guided Feature Disentanglement for Functional Genomics
Alexander Rakowski, Remo Monti, Viktoriia Huryn, Marta Lemanczyk, Uwe Ohler, Christoph Lippert
arXiv - QuanBio - Genomics, 2024-05-29. https://doi.org/arxiv-2405.19057
Abstract
With the development of high-throughput technologies, genomics datasets,
including functional genomics data, are rapidly growing in size. This has allowed the
training of large Deep Learning (DL) models to predict epigenetic readouts,
such as protein binding or histone modifications, from genome sequences.
However, large dataset sizes come at the cost of data consistency, as such
datasets often aggregate results from a large number of studies conducted under
varying experimental conditions. While data from large-scale consortia are useful as
they allow studying the effects of different biological conditions, they can
also contain unwanted biases from confounding experimental factors. Here, we
introduce Metadata-guided Feature Disentanglement (MFD) - an approach that
allows disentangling biologically relevant features from potential technical
biases. MFD incorporates target metadata into model training, by conditioning
weights of the model output layer on different experimental factors. It then
separates the factors into disjoint groups and enforces independence of the
corresponding feature subspaces with an adversarially learned penalty. We show
that the metadata-driven disentanglement approach allows for better model
introspection by connecting latent features to experimental factors, without
compromising, and in some cases even improving, performance in downstream tasks
such as enhancer prediction or genetic variant discovery. The code for our
implementation is available at https://github.com/HealthML/MFD
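
To make the idea of metadata-conditioned output weights more concrete, the following is a minimal NumPy sketch, not the authors' implementation (see the repository above for that). It assumes that each output target is described by one value per experimental factor, that every factor value has a learned embedding, and that a target's output weights within a factor group are the sum of its factor-value embeddings. The factor names, values, subspace sizes, and the helper names `output_weights`, `predict`, and `permute_subspace` are all hypothetical. The last helper shows one common way to draw samples with independent subspaces, by shuffling one subspace across the batch; an adversary could be trained to tell these apart from the original batch, though whether MFD builds its penalty this way is an assumption here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical metadata: one value per experimental factor for each output target.
targets = [
    {"cell_type": "K562", "assay": "CTCF", "lab": "lab_A"},
    {"cell_type": "HepG2", "assay": "CTCF", "lab": "lab_B"},
    {"cell_type": "K562", "assay": "H3K27ac", "lab": "lab_B"},
]

# Disjoint factor groups, each tied to its own feature subspace (assumed sizes).
groups = {"biological": ["cell_type", "assay"], "technical": ["lab"]}
subspace_dim = {"biological": 8, "technical": 4}

# One embedding per (factor, value) pair, living in its group's subspace.
# Randomly initialised here; in a real model these would be trained.
embeddings = {}
for group, factors in groups.items():
    for factor in factors:
        for value in sorted({t[factor] for t in targets}):
            embeddings[(factor, value)] = rng.normal(size=subspace_dim[group])


def output_weights(group):
    """Build weights of shape (n_targets, subspace_dim[group]) from metadata."""
    return np.stack(
        [sum(embeddings[(f, t[f])] for f in groups[group]) for t in targets]
    )


def predict(features):
    """Map per-group features (batch, dim) to logits of shape (batch, n_targets)."""
    return sum(features[g] @ output_weights(g).T for g in groups)


def permute_subspace(features, group, rng):
    """Shuffle one subspace across the batch, breaking its link to the others."""
    shuffled = dict(features)
    order = rng.permutation(len(features[group]))
    shuffled[group] = features[group][order]
    return shuffled


# Stand-in for the features a sequence model would produce for 5 inputs.
batch = {g: rng.normal(size=(5, d)) for g, d in subspace_dim.items()}
logits = predict(batch)
independent = permute_subspace(batch, "technical", rng)
```

In this sketch, targets that share a factor value share the corresponding part of their output weights, which is what ties a feature subspace to a group of experimental factors.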