Machine Learning Facilitates Imputation of Gene Expression Levels across Multiple Environments

Ziang Xu, H. Qi
Published in: 2021 11th International Conference on Bioscience, Biochemistry and Bioinformatics
Publication date: 2021-01-09
DOI: 10.1145/3448340.3448342 (https://doi.org/10.1145/3448340.3448342)

Abstract

Gene expression levels reflect the biological processes active in a living cell, so quantifying expression across multiple environments is of great importance. However, expression levels in some environments or strains of a species may not be measured correctly, because of sequence diversity or technical limitations of mRNA-seq, qPCR, or microarray assays. It would therefore be highly beneficial to infer the missing expression levels from existing data; this process of filling in missing values is called imputation. Imputation is a very active field in machine learning, and many tech companies use it to infer customer preferences for products, movies, and so on. Here we apply multiple state-of-the-art imputation methods and compare their performance in predicting gene expression levels across multiple environments. Using an expression dataset of Saccharomyces cerevisiae spanning 13 environments, we randomly removed 5%, 20%, 50%, and 75% of the expression levels, applied each imputation method to predict the missing values, and compared model performance by root mean squared error (RMSE). SVD performed best among the five methods, followed by KNN with five nearest neighbors and then KNN with two nearest neighbors. In contrast, univariate mean and univariate median imputation performed worst, and performed similarly to each other. Although these two univariate methods are very commonly used in practice, our results highlight the benefit of machine learning methods for imputing expression levels across environments.
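
The masking-and-scoring protocol described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the data are synthetic (not the yeast dataset), rows are assumed to be genes and columns environments, the function names (`mask_random`, `impute_univariate`, `impute_knn`, `rmse`) are hypothetical, and the KNN variant is a simple hand-written one. SVD imputation is left out to keep the sketch dependency-free, since it would normally rely on a linear algebra library.

```python
import math
import random
import statistics


def mask_random(matrix, fraction, rng):
    """Copy `matrix`, hide `fraction` of its cells as None, return (copy, hidden cells)."""
    cells = [(i, j) for i, row in enumerate(matrix) for j in range(len(row))]
    hidden = rng.sample(cells, round(fraction * len(cells)))
    masked = [list(row) for row in matrix]
    for i, j in hidden:
        masked[i][j] = None
    return masked, hidden


def impute_univariate(masked, stat):
    """Fill each column's gaps with one statistic (mean or median) of its observed values."""
    fills = []
    for j in range(len(masked[0])):
        observed = [row[j] for row in masked if row[j] is not None]
        fills.append(stat(observed) if observed else 0.0)  # 0.0 only if a column is empty
    return [[fills[j] if v is None else v for j, v in enumerate(row)] for row in masked]


def impute_knn(masked, k):
    """Fill each gap with the mean of that column over the k nearest rows (assumed variant)."""
    fallback = impute_univariate(masked, statistics.mean)
    filled = [list(row) for row in masked]
    for i, row in enumerate(masked):
        missing = [j for j, v in enumerate(row) if v is None]
        if not missing:
            continue
        # Distance to every other row, using only columns observed in both rows.
        dists = []
        for m, other in enumerate(masked):
            if m == i:
                continue
            sq = [(a - b) ** 2 for a, b in zip(row, other)
                  if a is not None and b is not None]
            if sq:
                dists.append((math.sqrt(sum(sq) / len(sq)), m))
        dists.sort()
        for j in missing:
            donors = [masked[m][j] for _, m in dists if masked[m][j] is not None][:k]
            filled[i][j] = statistics.mean(donors) if donors else fallback[i][j]
    return filled


def rmse(truth, imputed, hidden):
    """Root mean squared error over the hidden cells only."""
    return math.sqrt(sum((truth[i][j] - imputed[i][j]) ** 2 for i, j in hidden) / len(hidden))


# Synthetic stand-in data (NOT the yeast dataset): 60 genes x 13 environments.
data_rng = random.Random(0)
env_effect = [data_rng.gauss(0, 1) for _ in range(13)]
truth = []
for _ in range(60):
    baseline = data_rng.gauss(8, 2)
    truth.append([baseline + e + data_rng.gauss(0, 0.3) for e in env_effect])

methods = {
    "mean": lambda m: impute_univariate(m, statistics.mean),
    "median": lambda m: impute_univariate(m, statistics.median),
    "knn_k2": lambda m: impute_knn(m, 2),
    "knn_k5": lambda m: impute_knn(m, 5),
}

mask_rng = random.Random(1)
results = {}
for fraction in (0.05, 0.20, 0.50, 0.75):
    masked, hidden = mask_random(truth, fraction, mask_rng)
    results[fraction] = {name: rmse(truth, fn(masked), hidden) for name, fn in methods.items()}
    print(fraction, {name: round(v, 3) for name, v in results[fraction].items()})
```

Scoring only the hidden cells keeps the comparison fair, because every method sees the same observed values and is judged on the same held-out entries. The numbers printed for this synthetic matrix say nothing about the paper's reported ranking.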