Machine Learning Facilitates Imputation of Gene Expression Levels across Multiple Environments
Ziang Xu, H. Qi
2021 11th International Conference on Bioscience, Biochemistry and Bioinformatics, published 2021-01-09
DOI: 10.1145/3448340.3448342 (https://doi.org/10.1145/3448340.3448342)
Abstract
Gene expression levels reflect the biological processes active in a living cell, so quantifying them across multiple environments is of great importance. However, the expression level in some environments or strains of a species may not be measured correctly, whether because of sequence diversity or because of technical limitations of mRNA-seq, qPCR, or microarrays. It would therefore be highly beneficial to infer the missing expression levels from existing data; this process of filling in missing values is called imputation. Imputation is a very active field in machine learning, and many tech companies use it to infer customer preferences for products, movies, and so on. Here we apply multiple state-of-the-art imputation methods and compare their performance in predicting gene expression levels across multiple environments. Using a multi-environment expression dataset of Saccharomyces cerevisiae spanning 13 environments, we randomly removed 5%, 20%, 50%, and 75% of the expression levels from the dataset, applied the imputation methods to predict the missing values, and used root mean squared error (RMSE) to compare model performance. We found that SVD performs best among the five methods, followed by KNN with five nearest neighbors and KNN with two nearest neighbors. In contrast, univariate mean and univariate median imputation perform the worst and perform similarly to each other. Although these two univariate methods are very commonly used in practice, our results highlight the benefit of using machine learning methods for imputation to better predict expression levels across environments.
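The evaluation protocol described in the abstract (hide a random fraction of entries, impute them, and score the imputed entries by root mean squared error) can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline. The abstract does not say which software or which SVD variant was used, so the sketch assumes scikit-learn's `SimpleImputer` and `KNNImputer` and a simple iterative low-rank SVD fill. The yeast dataset is not reproduced here, so `make_synthetic` generates a hypothetical genes-by-13-environments matrix. The helper names `mask_random`, `svd_impute`, `rmse`, and `run_benchmark`, and the parameters `rank` and `n_iter`, are likewise illustrative.

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer


def make_synthetic(n_genes=300, n_envs=13, rank=2, noise=0.1, seed=0):
    """Hypothetical stand-in for an expression matrix: low-rank signal plus noise."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_genes, rank)) @ rng.normal(size=(rank, n_envs))
    return X + noise * rng.normal(size=X.shape)


def mask_random(X, frac, rng):
    """Copy X and set roughly `frac` of its entries to NaN; also return the mask."""
    mask = rng.random(X.shape) < frac
    X_missing = X.copy()
    X_missing[mask] = np.nan
    return X_missing, mask


def svd_impute(X_missing, rank=2, n_iter=50):
    """Iterative low-rank SVD imputation (one common variant, assumed here)."""
    missing = np.isnan(X_missing)
    # Start from column means, then repeatedly refit a rank-`rank` approximation
    # and overwrite only the missing entries with it.
    X_filled = np.where(missing, np.nanmean(X_missing, axis=0), X_missing)
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(X_filled, full_matrices=False)
        low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        X_filled = np.where(missing, low_rank, X_missing)
    return X_filled


def rmse(X_true, X_pred, mask):
    """Root mean squared error over the held-out (masked) entries only."""
    return float(np.sqrt(np.mean((X_true[mask] - X_pred[mask]) ** 2)))


def run_benchmark(X, fractions=(0.05, 0.20, 0.50, 0.75), seed=1):
    """Return {fraction: {method: RMSE}} for the five methods named in the abstract."""
    rng = np.random.default_rng(seed)
    methods = {
        "mean": lambda M: SimpleImputer(strategy="mean").fit_transform(M),
        "median": lambda M: SimpleImputer(strategy="median").fit_transform(M),
        "knn_k2": lambda M: KNNImputer(n_neighbors=2).fit_transform(M),
        "knn_k5": lambda M: KNNImputer(n_neighbors=5).fit_transform(M),
        "svd": svd_impute,
    }
    results = {}
    for frac in fractions:
        X_missing, mask = mask_random(X, frac, rng)
        results[frac] = {
            name: rmse(X, impute(X_missing), mask) for name, impute in methods.items()
        }
    return results


if __name__ == "__main__":
    for frac, scores in run_benchmark(make_synthetic()).items():
        print(frac, {name: round(score, 3) for name, score in scores.items()})
```

Scoring only the masked entries keeps the comparison fair, because every method leaves the observed values untouched. The numbers this sketch prints come from synthetic data and say nothing about the yeast results reported above.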