A compact encoding of the genome suitable for machine learning prediction of traits and genetic risk scores.

IF 6.1, Zone 3 (Biology), Q1, MATHEMATICAL & COMPUTATIONAL BIOLOGY

Biodata Mining, Pub Date: 2025-06-19, DOI: 10.1186/s13040-025-00459-4

Yasaman Fatapour, James P Brody

Biodata Mining, volume 18, issue 1, article 44. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12180147/pdf/

Citation count: 0

Abstract

Genotype-to-phenotype prediction is a central problem in biology and medicine, and machine learning is a natural tool to address it. However, a person's genotype is usually represented by a few million single-nucleotide polymorphisms, while most datasets have only a few thousand patients. The problem therefore typically has many more predictors than samples (patients), making it unsuitable for machine learning. The objective of this paper is to examine how well a compact genotype representation, one that uses a limited number of predictors, predicts a person's phenotype when combined with machine learning. We characterized a person's genotype using chromosome-scale length variation, a measure computed as the average of the reported log R ratios across a portion of a chromosome. We computed these values from data collected by the NIH All of Us program. We used the AutoML function (h2o.ai) in binary classification mode to identify the best models for differentiating male/female, Black/White, White/Asian, and Black/Asian. We also used the AutoML function in regression mode to predict height from age and genotype. Using only information from chromosomes 1-22, we could effectively classify a person as male/female (AUC = 0.9988 ± 0.0001), White/Black (AUC = 0.970 ± 0.002), Asian/White (AUC = 0.877 ± 0.002), and Black/Asian (AUC = 0.966 ± 0.002). This approach also effectively predicted height. In conclusion, this compact representation of a person's genotype, together with machine learning, can effectively predict a person's phenotype.
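The abstract defines chromosome-scale length variation as the average of the reported log R ratios across a portion of a chromosome. The following is a minimal sketch of that averaging step, not the authors' implementation. The function name `cslv_features`, the `(chromosome, position, log_r_ratio)` input layout, and the choice to split each chromosome into equal-sized groups of probes are all assumptions made for illustration; the abstract does not say how the portions are defined.

```python
from collections import defaultdict
from statistics import fmean


def cslv_features(probes, segments_per_chromosome=4):
    """Average log R ratios over portions of each chromosome.

    probes: iterable of (chromosome, position, log_r_ratio) tuples
            (hypothetical layout, assumed for this sketch).
    Returns a dict mapping (chromosome, segment_index) to the mean log R ratio.
    """
    # Group probes by chromosome.
    by_chrom = defaultdict(list)
    for chrom, pos, lrr in probes:
        by_chrom[chrom].append((pos, lrr))

    features = {}
    for chrom, rows in by_chrom.items():
        rows.sort()  # order probes by genomic position
        n = len(rows)
        for k in range(segments_per_chromosome):
            # Assumption: portions hold roughly equal numbers of probes.
            start = k * n // segments_per_chromosome
            stop = (k + 1) * n // segments_per_chromosome
            chunk = [lrr for _, lrr in rows[start:stop]]
            if chunk:  # skip empty portions on sparsely covered chromosomes
                features[(chrom, k)] = fmean(chunk)
    return features


# Toy input: four probes on chromosome 1 and two on chromosome 2.
probes = [
    (1, 100, 0.10), (1, 200, 0.30), (1, 300, -0.20), (1, 400, 0.00),
    (2, 50, 0.05), (2, 150, -0.05),
]
features = cslv_features(probes, segments_per_chromosome=2)
for key in sorted(features):
    print(key, round(features[key], 3))
```

With a handful of portions per autosome, each person is described by at most a few hundred numbers rather than millions of SNPs, which is the reduction in predictor count the abstract relies on. The resulting table could then be passed to any classifier or regressor.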




Source journal

Biodata Mining (MATHEMATICAL & COMPUTATIONAL BIOLOGY)

CiteScore

7.90

Self-citation rate

0.00%

Articles published

Review time

23 weeks

About the journal: BioData Mining is an open access, open peer-reviewed journal encompassing research on all aspects of data mining applied to high-dimensional biological and biomedical data, focusing on computational aspects of knowledge discovery from large-scale genetic, transcriptomic, genomic, proteomic, and metabolomic data. Topical areas include, but are not limited to:
- Development, evaluation, and application of novel data mining and machine learning algorithms.
- Adaptation, evaluation, and application of traditional data mining and machine learning algorithms.
- Open-source software for the application of data mining and machine learning algorithms.
- Design, development and integration of databases, software and web services for the storage, management, retrieval, and analysis of data from large scale studies.
- Pre-processing, post-processing, modeling, and interpretation of data mining and machine learning results for biological interpretation and knowledge discovery.