Improved breast cancer risk prediction using chromosomal-scale length variation.
Yasaman Fatapour, James P Brody
Human Genomics, vol. 19, no. 1, p. 65. Published 2025-06-11.
DOI: 10.1186/s40246-025-00776-z (https://doi.org/10.1186/s40246-025-00776-z)
Full text: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12160350/pdf/
Abstract
Introduction: Early diagnosis of breast cancer leads to higher long-term survival rates. The development of a germline genetic test, or polygenic risk score, to identify women at high risk of breast cancer holds the potential to reduce cancer deaths. However, current tests based on SNPs do not perform much better than predictions based on family history and perform significantly worse in populations with non-European ancestry. We have developed an alternative method to characterize a genome, called chromosomal-scale length variation, which can be applied to polygenic risk scores.
Objective: To characterize, using the NIH All of Us dataset, how a breast cancer genetic risk score based on chromosomal-scale length variation performs in different self-identified racial groups when trained on different populations.
Methods: From the NIH All of Us dataset, we compiled a cohort of 4,533 women who have been diagnosed with breast cancer (including 440 who self-identified as Black) and 44,518 women who have not. We acquired, through All of Us, genetic information for each of these women. For each woman, we computed a set of 88 values representing the chromosomal-scale length variation parameters: the average log R ratio for each of four segments on each of the 22 autosomes. We used machine learning algorithms to find the model that best differentiates the women with breast cancer from the women without, based on the 88 numbers that characterize each woman's germline genome.
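As a rough illustration of the feature construction described in the Methods, the sketch below turns per-probe log R ratios into 88 values (22 autosomes times 4 segments). The abstract does not say how the four segments are delimited, so the equal-width split by position, the input layout, and the name `cslv_features` are all assumptions made for this example, not the authors' pipeline.

```python
from statistics import fmean

N_AUTOSOMES = 22
SEGMENTS_PER_CHROMOSOME = 4  # 22 * 4 = 88 values per genome


def cslv_features(probes):
    """Return 88 mean log R ratios, ordered by chromosome, then segment.

    `probes` maps a chromosome number (1..22) to a list of
    (position, log_r_ratio) pairs. Splitting each chromosome into four
    equal-width position ranges is an assumption for illustration only.
    """
    features = []
    for chrom in range(1, N_AUTOSOMES + 1):
        points = probes[chrom]
        lo = min(pos for pos, _ in points)
        hi = max(pos for pos, _ in points)
        # Fall back to width 1.0 if every probe shares one position.
        width = (hi - lo) / SEGMENTS_PER_CHROMOSOME or 1.0
        buckets = [[] for _ in range(SEGMENTS_PER_CHROMOSOME)]
        for pos, lrr in points:
            # Clamp so the right-most probe lands in the last segment.
            idx = min(int((pos - lo) / width), SEGMENTS_PER_CHROMOSOME - 1)
            buckets[idx].append(lrr)
        for bucket in buckets:
            # A segment with no probes yields NaN rather than a guess.
            features.append(fmean(bucket) if bucket else float("nan"))
    return features
```

Each woman's genome is then represented by one such 88-element vector, which is what a downstream classifier would take as input.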
Results: The best model had an AUC of 0.70 (95% CI, 0.67-0.73) in the All of Us population. Women who scored in the top quintile by this model were nine times more likely to have breast cancer when compared to women who scored in the lowest quintile.
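The two summary statistics in the Results can be computed from a list of model scores and case/control labels. The sketch below is a minimal pure-Python version: the AUC uses the standard Mann-Whitney rank-sum identity, and the quintile comparison is read here as a ratio of case fractions, which is an assumption because the abstract does not state whether the reported ninefold figure is a prevalence ratio or an odds ratio. The function names are hypothetical.

```python
def auc(scores, labels):
    """Area under the ROC curve via the Mann-Whitney rank-sum identity.

    `labels` holds 1 for cases and 0 for controls.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Tied scores share the average of the ranks they span.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def quintile_prevalence_ratio(scores, labels):
    """Case fraction in the top score quintile divided by the bottom one.

    Needs at least five samples and at least one case in the bottom
    quintile; otherwise the division fails.
    """
    ranked = [y for _, y in sorted(zip(scores, labels))]
    k = len(ranked) // 5
    bottom, top = ranked[:k], ranked[-k:]
    return (sum(top) / k) / (sum(bottom) / k)
```

A confidence interval like the one reported for the AUC would typically come from resampling (for example, a bootstrap over participants); the abstract does not say which method was used.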
Conclusion: We found that this method of computing genetic risk scores for breast cancer is a substantial improvement over SNP-based polygenic risk scores. We also compared models trained only on White women with models trained only on Black women. When tested only on White women, the models trained only on White women performed better than those trained only on Black women. When the two models were tested only on Black women, we saw no significant difference between them.
About the journal:
Human Genomics is a peer-reviewed, open access, online journal that focuses on the application of genomic analysis in all aspects of human health and disease, as well as genomic analysis of drug efficacy and safety, and comparative genomics.
Topics covered by the journal include, but are not limited to: pharmacogenomics, genome-wide association studies, genome-wide sequencing, exome sequencing, next-generation deep-sequencing, functional genomics, epigenomics, translational genomics, expression profiling, proteomics, bioinformatics, animal models, statistical genetics, genetic epidemiology, human population genetics and comparative genomics.