CHIC: A machine learning framework for inferring the presence of high-risk clonal hematopoiesis using complete blood count data from 431,531 UK Biobank participants

IF 14.6 | Zone 2, Medicine | Q1 HEMATOLOGY
HemaSphere Pub Date : 2025-07-03 DOI:10.1002/hem3.70169
William G. Dunn, Isabella Withnell, Muxin Gu, Pedro Quiros, Sruthi Cheloor Kovilakam, Ludovica Marando, Sean Wen, Margarete A. Fabre, Irina Mohorianu, Dragana Vuckovic, George S. Vassiliou
Citations: 0

Abstract

Clonal hematopoiesis (CH) is an age-related phenomenon that arises when a hematopoietic stem cell acquires a somatic driver mutation (i.e., one that increases its fitness), leading to clonal expansion of the cell and its progeny.1, 2 Large population-based studies have revealed that the most commonly mutated genes in CH are involved in epigenetic regulation (DNMT3A, TET2, and ASXL1), signal transduction (JAK2, GNB1), DNA damage response and apoptosis (TP53, PPM1D), and splicing (SF3B1, SRSF2, and U2AF1).1-6 The prevalence of CH increases with advancing age to affect at least 20% of those over 70 years, in whom the phenomenon is almost universally detectable when deep sequencing approaches are employed.1-6

A hallmark of CH is the associated increased risk of incident myeloid neoplasms (MNs), a molecularly heterogeneous group of blood cancers that include acute myeloid leukemia, myelodysplastic syndromes (MDSs), and myeloproliferative neoplasms (MPNs). Recent advances have led to the development of predictive tools that estimate the risk of progression from CH to MN,7, 8 such that individuals at high risk can be identified and prioritized for clinical follow-up. As CH precedes the development of MN by several years,1-3, 7, 9, 10 this provides a window during which high-risk clones could be intercepted and targeted to avert or delay the development of MN.

A key impediment to prospective myeloid cancer prevention programs is the lack of a scalable test to identify individuals with CH. At present, CH is identified by next-generation sequencing (NGS) of blood DNA targeted to a panel of genes recurrently mutated in MN. However, NGS is not performed in routine clinical practice and is impractical and costly to perform at scale. An alternative approach is to leverage low-cost, scalable, routine clinical tests to identify the individuals most likely to harbor CH, who can then be prioritized for sequencing. The complete blood count (CBC) is an inexpensive, routine clinical test, and CBC indices such as the red cell distribution width (RDW) and mean cell volume are known to be associated with progression from CH to MN.10 We therefore sought to explore whether tree-based machine learning (ML) models could detect individuals with CH based on CBC features, through analysis of paired CBC and whole-exome sequencing (WES) data from 431,531 United Kingdom Biobank (UKB) participants.

After excluding those with missing CBC data (n = 32,670), missing WES data (n = 36,368), or a prevalent diagnosis of a hematological malignancy (n = 1840), CH (variant allele frequency [VAF] ≥2%) was identified in 20,860/431,531 (4.8%) UKB participants, of whom 7637 (36.6%) had large clone CH (VAF ≥10%; Figure 1A, Table S1). Using this UKB dataset, we developed a range of tree-based models within our ML framework, which we henceforth refer to as CHIC (Clonal Hematopoiesis Inference from Counts; see Supplementary Methods).

Using CHIC, we initially developed binary classifiers (CH/no CH) agnostic to specific driver mutations (“any-driver CH”), which performed modestly (median area under the receiver operating characteristic curve [AUC] 0.62–0.64, Figure S1 and Supplementary Methods/Results). Given the molecular heterogeneity of CH, we then trained driver gene-specific classifiers and found that common mutations in epigenetic modifiers were less accurately predicted (e.g., median AUC 0.60/0.64 for DNMT3A/TET2, respectively), whereas lower prevalence, high-risk mutations in JAK2, CALR, SF3B1, SRSF2, and U2AF1 were more accurately predicted (median AUC 0.82–0.94, Figure 1, see also Supplementary Results, Table S2). Ensemble ML algorithms performed best (Figure 1), and further analysis showed that although age and sex alone were weak predictors, incorporating these demographic features improved classifier performance, particularly for splicing factor gene mutations (Figure 1C). As such, we focused on further optimizing Random Forest (RF) models using age, sex, and CBC indices as input features in subsequent models.
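For readers less familiar with the metric, the AUC quoted above can be read as the probability that a randomly chosen CH carrier receives a higher classifier score than a randomly chosen non-carrier. The following is a minimal sketch of that pairwise definition on toy data. The function name `auc`, the toy scores, and the per-split values are illustrative assumptions, not part of the CHIC code, and summarizing repeated splits by their median is assumed here to be what "median AUC" refers to.

```python
from statistics import median


def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical scores, not UKB data):
# positives score 0.35 and 0.8, negatives score 0.1 and 0.4,
# so 3 of the 4 positive/negative pairs are ordered correctly.
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_labels = [0, 0, 1, 1]
toy_auc = auc(toy_scores, toy_labels)

# Hypothetical AUCs from repeated train/test splits, summarized by the median.
split_aucs = [0.84, 0.86, 0.85]
summary_auc = median(split_aucs)
```

An AUC of 0.5 corresponds to random ranking, which is why values of 0.60 to 0.64 are described as modest while 0.82 to 0.94 indicate strong discrimination.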

Since CH with mutations in any of JAK2, CALR, SF3B1, SRSF2, or U2AF1 was more predictable from CBC indices and more clinically relevant (associated with higher risk of progression to MN), we next combined all five genes into a single binary classifier of CH with high-risk genotype (CH-HRG), to predict the presence/absence of a mutation in any of these five high-risk genes (training on input data labeled as “CH-HRG” vs. “no CH-HRG”). The resulting median AUC was 0.85 on the unseen test set (Figure 2A), with performance further enhanced when predicting the presence of large (VAF ≥10%) CH-HRG clones (median AUC on unseen test set 0.90, Figure S2). By performing iterative feature selection, we developed a compact CH-HRG model that demonstrated stable performance when using only age and five CBC indices as input features (Figure 2B, Figure S3).

Next, we assessed the optimal probability score cutoff (threshold) for our compact CH-HRG model by examining the trade-off between sensitivity and positive predictive value (PPV) (Figure 2C). In our UKB cohort, CH-HRG was rare (795/431,531 UKB participants, prevalence 0.18%). Since the PPV is strongly influenced by the prevalence of positive cases, this necessitated the use of a stringent probability score cutoff to minimize the number of false positives. To achieve this, we chose a cutoff probability of 0.925 (i.e., our classifier predicts the presence of CH-HRG when the predicted probability of CH-HRG is ≥0.925). Using this threshold resulted in a PPV of 8.1% and sensitivity of 20.1% in our unseen test cohort (n = 86,306), while maintaining specificity and negative predictive value of >99.5% (Tables S3 and S4). We did not observe evidence of variation in performance of the classifier by self-reported ancestry despite variation in the distribution of CBC indices (Supplementary Results, Figure S4, Tables S5 and S6); however, 94.1% of our final cohort described their ancestry as European.
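The dependence of PPV on prevalence described above follows directly from Bayes' rule. The sketch below is illustrative only: the sensitivity (0.20) and specificity (0.995) are rounded, hypothetical stand-ins chosen to be close to the reported figures, the 2% "enriched" prevalence is invented for contrast, and `predict_ch_hrg` is a hypothetical helper that simply applies the 0.925 cutoff stated in the text.

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value from test characteristics via Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


def predict_ch_hrg(probability, cutoff=0.925):
    """Call CH-HRG when the predicted probability is at or above the cutoff."""
    return probability >= cutoff


# Identical (hypothetical) test characteristics at two prevalences.
sens, spec = 0.20, 0.995
ppv_rare = ppv(sens, spec, 0.0018)     # prevalence similar to the unselected cohort
ppv_enriched = ppv(sens, spec, 0.02)   # hypothetical enriched population

print(f"PPV at 0.18% prevalence: {ppv_rare:.1%}")
print(f"PPV at 2% prevalence:    {ppv_enriched:.1%}")
```

With the classifier held fixed, the PPV rises several-fold when the same test is applied to a population in which carriers are more common, which is the motivation for both the stringent cutoff and the later focus on enriched subgroups.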

A key limitation for the identification of CH in the UKB is the low coverage of WES, with driver genes JAK2, SF3B1, and U2AF1 all having a median sequencing coverage of ≤31 reads,7 which limits our ability to identify small CH clones (e.g., with VAFs <2%). It is plausible that even small clones may be associated with CBC changes, so we examined long-term outcomes for the 365 “false positive” cases identified by our CH-HRG classifier and found that 38/365 (10.4%) developed MN at a median of 5.2 years from sampling. By contrast, only 317/85,782 (0.4%) of “true negatives” developed MN. Since CH is the shared precursor of the vast majority of MNs, these observations strongly suggest that a subset of “false positive” individuals had CH below the limit of detection of WES.
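The counts in this paragraph are enough to rebuild an approximate confusion matrix for the test cohort. In the sketch below, the false-positive count (365), the true-negative count (85,782), and the cohort size (86,306) come from the text. The true-positive count of 32 is not stated anywhere in the article: it is an assumption, chosen as the integer that reproduces the reported PPV (8.1%) and sensitivity (20.1%). The risk ratio is a simple descriptive quotient, not an adjusted estimate.

```python
n_test = 86_306                 # unseen test cohort (from the text)
fp, tn = 365, 85_782            # false positives and true negatives (from the text)

positives = n_test - fp - tn    # CH-HRG carriers in the test set (derived)
tp = 32                         # ASSUMED: reproduces PPV 8.1% and sensitivity 20.1%
fn = positives - tp             # carriers missed at the 0.925 cutoff (derived)

ppv = tp / (tp + fp)
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
npv = tn / (tn + fn)

# Incident MN among "false positives" vs. "true negatives" (counts from the text).
mn_rate_fp = 38 / 365
mn_rate_tn = 317 / 85_782
risk_ratio = mn_rate_fp / mn_rate_tn

print(f"PPV {ppv:.1%}, sensitivity {sensitivity:.1%}")
print(f"specificity {specificity:.2%}, NPV {npv:.2%}")
print(f"crude MN risk ratio, false positives vs. true negatives: {risk_ratio:.1f}")
```

Under these assumptions, incident MN was more than 25 times as frequent among the apparent false positives as among the true negatives, consistent with the interpretation that some of these individuals carried undetected clones.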

To further explore this hypothesis, we searched for low VAF hotspot mutations involving any of our five high-risk genes among the 38 individuals who developed MN but were not found to have such a hotspot mutation by standard variant calling. To do so, we used “pileup” to detect mutant reads at these hotspots that had been filtered out by the stringent criteria of our standard calling pipeline. This revealed that 13 of 38 apparently false-positive UKB participants who developed incident MN had detectable CH mutations by this method, including 11 with driver mutations in JAK2, a low-coverage gene. This strongly suggests that we underestimated our model performance due to the limitations of WES. In addition to cases below the limit of detection, we also examined our false-positive cases for lower VAF mutations in a high-risk gene and identified six cases bearing CH-HRG mutations (two cases each of JAK2, SF3B1, and SRSF2-CH) that were classified under their highest VAF mutation as DNMT3A, TET2, or ASXL1-CH.
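Why low coverage hides small clones can be illustrated with a simple binomial sampling model. This is a sketch under stated assumptions rather than a description of the calling pipeline: each read is treated as independently mutant with probability equal to the VAF, sequencing error is ignored, the depth of 31 is the upper bound on median coverage quoted earlier, and the requirement of at least two mutant reads (`min_alt_reads`) is a hypothetical calling rule chosen for illustration.

```python
from math import comb


def prob_at_least(k, depth, vaf):
    """P(X >= k) for X ~ Binomial(depth, vaf): chance of sampling >= k mutant reads."""
    below = sum(
        comb(depth, i) * vaf**i * (1.0 - vaf) ** (depth - i) for i in range(k)
    )
    return 1.0 - below


depth = 31          # upper bound on median coverage quoted in the text
min_alt_reads = 2   # HYPOTHETICAL requirement, not the authors' pipeline setting

p_small_clone = prob_at_least(min_alt_reads, depth, 0.02)  # clone at VAF 2%
p_large_clone = prob_at_least(min_alt_reads, depth, 0.10)  # clone at VAF 10%

print(f"P(>= {min_alt_reads} mutant reads | VAF 2%):  {p_small_clone:.2f}")
print(f"P(>= {min_alt_reads} mutant reads | VAF 10%): {p_large_clone:.2f}")
```

Under this toy model most clones at 2% VAF would yield fewer than two supporting reads at a depth of 31, whereas most clones at 10% VAF would be sampled adequately, which is in line with the recovery of additional mutations once filtered reads were inspected directly.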

Further examination of cases identified by CHIC revealed an enrichment in cases with thrombocytosis, suggestive of undiagnosed or unannotated MPN rather than CH (Figure S5). Similarly, a few cases had cytopenias that would fall into the diagnostic criteria for clonal cytopenia of undetermined significance (CCUS) or MDS.11 To overcome this, we first constrained our training/test sets to individuals without cytopenias (hemoglobin <12/13 g/dL for males/females, respectively, neutrophils <1.8 × 10⁹/L, platelets <150 × 10⁹/L), thrombocytosis (platelets >450 × 10⁹/L), or erythrocytosis (hemoglobin >16.5/16 g/dL or hematocrit >49/48% for males/females, respectively), thereby excluding possible undiagnosed CCUS/MDS/MPN cases, and retrained our CH-HRG classifier. This led to only a minor reduction in performance (median AUC on unseen test set 0.80, Figure S6); however, this exacerbated the trade-off between sensitivity and PPV, leading to sensitivity and PPV of only 11.3% and 2.0%, respectively, at our proposed cutoff probability for predicting CH-HRG of 0.875 (Figure S6D).

Next, considering the challenges posed by applying CHIC in an unselected population, we postulated that the performance may be improved if we targeted the use of CHIC to a population with abnormal CBC indices, where the prevalence of CH-HRG is enriched. We therefore investigated the performance of CHIC on the 9576 UKB participants with thrombocytopenia (CH-HRG was present in 53/9576, prevalence 0.45%, representing 2.5-fold enrichment). We found that CHIC performed strongly (median AUC 0.93) in this setting, and a more lenient threshold could be applied in view of a more favorable sensitivity/PPV trade-off (Figure 2D, Supplementary Results, Figure S7). In addition to predicting CH-HRG, we also considered that CHIC could be used to identify the presence of high-risk CH as determined by recently developed MN risk prediction tools.7, 8 By training a classifier using labels based on risk score (10-year MN risk ≥10%) rather than genotype, we could robustly differentiate between high-risk CH and controls (median AUC 0.96, see Supplementary Results, Figure S8), but as these risk stratification tools were trained on UKB blood count data, this strong performance may arise from overfitting and requires validation in an independent cohort.

We developed an ML framework and assessed an RF classifier that predicts the presence of CH-HRG from just five CBC variables and the individual's age. This approach, named CHIC, can discriminate between individuals with and without mutations in five CH genes associated with high risk of developing MN. Notably, CHIC retained the ability to discriminate high-risk CH cases from controls even among individuals without cytopenias, erythrocytosis, or thrombocytosis, suggesting that it can highlight individuals that may not otherwise come to medical attention. CHIC is an important first step towards developing a scalable screening test to identify individuals likely to harbor high-risk CH, who would then be prioritized for targeted NGS. Clinically, this could be utilized to reduce the number needed to sequence (NNS) to identify one case of CH-HRG, thus making screening at scale more feasible and justifying the need to perform genetic testing. Even with its current limitations, the use of CHIC with a stringent cutoff probability on individuals without cytopenia or thrombo-/erythrocytosis would still markedly reduce the NNS from 727 to 40 individuals per case of high-risk CH (based on the prevalence of high-risk CH in an unselected population vs. in those predicted to have high-risk CH by CHIC). Also, when we applied CHIC without constraints on CBC indices, it identified individuals with high-risk mutations and indices consistent with CCUS/MDS or MPN, rather than CH, suggesting it could also be utilized to identify undiagnosed individuals without relying on clinician recognition and referral.
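The number needed to sequence can be sketched as a reciprocal, assuming (the article does not give a formula) that NNS is one divided by the fraction of sequenced individuals who carry CH-HRG: the prevalence when sequencing everyone, or the PPV when sequencing only those flagged by the classifier. The inputs below are the full-cohort prevalence (795/431,531) and the PPV of 8.1% reported for the unconstrained compact model, so the outputs differ from the 727 and 40 quoted above, which refer to the cohort without cytopenia or thrombo-/erythrocytosis.

```python
def number_needed_to_sequence(case_fraction):
    """Expected number of people sequenced per CH-HRG case found (assumed: 1 / fraction)."""
    if not 0.0 < case_fraction <= 1.0:
        raise ValueError("case_fraction must be in (0, 1]")
    return 1.0 / case_fraction


# Sequencing everyone: fraction of carriers equals the cohort prevalence.
nns_unselected = number_needed_to_sequence(795 / 431_531)

# Sequencing only classifier-positive individuals: fraction of carriers equals the PPV.
nns_flagged = number_needed_to_sequence(0.081)

# Fractions implied by the NNS values quoted in the text for the constrained cohort.
implied_prevalence = 1.0 / 727
implied_ppv = 1.0 / 40

print(f"NNS, unselected: {nns_unselected:.0f}")
print(f"NNS, flagged at cutoff 0.925: {nns_flagged:.1f}")
```

Read this way, any gain in PPV translates directly into fewer sequencing tests per case identified, which is the sense in which a CBC-based pre-screen could make sequencing at scale more practical.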

However, despite its promising metrics, the performance of CHIC in an unselected population was limited by the rarity of high-risk CH, necessitating the ceding of sensitivity to achieve an acceptable PPV. One approach for enhancing the performance of CHIC is to target its use to a population with a higher prevalence of high-risk CH. For example, targeting CHIC to a thrombocytopenic cohort substantially ameliorated the trade-off between sensitivity and PPV. We anticipate that CHIC will generalize to specific contexts where mutations in JAK2, CALR, SF3B1, SRSF2, and U2AF1 predominate (e.g., thrombo-/erythrocytosis or cytopenias), although in other “high-risk” contexts, such as detecting clonal expansions post-chemotherapy, CHIC may not generalize well since both the mutational (TP53 and PPM1D-enriched) and CBC (therapy-related CBC perturbations) landscapes differ substantially from the context in which CHIC was trained and optimized. We expect that CHIC would be best applied to community-dwelling adults attending primary care, since an inpatient population would be expected to have higher rates of inflammation and infection that could perturb CBC indices and detrimentally affect model performance.

An alternative approach to improve performance would be to integrate higher resolution CBC data into the CHIC classifier, since the most discriminative CBC indices for high-risk CH are derived summary statistics calculated from single-cell measurements (e.g., RDW, platelet distribution width, and mean cell hemoglobin). The use of embeddings of raw single-cell measurements has the potential to improve the prediction of high-risk CH, for example, by revealing bimodality in the cell size distribution arising from a clonal population of cells with distinct indices or by identifying other characteristic patterns of variation in these measurements. Such raw (or “non-classical”) CBC traits have recently been exploited to explore genetic associations with blood cell morphology.12 By retrofitting CH-HRG screening onto a routine blood test, we believe our CHIC approach presents an important step towards scalable, practical, and inexpensive ML-based screening for CH-HRG and provides a proof-of-concept that individuals with CH-HRG can be differentiated from those without, based on CBC indices.

William G. Dunn: Writing—review and editing; writing—original draft; investigation; methodology; visualization; software; formal analysis. Isabella Withnell: Investigation; methodology; writing—review and editing; formal analysis; software. Muxin Gu: Methodology. Pedro Quiros: Methodology. Sruthi Cheloor Kovilakam: Methodology. Ludovica Marando: Methodology. Sean Wen: Methodology. Margarete A. Fabre: Methodology. Irina Mohorianu: Methodology; supervision; writing—review and editing. Dragana Vuckovic: Writing—review and editing; supervision; methodology; conceptualization. George S. Vassiliou: Conceptualization; writing—review and editing; supervision.

G.S.V. is a consultant to STRM.BIO and holds a research grant from AstraZeneca for research unrelated to that presented here. S.W. is an employee of AstraZeneca. M.A.F. is an employee and stockholder of AstraZeneca. The other authors declare no competing interests.

W.G.D. is funded by a Clinical Research Fellowship from Cancer Research UK (CTRQQR-2021\100012). G.S.V. receives funding from a Specialist Centre of Research grant from the Leukemia and Lymphoma Society (7035-24); he also holds a Cancer Research UK Senior Cancer Fellowship (C22324/A23015), and work in his laboratory is also funded by the Kay Kendall Leukemia Fund, AstraZeneca, Blood Cancer UK, and the Wellcome Trust.

