Combining phenotypic and genomic data to improve prediction of binary traits

IF 1.2 | Zone 4, Mathematics | Q2, Statistics & Probability

Journal of Applied Statistics, Pub Date: 2023-05-16, DOI: 10.1080/02664763.2023.2208773

D. Jarquin, A. Roy, B. Clarke, S. Ghosal



Abstract

Plant breeders want to develop cultivars that outperform existing genotypes. Some characteristics (here ‘main traits’) of these cultivars are categorical and difficult to measure directly. It is important to predict the main trait of newly developed genotypes accurately. In addition to marker data, breeding programs often have information on secondary traits (or ‘phenotypes’) that are easy to measure. Our goal is to improve prediction of main traits with interpretable relations by combining the two data types using variable selection techniques. However, the genomic characteristics can overwhelm the set of secondary traits, so a standard technique may fail to select any phenotypic variables. We develop a new statistical technique that ensures appropriate representation from both the secondary traits and the genotypic variables for optimal prediction. When two data types (markers and secondary traits) are available, we achieve improved prediction of a binary trait by two steps that are designed to ensure that a significant intrinsic effect of a phenotype is incorporated in the relation before accounting for extra effects of genotypes. First, we sparsely regress the secondary traits on the markers and replace the secondary traits by their residuals to obtain the effects of phenotypic variables as adjusted by the genotypic variables. Then, we develop a sparse logistic classifier using the markers and residuals so that the adjusted phenotypes may be selected first to avoid being overwhelmed by the genotypic variables due to their numerical advantage. This classifier uses forward selection aided by a penalty term and can be computed effectively by a technique called the one-pass method. It compares favorably with other classifiers on simulated and real data.
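The two steps described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: step 1 uses a plain coordinate-descent lasso as the sparse regression, and step 2 swaps in an L1-penalized logistic regression with a lighter penalty on the adjusted phenotypes as a stand-in for the paper's forward selection with a penalty term and its one-pass method, neither of which is specified in the abstract. All function names, penalty values, and the simulated data are hypothetical.

```python
import math
import random

def soft_threshold(z, t):
    """Soft-thresholding operator behind L1-penalized updates."""
    return z - t if z > t else z + t if z < -t else 0.0

def center(M):
    """Subtract the column mean from every column of a list-of-rows matrix."""
    n = len(M)
    means = [sum(row[j] for row in M) / n for j in range(len(M[0]))]
    return [[v - m for v, m in zip(row, means)] for row in M]

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate-descent lasso for (1/2n)||y - Xb||^2 + lam*||b||_1, no intercept."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # equals y - X beta, since beta starts at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # constant column carries no information
            # Covariance of column j with the partial residual (feature j removed).
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta, resid

def residualize(markers, traits, lam):
    """Step 1: replace each (centered) secondary trait by its lasso residual."""
    cols = []
    for k in range(len(traits[0])):
        _, resid = lasso_cd(markers, [row[k] for row in traits], lam)
        cols.append(resid)
    return [[col[i] for col in cols] for i in range(len(traits))]

def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def sparse_logistic(Z, y, lams, step=0.1, n_iter=500):
    """Step 2 stand-in: proximal-gradient logistic regression, per-feature L1 penalty."""
    n, p = len(Z), len(Z[0])
    w, b = [0.0] * p, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * p, 0.0
        for i in range(n):
            err = sigmoid(b + sum(Z[i][j] * w[j] for j in range(p))) - y[i]
            grad_b += err / n
            for j in range(p):
                grad_w[j] += err * Z[i][j] / n
        b -= step * grad_b  # the intercept is not penalized
        w = [soft_threshold(w[j] - step * grad_w[j], step * lams[j]) for j in range(p)]
    return w, b

# Toy data, entirely simulated for illustration (not from the paper).
random.seed(0)
n, n_markers = 40, 4
markers = [[random.choice([-1.0, 0.0, 1.0]) for _ in range(n_markers)] for _ in range(n)]
noise = [random.gauss(0.0, 1.0) for _ in range(n)]
traits = [[0.8 * markers[i][0] + noise[i]] for i in range(n)]        # one secondary trait
main = [1 if noise[i] + markers[i][1] > 0 else 0 for i in range(n)]  # binary main trait

X = center(markers)
R = residualize(X, center(traits), lam=0.05)    # adjusted phenotypes
Z = [R[i] + X[i] for i in range(n)]             # adjusted phenotypes first, then markers
lams = [0.01] * len(R[0]) + [0.05] * n_markers  # assumed: lighter penalty on phenotypes
w, b = sparse_logistic(Z, main, lams)
selected = [j for j, wj in enumerate(w) if wj != 0.0]
```

Here `selected` holds the indices of the variables retained by the classifier, with index 0 being the adjusted phenotype. Giving the residuals a smaller penalty than the markers is only one simple way to keep a handful of phenotypic variables from being crowded out by a much larger set of markers; the paper's forward-selection scheme addresses the same concern by a different mechanism.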


Source journal

Journal of Applied Statistics (Mathematics: Statistics and Probability)

CiteScore: 3.40

Self-citation rate: 0.00%

Articles published: 126

Review time: 6 months

About the journal: Journal of Applied Statistics provides a forum for communication between applied statisticians and users of applied statistical techniques across a wide range of disciplines. These areas include business, computing, economics, ecology, education, management, medicine, operational research and sociology, but papers from other areas are also considered. The editorial policy is to publish rigorous but clear and accessible papers on applied techniques. Purely theoretical papers are avoided, but those on theoretical developments that clearly demonstrate significant applied potential are welcomed. Each paper is submitted to at least two independent referees.