Identification of population-informative markers from high-density genotyping data through combined feature selection and machine learning algorithms: Application to European autochthonous and cosmopolitan pig breeds

IF 1.8 · Zone 3 (Biology) · Q2 AGRICULTURE, DAIRY & ANIMAL SCIENCE
Animal genetics Pub Date : 2024-01-08 DOI:10.1111/age.13396
Giuseppina Schiavo, Francesca Bertolini, Samuele Bovo, Giuliano Galimberti, María Muñoz, Riccardo Bozzi, Marjeta Čandek-Potokar, Cristina Óvilo, Luca Fontanesi
Citations: 0

Abstract

Large genotyping datasets obtained from high-density single nucleotide polymorphism (SNP) arrays, which have been developed for different livestock species, can be used to describe and differentiate breeds or populations. A few statistical approaches have been proposed to identify the most discriminating genetic markers among thousands of genotyped SNPs. In this study, we applied the Boruta algorithm, a wrapper around the random forest machine learning algorithm, to a database of 23 European pig breeds (20 autochthonous and three cosmopolitan breeds) genotyped with a 70k SNP chip, to pre-select informative SNPs. To identify different sets of SNPs, these pre-selected markers were then ranked with random forest based on their mean decrease accuracy and mean decrease Gini indexes. We evaluated the efficiency of these subsets for breed classification and the usefulness of this approach for detecting candidate genes affecting breed-specific phenotypes and relevant production traits that might differ among breeds. The lowest overall classification error (2.3%) was reached with a subpanel including only 398 SNPs (ranked based on their mean decrease accuracy), with no classification error in seven breeds using up to 49 SNPs. Several SNPs of these selected subpanels were in genomic regions in which previous studies had identified signatures of selection or genes associated with morphological or production traits that distinguish the analysed breeds. Therefore, although these approaches were not originally designed to identify signatures of selection, the results show that they could potentially be useful for this purpose.
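The two-step idea in the abstract (pre-select SNPs against randomised "shadow" copies, then rank the survivors by an importance score) can be illustrated with a minimal Python sketch. This is not the authors' pipeline. It makes several loudly stated assumptions: the toy genotypes and SNP names are invented, the per-SNP Gini impurity decrease of a single split stands in for random forest importance, and Boruta's statistical test is replaced by a strict "beats the best shadow in every round" rule.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of breed labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_decrease(genotypes, breeds):
    """Impurity decrease when animals are split by genotype (0, 1, 2) at one SNP."""
    groups = {}
    for g, b in zip(genotypes, breeds):
        groups.setdefault(g, []).append(b)
    n = len(breeds)
    weighted = sum(len(v) / n * gini(v) for v in groups.values())
    return gini(breeds) - weighted


def shadow_select(snps, breeds, n_rounds=20, seed=0):
    """Simplified shadow-feature filter (assumption: not the real Boruta test).

    A SNP is kept only if its score exceeds the best score among shuffled
    (shadow) copies of all SNPs in every round. Kept SNPs are returned
    ranked by decreasing score.
    """
    rng = random.Random(seed)
    scores = {name: gini_decrease(col, breeds) for name, col in snps.items()}
    hits = Counter()
    for _ in range(n_rounds):
        best_shadow = 0.0
        for col in snps.values():
            shadow = col[:]
            rng.shuffle(shadow)  # break any link between genotype and breed
            best_shadow = max(best_shadow, gini_decrease(shadow, breeds))
        for name, score in scores.items():
            if score > best_shadow:
                hits[name] += 1
    kept = [name for name in snps if hits[name] == n_rounds]
    return sorted(kept, key=lambda name: scores[name], reverse=True), scores


# Hypothetical toy data: 3 breeds, 21 animals each, genotypes coded 0/1/2.
per_breed = 21
breeds = ["A"] * per_breed + ["B"] * per_breed + ["C"] * per_breed
snps = {
    # one genotype fixed per breed: fully breed-informative
    "snp_fixed": [0] * per_breed + [1] * per_breed + [2] * per_breed,
    # separates breed A from B and C only: partly informative
    "snp_partial": [0] * per_breed + [2] * per_breed + [2] * per_breed,
    # identical genotype frequencies in every breed: uninformative
    "snp_flat_1": [0, 1, 2] * (3 * per_breed // 3),
    "snp_flat_2": [2, 1, 0] * (3 * per_breed // 3),
}

kept, scores = shadow_select(snps, breeds)
print("kept SNPs, ranked:", kept)
for name in kept:
    print(name, round(scores[name], 3))
```

In this toy setting the two breed-differentiated SNPs survive the filter and the two SNPs with identical genotype frequencies across breeds are dropped. A real analysis would use a random forest implementation for both the Boruta step and the mean decrease accuracy or mean decrease Gini ranking, and would estimate classification error on animals not used for selection.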

Source journal
Animal Genetics (Biology: Dairy & Animal Science)
CiteScore: 4.60
Self-citation rate: 4.20%
Articles published: 115
Review time: 5 months
About the journal: Animal Genetics reports frontline research on immunogenetics, molecular genetics and functional genomics of economically important and domesticated animals. Publications include the study of variability at gene and protein levels, mapping of genes, traits and QTLs, associations between genes and traits, genetic diversity, and characterization of gene or protein expression and control related to phenotypic or genetic variation. The journal publishes full-length articles, short communications and brief notes, as well as commissioned and submitted mini-reviews on issues of interest to Animal Genetics readers.