CattleAssigner: A framework for accurate assignment of individuals to cattle lineages and populations using minimum informative markers

IF 7.7 | Zone 1, Agricultural and Forestry Sciences | Q1, AGRICULTURE, MULTIDISCIPLINARY
DOI: 10.1016/j.compag.2024.109427
Journal: Computers and Electronics in Agriculture
Publication date: 2024-09-10
Publication type: Journal Article
Open access: No
Source: https://www.sciencedirect.com/science/article/pii/S0168169924008184
Citations: 0

Abstract

CattleAssigner: A framework for accurate assignment of individuals to cattle lineages and populations using minimum informative markers

Assigning individual animals to their respective breeds, populations, or lineages is important for evolutionary analyses of global cattle populations and for detecting the underlying genetic variation that may have facilitated the adaptation of these breeds to diverse environmental conditions. It is also important for discovering geographic patterns of genetic variation in cattle populations and for tracing the geographical origin of breeds, food products, and diseases. The present study was therefore undertaken to identify the minimum number of informative single nucleotide polymorphism (SNP) markers needed to assign individual animals to the corresponding lineage/group (African, European, Indicine, or admixed) and to the respective population within that lineage/group, using two well-known supervised machine learning (ML) algorithms: Random Forest (RF) and Extreme Gradient Boosting (XGBoost). The markers were originally generated using the medium-density BovineSNP50 BeadChip across 1823 individuals representing 73 populations. Each of the two ML models was trained with the most informative SNP panels (with sizes of 48, 96, and 192), which were selected using two statistical methods, principal component analysis (PCA) and Wright's fixation index (FST), and two ML methods (RF with Gini, and RF with MDA). For each of these marker preselection approaches, three panels of the most discriminant SNPs (192, 96, and 48 SNPs) were created. The panels were evaluated on their performance in assigning animals to the respective lineage, population group, or population. For animal-to-lineage assignment, XGBoost achieved the best accuracy of 95% with the 192-SNP panel (selected via RF with MDA), followed by RF (93% accuracy) with the 192-SNP panel (selected via RF with either Gini or MDA). Similarly, RF trained with the 48-SNP panel (selected via RF with Gini) achieved the best accuracy of 97% for assigning animals to the African lineage, while it achieved the best accuracy of 89% for assigning animals to admixed populations using the 96-SNP panel (selected via PCA). XGBoost, on the other hand, achieved the best accuracy of 88% for assigning animals to European breeds using the 192-SNP panel (selected via FST). Both RF and XGBoost performed poorly in assigning animals of the Indicine lineage to the correct group: the best accuracy for this assignment was 66%, achieved using RF with the 192-SNP panel (selected via FST). In conclusion, the study reports the applicability of statistical and ML approaches for identifying discriminatory SNPs that are useful for assigning individuals to the corresponding lineages and to the respective populations within lineages, and it reveals the efficiency of XGBoost and RF-based ML models in performing such assignments. Both ML models achieved better performance than the statistical ones in assigning animals to specific lineages, while they fared comparably to each other in assigning individuals to the respective populations within lineages or population groups.
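
The FST-based preselection step described in the abstract can be illustrated with a minimal sketch. The abstract does not say which FST estimator or software the authors used, so the heterozygosity-based form below, FST = (H_T - H_S) / H_T with sample-size weighting, is an assumption, and the function names `fst_per_snp` and `top_k_snps` are hypothetical. Genotypes are assumed to be coded as 0/1/2 copies of one allele at a biallelic SNP.

```python
from collections import defaultdict


def fst_per_snp(genotypes, populations):
    """Heterozygosity-based F_ST for one biallelic SNP (illustrative estimator).

    genotypes: list of 0/1/2 allele counts, one per individual.
    populations: list of population labels, same length as genotypes.
    """
    # label -> [count of the coded allele, number of individuals]
    counts = defaultdict(lambda: [0, 0])
    for g, pop in zip(genotypes, populations):
        counts[pop][0] += g
        counts[pop][1] += 1

    total_n = sum(n for _, n in counts.values())
    # Pooled allele frequency; each diploid individual carries two alleles.
    p_bar = sum(a for a, _ in counts.values()) / (2 * total_n)
    h_t = 2 * p_bar * (1 - p_bar)  # expected heterozygosity in the pooled sample
    if h_t == 0:
        return 0.0  # a monomorphic SNP cannot discriminate populations

    # Sample-size-weighted mean of within-population expected heterozygosity.
    h_s = sum(
        n * 2 * (a / (2 * n)) * (1 - a / (2 * n)) for a, n in counts.values()
    ) / total_n
    return (h_t - h_s) / h_t


def top_k_snps(matrix, populations, k):
    """Rank SNPs by F_ST and return the column indices of the k highest.

    matrix: list of rows (individuals), each a list of 0/1/2 genotypes (SNPs).
    """
    n_snps = len(matrix[0])
    scores = [
        fst_per_snp([row[j] for row in matrix], populations)
        for j in range(n_snps)
    ]
    return sorted(range(n_snps), key=lambda j: scores[j], reverse=True)[:k]


if __name__ == "__main__":
    # Toy data: 4 individuals x 4 SNPs, two populations (not from the study).
    toy = [[0, 1, 0, 0], [0, 1, 0, 1], [2, 1, 0, 1], [2, 1, 0, 2]]
    labels = ["A", "A", "B", "B"]
    print(top_k_snps(toy, labels, k=2))
```

In the study's setting, `k` would be 48, 96, or 192, and the selected columns would then be passed to the RF or XGBoost classifier. The toy matrix here only illustrates the ranking step.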

Source journal
Computers and Electronics in Agriculture (Engineering and Technology: Computer Science, Interdisciplinary Applications)
CiteScore: 15.30
Self-citation rate: 14.50%
Articles published: 800
Review time: 62 days
About the journal: Computers and Electronics in Agriculture provides international coverage of advancements in computer hardware, software, electronic instrumentation, and control systems applied to agricultural challenges. Encompassing agronomy, horticulture, forestry, aquaculture, and animal farming, the journal publishes original papers, reviews, and application notes. It explores the use of computers and electronics in plant or animal agricultural production, covering topics like agricultural soils, water, pests, controlled environments, and waste. The scope extends to on-farm post-harvest operations and relevant technologies, including artificial intelligence, sensors, machine vision, robotics, networking, and simulation modeling. Its companion journal, Smart Agricultural Technology, continues the focus on smart applications in production agriculture.