CattleAssigner: A framework for accurate assignment of individuals to cattle lineages and populations using minimum informative markers
Computers and Electronics in Agriculture, published 2024-09-10. DOI: 10.1016/j.compag.2024.109427
https://www.sciencedirect.com/science/article/pii/S0168169924008184
Assigning individual animals to their respective breeds, populations, or lineages is highly significant for evolutionary analyses of global cattle populations and for detecting the underlying genetic variation that may have facilitated the adaptation of these breeds to diverse environmental conditions. It is also important for discovering geographic patterns of genetic variation in cattle populations and for tracing the geographical origin of breeds, food products, and diseases. Given this, the present study was undertaken to identify the minimum number of informative single nucleotide polymorphism (SNP) markers, originally generated using the medium-density BovineSNP50 BeadChip across 1823 individuals representing 73 populations, needed to assign individual animals to the corresponding lineage/group (African, European, Indicine, or admixed) and to the respective populations within that lineage/group, using two well-known supervised machine learning (ML) algorithms, namely Random Forest (RF) and Extreme Gradient Boosting (XGBoost). Each of the two ML models was trained with the most informative SNP panels (of sizes 48, 96, and 192), which were identified using two statistical methods, i.e., principal component analysis (PCA) and Wright's fixation index (FST), and two ML methods (RF with Gini importance and RF with mean decrease in accuracy, MDA). Three panels of the most discriminant SNPs (at densities of 192, 96, and 48) were created for each marker preselection approach. These panels were evaluated on their performance in assigning animals to the respective lineage, population group, or population. For animal-to-lineage assignment, XGBoost achieved the best accuracy of 95% with the 192-SNP panel (selected via RF with MDA), followed by RF (93% accuracy) with the 192-SNP panel (selected via RF with either Gini or MDA).
Similarly, RF trained with the 48-SNP panel (selected via RF with Gini) achieved the best accuracy of 97% for assigning animals to the African lineage, and the best accuracy of 89% for assigning animals to admixed populations using the 96-SNP panel (selected via PCA). XGBoost, on the other hand, achieved the best accuracy of 88% for assigning animals to European breeds using the 192-SNP panel (selected via FST). Both RF and XGBoost performed poorly in assigning animals of the Indicine lineage to the correct group: the best accuracy for this assignment was 66%, achieved using RF with the 192-SNP panel (selected via FST). In conclusion, the study reports the applicability of statistical and ML approaches for identifying discriminatory SNPs that are useful for assigning individuals to the corresponding lineages and to the respective populations within lineages, and it demonstrates the efficiency of XGBoost- and RF-based ML models in performing such assignments. Both ML models performed better than the statistical ones in assigning animals to specific lineages, while they fared comparably to each other in assigning individuals to the respective populations within lineages or population groups.
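The FST-based marker preselection described above can be illustrated with a minimal sketch. It assumes genotypes coded as 0/1/2 copies of one allele and uses the textbook form FST = (HT - HS) / HT, with HS weighted by group sample size. The abstract does not state which FST estimator the authors used, so this is an assumption, and the function names (`fst_per_snp`, `top_k_snps`), the group labels, and the toy genotypes are hypothetical and not taken from the paper.

```python
from collections import defaultdict


def allele_freq(codes):
    """Frequency of the counted allele from 0/1/2 genotype codes."""
    return sum(codes) / (2 * len(codes))


def fst_per_snp(genotypes, labels):
    """Return (H_T - H_S) / H_T for every SNP column.

    genotypes: one row per animal, each row a list of 0/1/2 codes.
    labels: group (lineage or population) label for each animal.
    """
    groups = defaultdict(list)
    for row, label in zip(genotypes, labels):
        groups[label].append(row)

    n_total = len(genotypes)
    scores = []
    for j in range(len(genotypes[0])):
        # Expected heterozygosity in the pooled sample.
        p_total = allele_freq([row[j] for row in genotypes])
        h_t = 2 * p_total * (1 - p_total)
        # Mean within-group expected heterozygosity, weighted by group size.
        h_s = 0.0
        for rows in groups.values():
            p = allele_freq([row[j] for row in rows])
            h_s += (len(rows) / n_total) * 2 * p * (1 - p)
        # A monomorphic SNP carries no information; score it as 0.
        scores.append((h_t - h_s) / h_t if h_t > 0 else 0.0)
    return scores


def top_k_snps(scores, k):
    """Indices of the k highest-scoring SNPs, best first."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return order[:k]


# Toy data (made up): 4 animals x 3 SNPs, two groups of two animals each.
genotypes = [
    [2, 1, 0],
    [2, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
]
labels = ["group_A", "group_A", "group_B", "group_B"]

scores = fst_per_snp(genotypes, labels)
panel = top_k_snps(scores, k=1)
print(scores)  # [1.0, 0.0, 0.0]
print(panel)   # [0]
```

In this toy example the first SNP is fixed for different alleles in the two groups and therefore scores 1, while the other two do not separate the groups. In a setting like the one in the abstract, `k` would be 48, 96, or 192, and the selected columns would then be passed to a classifier such as RF or XGBoost.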
About the journal:
Computers and Electronics in Agriculture provides international coverage of advancements in computer hardware, software, electronic instrumentation, and control systems applied to agricultural challenges. Encompassing agronomy, horticulture, forestry, aquaculture, and animal farming, the journal publishes original papers, reviews, and applications notes. It explores the use of computers and electronics in plant or animal agricultural production, covering topics like agricultural soils, water, pests, controlled environments, and waste. The scope extends to on-farm post-harvest operations and relevant technologies, including artificial intelligence, sensors, machine vision, robotics, networking, and simulation modeling. Its companion journal, Smart Agricultural Technology, continues the focus on smart applications in production agriculture.