Accuracy of artificial intelligence platforms on equine topics
S. Aldworth-Yang, S.J. Coleman, K. O'Reilly, D. Catalano
Journal of Equine Veterinary Science, Volume 148, Article 105506 (May 2025)
DOI: 10.1016/j.jevs.2025.105506
URL: https://www.sciencedirect.com/science/article/pii/S0737080625001649
Abstract
Artificial intelligence (AI) is becoming increasingly popular as a resource for information across all topics, including equine-related areas. However, AI models pull information from a variety of sources and do not always distinguish between fact and opinion. The objective of this study was to evaluate the accuracy of AI-generated answers on equine topics from three AI platforms. Our hypothesis was that AI platforms could answer basic equine questions well but would not be able to accurately answer more complex questions or topics. The three AI platforms (P) evaluated were ChatGPT (CGPT), Microsoft Copilot (MicCP), and Extension Bot (ExtBot). Researchers asked 40 questions on general horse care, facilities management, nutrition, genetics, and reproduction (topics; T) at four levels (L): beginner (beg.), intermediate (int.), advanced (adv.), and “hot topics” (HT; areas of current interest in the industry). Answers were evaluated for accuracy, relevance, thoroughness, and source quality (10 points each; total score [TS] out of 40 points). Accuracy was determined by referencing textbooks and topic experts. Data were analyzed using PROC GLM in SAS (v. 9.4). Both CGPT and MicCP answered 40 of 40 questions, whereas ExtBot answered 33 of 40. Total score was not affected by P (P = 0.197) or T (P = 0.536), but there was an effect of L (P = 0.002): across platforms, beg. and int. questions had higher TS than adv. or HT questions, indicating that topic complexity plays a role in answer quality. Accuracy was affected by P (P < 0.001), L (P < 0.001), and T (P = 0.015): ExtBot scored lower than both CGPT and MicCP, HT and adv. questions scored lower than beg. or int. questions, and reproduction scored lower than all other topics. Relevance was affected by P (P = 0.042) and L (P < 0.001) but not T (P = 0.099): ChatGPT answers contained more irrelevant information than MicCP and ExtBot answers, which may indicate a weakness in parsing out only essential information, and answers to HT questions included less relevant information than int. answers. Thoroughness was affected by P (P < 0.001) and L (P = 0.002) but not T (P = 0.282): ChatGPT was the most thorough, followed by MicCP and then ExtBot, and both beg. and int. answers were more thorough than HT or adv. answers. Source quality was affected by P (P = 0.037) but not L (P = 0.645) or T (P = 0.558), with ExtBot using higher-quality sources than CGPT and MicCP. Overall, the AI programs struggled with complex topics and were inconsistent in their strengths. This research demonstrates that although AI tools may have potential as resources, they currently fall short of the expertise and knowledge offered by equine extension specialists.
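The abstract specifies the design concretely: 40 questions spanning 5 topics and 4 levels, graded on four 10-point criteria summed to a 40-point total, with factor effects tested via PROC GLM in SAS 9.4. As a rough illustration only, the sketch below mirrors that structure in Python with statsmodels; the column names, synthetic scores, and main-effects model form are assumptions for illustration, not the authors' actual code or data.

```python
# Hypothetical sketch of the scoring layout and GLM-style analysis described
# in the abstract. Placeholder data only; the study itself used SAS PROC GLM.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

rng = np.random.default_rng(0)
platforms = ["CGPT", "MicCP", "ExtBot"]
topics = ["care", "facilities", "nutrition", "genetics", "reproduction"]
levels = ["beg", "int", "adv", "HT"]

# One row per graded answer: four 10-point criteria summed to a 40-point total.
# 5 topics x 4 levels x 2 questions per cell = 40 questions, as in the study.
rows = []
for p in platforms:
    for t in topics:
        for lvl in levels:
            for q in range(2):  # assumed 2 questions per topic-by-level cell
                scores = rng.integers(5, 11, size=4)  # accuracy, relevance, thoroughness, sources
                rows.append({"platform": p, "topic": t, "level": lvl,
                             "accuracy": scores[0], "relevance": scores[1],
                             "thoroughness": scores[2], "source_quality": scores[3],
                             "total": int(scores.sum())})
df = pd.DataFrame(rows)

# Main-effects linear model analogous to the reported analysis:
# total score ~ platform + level + topic. With this balanced layout,
# Type I/II/III sums of squares coincide for the main effects.
model = ols("total ~ C(platform) + C(level) + C(topic)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```

The same model can be refit with accuracy, relevance, thoroughness, or source_quality as the response to reproduce the per-criterion comparisons the abstract reports.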
Journal overview:
Journal of Equine Veterinary Science (JEVS) is an international publication designed for the practicing equine veterinarian, equine researcher, and other equine health care specialists. Published monthly, each issue of JEVS includes original research, reviews, case reports, short communications, and clinical techniques from leaders in the equine veterinary field, covering such topics as laminitis, reproduction, infectious disease, parasitology, behavior, podology, internal medicine, surgery, and nutrition.