Bridging large language model disparities: Skill tagging of multilingual educational content

IF 6.7 · JCR Q1, Education & Educational Research (CAS Tier 1, Education)
Yerin Kwak, Zachary A. Pardos
DOI: 10.1111/bjet.13465 · Journal Article · Published 2024-05-08 · British Journal of Educational Technology · PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1111/bjet.13465
Citations: 0

Abstract


Bridging large language model disparities: Skill tagging of multilingual educational content

The adoption of large language models (LLMs) in education holds much promise. However, like many technological innovations before them, adoption and access can often be inequitable from the outset, creating more divides than they bridge. In this paper, we explore the magnitude of the country and language divide in the leading open-source and proprietary LLMs with respect to knowledge of K-12 taxonomies in a variety of countries and their performance on tagging problem content with the appropriate skill from a taxonomy, an important task for aligning open educational resources and tutoring content with state curricula. We also experiment with approaches to narrowing the performance divide by enhancing LLM skill tagging performance across four countries (the USA, Ireland, South Korea and India–Maharashtra) for more equitable outcomes. We observe considerable performance disparities not only with non-English languages but with English and non-US taxonomies. Our findings demonstrate that fine-tuning GPT-3.5 with a few labelled examples can improve its proficiency in tagging problems with relevant skills or standards, even for countries and languages that are underrepresented during training. Furthermore, the fine-tuning results show the potential viability of GPT as a multilingual skill classifier. Using both an open-source model, Llama2-13B, and a closed-source model, GPT-3.5, we also observe large disparities in tagging performance between the two and find that fine-tuning and skill information in the prompt improve both, but the closed-source model improves to a much greater extent. Our study contributes to the first empirical results on mitigating disparities across countries and languages with LLMs in an educational context.
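The abstract describes fine-tuning GPT-3.5 with a few labelled examples so it tags a problem with the appropriate skill from a taxonomy. As an illustration only, the sketch below shows one plausible way to format such a labelled example in the chat-style JSONL layout commonly used for fine-tuning; the message wording, helper name and skill list are assumptions, not the authors' actual training format.

```python
import json

def build_tagging_example(problem_text, skill_list, correct_skill=None):
    """Build one chat-format record for skill tagging.

    Hypothetical helper: the message layout and phrasing are
    illustrative assumptions, not the paper's actual format.
    """
    messages = [
        {"role": "system",
         "content": "You are a classifier that tags a K-12 problem with "
                    "exactly one skill from the given taxonomy."},
        {"role": "user",
         "content": "Skills:\n- " + "\n- ".join(skill_list)
                    + "\n\nProblem: " + problem_text
                    + "\nAnswer with the single best-matching skill."},
    ]
    if correct_skill is not None:
        # A labelled example for fine-tuning includes the target skill
        # as the assistant turn; omit it to build an inference prompt.
        messages.append({"role": "assistant", "content": correct_skill})
    return {"messages": messages}

# One labelled record, serialisable as a JSONL line for fine-tuning
record = build_tagging_example(
    "Solve 3x + 5 = 20 for x.",
    ["Linear equations in one variable", "Area of triangles", "Fractions"],
    correct_skill="Linear equations in one variable",
)
print(json.dumps(record))
```

Dropping `correct_skill` yields the same prompt without the answer, which matches the paper's other lever: putting the skill (taxonomy) information directly in the prompt at inference time.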

Practitioner notes

What is already known about this topic

  • Recent advances in generative AI have led to increased applications of LLMs in education, offering diverse opportunities.
  • LLMs excel predominantly in English and exhibit a bias towards the US context.
  • Automated content tagging has been studied using English-language content and taxonomies.

What this paper adds

  • Investigates the country and language disparities in LLMs concerning knowledge of educational taxonomies and their performance in tagging content.
  • Presents the first empirical findings on addressing disparities in LLM performance across countries and languages within an educational context.
  • Improves GPT-3.5's tagging accuracy through fine-tuning, even for non-US countries, starting from zero accuracy.
  • Extends automated content tagging to non-English languages using both open-source and closed-source LLMs.

Implications for practice and/or policy

  • Underscores the importance of considering the performance generalizability of LLMs to languages other than English.
  • Highlights the potential viability of ChatGPT as a skill tagging classifier across countries.
Source journal
British Journal of Educational Technology
CiteScore: 15.60
Self-citation rate: 4.50%
Articles per year: 111
Journal description: BJET is a primary source for academics and professionals in the fields of digital educational and training technology throughout the world. The Journal is published by Wiley on behalf of The British Educational Research Association (BERA). It publishes theoretical perspectives, methodological developments and high quality empirical research that demonstrate whether and how applications of instructional/educational technology systems, networks, tools and resources lead to improvements in formal and non-formal education at all levels, from early years through to higher, technical and vocational education, professional development and corporate training.