Examining the Behavior of LLM Architectures Within the Framework of Standardized National Exams in Brazil

Marcelo Sartori Locatelli, Matheus Prado Miranda, Igor Joaquim da Silva Costa, Matheus Torres Prates, Victor Thomé, Mateus Zaparoli Monteiro, Tomas Lacerda, Adriana Pagano, Eduardo Rios Neto, Wagner Meira Jr., Virgilio Almeida
{"title":"在巴西国家标准化考试框架内考察法律硕士架构的行为","authors":"Marcelo Sartori Locatelli, Matheus Prado Miranda, Igor Joaquim da Silva Costa, Matheus Torres Prates, Victor Thomé, Mateus Zaparoli Monteiro, Tomas Lacerda, Adriana Pagano, Eduardo Rios Neto, Wagner Meira Jr., Virgilio Almeida","doi":"arxiv-2408.05035","DOIUrl":null,"url":null,"abstract":"The Exame Nacional do Ensino M\\'edio (ENEM) is a pivotal test for Brazilian\nstudents, required for admission to a significant number of universities in\nBrazil. The test consists of four objective high-school level tests on Math,\nHumanities, Natural Sciences and Languages, and one writing essay. Students'\nanswers to the test and to the accompanying socioeconomic status questionnaire\nare made public every year (albeit anonymized) due to transparency policies\nfrom the Brazilian Government. In the context of large language models (LLMs),\nthese data lend themselves nicely to comparing different groups of humans with\nAI, as we can have access to human and machine answer distributions. We\nleverage these characteristics of the ENEM dataset and compare GPT-3.5 and 4,\nand MariTalk, a model trained using Portuguese data, to humans, aiming to\nascertain how their answers relate to real societal groups and what that may\nreveal about the model biases. We divide the human groups by using\nsocioeconomic status (SES), and compare their answer distribution with LLMs for\neach question and for the essay. We find no significant biases when comparing\nLLM performance to humans on the multiple-choice Brazilian Portuguese tests, as\nthe distance between model and human answers is mostly determined by the human\naccuracy. A similar conclusion is found by looking at the generated text as,\nwhen analyzing the essays, we observe that human and LLM essays differ in a few\nkey factors, one being the choice of words where model essays were easily\nseparable from human ones. The texts also differ syntactically, with LLM\ngenerated essays exhibiting, on average, smaller sentences and less thought\nunits, among other differences. These results suggest that, for Brazilian\nPortuguese in the ENEM context, LLM outputs represent no group of humans, being\nsignificantly different from the answers from Brazilian students across all\ntests.","PeriodicalId":501112,"journal":{"name":"arXiv - CS - Computers and Society","volume":"107 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-08-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Examining the Behavior of LLM Architectures Within the Framework of Standardized National Exams in Brazil\",\"authors\":\"Marcelo Sartori Locatelli, Matheus Prado Miranda, Igor Joaquim da Silva Costa, Matheus Torres Prates, Victor Thomé, Mateus Zaparoli Monteiro, Tomas Lacerda, Adriana Pagano, Eduardo Rios Neto, Wagner Meira Jr., Virgilio Almeida\",\"doi\":\"arxiv-2408.05035\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The Exame Nacional do Ensino M\\\\'edio (ENEM) is a pivotal test for Brazilian\\nstudents, required for admission to a significant number of universities in\\nBrazil. The test consists of four objective high-school level tests on Math,\\nHumanities, Natural Sciences and Languages, and one writing essay. Students'\\nanswers to the test and to the accompanying socioeconomic status questionnaire\\nare made public every year (albeit anonymized) due to transparency policies\\nfrom the Brazilian Government. 
In the context of large language models (LLMs),\\nthese data lend themselves nicely to comparing different groups of humans with\\nAI, as we can have access to human and machine answer distributions. We\\nleverage these characteristics of the ENEM dataset and compare GPT-3.5 and 4,\\nand MariTalk, a model trained using Portuguese data, to humans, aiming to\\nascertain how their answers relate to real societal groups and what that may\\nreveal about the model biases. We divide the human groups by using\\nsocioeconomic status (SES), and compare their answer distribution with LLMs for\\neach question and for the essay. We find no significant biases when comparing\\nLLM performance to humans on the multiple-choice Brazilian Portuguese tests, as\\nthe distance between model and human answers is mostly determined by the human\\naccuracy. A similar conclusion is found by looking at the generated text as,\\nwhen analyzing the essays, we observe that human and LLM essays differ in a few\\nkey factors, one being the choice of words where model essays were easily\\nseparable from human ones. The texts also differ syntactically, with LLM\\ngenerated essays exhibiting, on average, smaller sentences and less thought\\nunits, among other differences. These results suggest that, for Brazilian\\nPortuguese in the ENEM context, LLM outputs represent no group of humans, being\\nsignificantly different from the answers from Brazilian students across all\\ntests.\",\"PeriodicalId\":501112,\"journal\":{\"name\":\"arXiv - CS - Computers and Society\",\"volume\":\"107 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-08-09\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Computers and Society\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2408.05035\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Computers and Society","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2408.05035","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
摘要
巴西国家教育考试(ENEM)是巴西学生的一项重要考试,是巴西许多大学的入学考试。该考试包括数学、人文科学、自然科学和语言四门高中水平的客观测试,以及一篇写作论文。由于巴西政府的透明政策,学生对测试和随附的社会经济状况问卷的答案每年都会公开(尽管是匿名的)。在大型语言模型(LLM)的背景下,这些数据非常适合将不同的人类群体与人工智能进行比较,因为我们可以获得人类和机器的答案分布。我们利用 ENEM 数据集的这些特点,将 GPT-3.5 和 4,以及使用葡萄牙语数据训练的模型 MariTalk 与人类进行比较,旨在确定他们的答案与真实社会群体的关系,以及这可能揭示的模型偏差。我们按照社会经济地位(SES)划分人类群体,并将他们的答案分布与 LLMs 的每个问题和文章进行比较。我们发现,在巴西葡萄牙语的多选题测试中,将 LLM 的表现与人类进行比较并没有发现明显的偏差,因为模型与人类答案之间的距离主要由人类的准确性决定。通过观察生成的文本,我们也发现了类似的结论,因为在分析文章时,我们发现人类和 LLM 的文章在几个关键因素上存在差异,其中之一就是选词,模型文章很容易与人类文章区分开来。文本在句法上也存在差异,除其他差异外,LLM 生成的文章平均句子较小,思想单元较少。这些结果表明,对于 ENEM 语境中的巴西葡萄牙语,LLM 输出结果并不代表人类群体,在所有测试中都与巴西学生的答案存在显著差异。
Abstract
The Exame Nacional do Ensino Médio (ENEM) is a pivotal test for Brazilian students, required for admission to a significant number of universities in Brazil. The exam consists of four objective, high-school-level tests on Mathematics, Humanities, Natural Sciences, and Languages, plus a written essay. Students' answers to the test and to the accompanying socioeconomic status questionnaire are made public every year (albeit anonymized) under the Brazilian government's transparency policies. In the context of large language models (LLMs), these data lend themselves nicely to comparing different groups of humans with AI, since we have access to both human and machine answer distributions. We leverage these characteristics of the ENEM dataset and compare GPT-3.5, GPT-4, and MariTalk, a model trained on Portuguese data, to humans, aiming to ascertain how their answers relate to real societal groups and what that may reveal about model biases. We divide the human groups by socioeconomic status (SES) and compare their answer distributions with those of the LLMs for each question and for the essay. We find no significant biases when comparing LLM performance to humans on the multiple-choice Brazilian Portuguese tests, as the distance between model and human answers is mostly determined by human accuracy.
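The abstract does not specify how the distance between model and human answer distributions is measured, so the following is only a hedged sketch of the kind of per-question comparison it describes. It contrasts a hypothetical SES group's answer distribution on one item with an LLM's, using the Jensen-Shannon distance as one plausible metric; the data and the metric choice are illustrative assumptions, not the authors' pipeline.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def answer_distribution(answers, options=("A", "B", "C", "D", "E")):
    """Turn one question's answer list into a probability vector over options."""
    counts = np.array([answers.count(opt) for opt in options], dtype=float)
    return counts / counts.sum()

# Hypothetical answers from one SES group and one LLM to the same item
# (ENEM multiple-choice items offer five alternatives, A through E).
human_answers = ["A", "C", "C", "B", "C", "E", "C", "A", "C", "D"]
llm_answers = ["C", "C", "C", "B", "C", "C", "A", "C", "C", "C"]

p = answer_distribution(human_answers)
q = answer_distribution(llm_answers)

# jensenshannon returns the distance (square root of the JS divergence),
# which lies in [0, 1] when base=2.
print(f"JS distance: {jensenshannon(p, q, base=2):.3f}")
```

Repeating such a comparison for every question, SES group, and model yields the distance profiles that the finding above summarizes.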
A similar conclusion emerges from the generated text: when analyzing the essays, we observe that human and LLM essays differ in a few key factors, one being word choice, on which model essays were easily separable from human ones. The texts also differ syntactically, with LLM-generated essays exhibiting, on average, shorter sentences and fewer thought units, among other differences. These results suggest that, for Brazilian Portuguese in the ENEM context, LLM outputs represent no group of humans, being significantly different from the answers of Brazilian students across all tests.
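The essay findings (word-choice separability, shorter sentences) suggest a simple stylometric comparison. The sketch below shows the shape of such an analysis with a made-up two-essay corpus and an assumed TF-IDF plus logistic-regression setup; none of this is the authors' actual feature set or classifier.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def mean_sentence_length(essay: str) -> float:
    """Average tokens per sentence, using a naive punctuation split."""
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    return sum(len(s.split()) for s in sentences) / len(sentences)

# Hypothetical labeled essays (1 = LLM-generated, 0 = human-written).
essays = [
    "Education is key. It opens doors. Society benefits.",
    "Although education has long been recognized as a driver of "
    "social mobility, its benefits reach individuals unevenly.",
]
labels = [1, 0]

# Syntactic comparison: average sentence length per essay.
for text, label in zip(essays, labels):
    print(label, round(mean_sentence_length(text), 1))

# Word-choice separability: bag-of-words features + a linear classifier.
# With a real corpus this would be evaluated with held-out data; training
# accuracy on two toy essays only demonstrates the mechanics.
X = TfidfVectorizer().fit_transform(essays)
clf = LogisticRegression().fit(X, labels)
print("training accuracy:", clf.score(X, labels))
```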