Spanish is not just one: A dataset of Spanish dialect recognition for LLMs

IF 1.4 Q3 MULTIDISCIPLINARY SCIENCES

Data in Brief. Pub Date: 2025-09-18. DOI: 10.1016/j.dib.2025.112088

Gonzalo Martínez, Marina Mayor-Rocher, Cris Pozo Huertas, Nina Melero, María Grandury, Pedro Reviriego

Data in Brief, volume 63, Article 112088. Article page: https://www.sciencedirect.com/science/article/pii/S2352340925008108

Citations: 0

Abstract

This paper presents a dataset designed to assess the capability of Large Language Models (LLMs) in handling different Spanish dialects. While multilingualism is widely recognized as a crucial aspect of NLP, dialectal evaluation remains largely unexplored. Spanish, spoken by over 600 million people, exhibits significant lexical, morphological, and syntactic variation across regions. Recognizing these linguistic and cultural differences is essential for preserving smaller dialects, preventing their marginalization, and ensuring that Spanish is not reduced to a monolithic language. To address this gap, we introduce a dataset specifically designed to analyze whether LLMs can accurately identify different Spanish varieties while also measuring their potential preference for specific dialects. The dataset consists of 30 carefully crafted multiple-choice questions, requiring models to select the most appropriate option from different regional variations. Each question has been meticulously developed and reviewed by linguistic experts, undergoing multiple refinement cycles to ensure linguistic accuracy and effectiveness in detecting dialectal biases. This dataset represents an important step toward developing more inclusive and fair evaluation frameworks for Spanish Natural Language Processing (NLP). By identifying potential biases in LLMs and analyzing their ability to adapt to regional linguistic variations, this work contributes to the broader goal of equitable language representation in AI-driven text generation and comprehension tasks.
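The abstract names two measurements: whether a model identifies the right variety, and whether it leans toward particular varieties. The sketch below shows one way such answers could be scored. It is only an illustration under stated assumptions: the `Item` schema, the `variety_1`/`variety_2` labels, and both function names are hypothetical and are not taken from the published dataset, whose file format is not described here.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Item:
    """One multiple-choice question (hypothetical schema, not the dataset's)."""
    options: Dict[str, str]       # option label -> variety label
    target: Optional[str] = None  # variety the question asks for, if any


def variety_preference(items: List[Item], answers: List[str]) -> Dict[str, float]:
    """Share of the model's answers that fall on each variety."""
    picked = Counter(item.options[ans] for item, ans in zip(items, answers))
    total = sum(picked.values())
    return {variety: count / total for variety, count in picked.items()}


def recognition_accuracy(items: List[Item], answers: List[str]) -> float:
    """Fraction of targeted items where the chosen option matches the target."""
    scored = [(i, a) for i, a in zip(items, answers) if i.target is not None]
    if not scored:
        return 0.0
    hits = sum(item.options[ans] == item.target for item, ans in scored)
    return hits / len(scored)


# Placeholder items and answers, invented for illustration only.
items = [
    Item({"A": "variety_1", "B": "variety_2"}, target="variety_1"),
    Item({"A": "variety_2", "B": "variety_1"}, target="variety_1"),
    Item({"A": "variety_1", "B": "variety_2"}),  # no target: preference only
    Item({"A": "variety_1", "B": "variety_2"}, target="variety_2"),
]
answers = ["A", "A", "A", "B"]

preference = variety_preference(items, answers)
accuracy = recognition_accuracy(items, answers)
print(preference, accuracy)
```

Keeping the two scores separate reflects the abstract's distinction: a model can recognize varieties correctly when asked and still favor one variety when no target is given.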


Source journal: Data in Brief (MULTIDISCIPLINARY SCIENCES)

CiteScore: 3.10
Self-citation rate: 0.00%
Articles published: 996
Review time: 70 days

Journal description: Data in Brief provides a way for researchers to easily share and reuse each other's datasets by publishing data articles that:
- Thoroughly describe your data, facilitating reproducibility.
- Make your data, which is often buried in supplementary material, easier to find.
- Increase traffic towards associated research articles and data, leading to more citations.
- Open up doors for new collaborations.

Because you never know what data will be useful to someone else, Data in Brief welcomes submissions that describe data from all research areas.