{"title":"Egalitarian Language Representation in Language Models: It All Begins with Tokenizers","authors":"Menan Velayuthan, Kengatharaiyer Sarveswaran","doi":"arxiv-2409.11501","DOIUrl":null,"url":null,"abstract":"Tokenizers act as a bridge between human language and the latent space of\nlanguage models, influencing how language is represented in these models. Due\nto the immense popularity of English-Centric Large Language Models (LLMs),\nefforts are being made to adapt them for other languages. However, we\ndemonstrate that, from a tokenization standpoint, not all tokenizers offer fair\nrepresentation for complex script languages such as Tamil, Sinhala, and Hindi,\nprimarily due to the choice of pre-tokenization methods. We go further to show\nthat pre-tokenization plays a more critical role than the tokenization\nalgorithm itself in achieving an egalitarian representation of these complex\nscript languages. To address this, we introduce an improvement to the Byte Pair\nEncoding (BPE) algorithm by incorporating graphemes, which we term Grapheme\nPair Encoding (GPE). Our experiments show that grapheme-based character\nextraction outperforms byte-level tokenizers for complex scripts. We validate\nthis approach through experiments on Tamil, Sinhala, and Hindi.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"50 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Computation and Language","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.11501","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Tokenizers act as a bridge between human language and the latent space of language models, influencing how language is represented in these models. Due to the immense popularity of English-centric Large Language Models (LLMs), efforts are being made to adapt them to other languages. However, we demonstrate that, from a tokenization standpoint, not all tokenizers offer fair representation for complex-script languages such as Tamil, Sinhala, and Hindi, primarily due to the choice of pre-tokenization method. We further show that pre-tokenization plays a more critical role than the tokenization algorithm itself in achieving an egalitarian representation of these complex-script languages. To address this, we introduce an improvement to the Byte Pair Encoding (BPE) algorithm that incorporates graphemes, which we term Grapheme Pair Encoding (GPE). Our experiments show that grapheme-based character extraction outperforms byte-level tokenizers for complex scripts. We validate this approach through experiments on Tamil, Sinhala, and Hindi.
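To make the distinction concrete, the sketch below contrasts byte-level, codepoint-level, and grapheme-level segmentation of a Tamil word. It is an illustration of grapheme-cluster extraction under stated assumptions, not the authors' GPE implementation: the `graphemes` helper is a hypothetical name, and the example assumes the third-party Python `regex` package, whose `\X` pattern matches Unicode extended grapheme clusters.

```python
# A minimal sketch contrasting byte-, codepoint-, and grapheme-level
# segmentation; NOT the paper's GPE implementation. Assumes the third-party
# `regex` package (pip install regex), whose \X pattern matches Unicode
# extended grapheme clusters.
import regex

def graphemes(text: str) -> list[str]:
    # Hypothetical helper: split text into extended grapheme clusters.
    return regex.findall(r"\X", text)

word = "தமிழ்"  # "Tamil" written in Tamil script

print(list(word.encode("utf-8")))  # byte level: 15 UTF-8 bytes
print(list(word))                  # codepoint level: 5 codepoints
print(graphemes(word))             # grapheme level: ['த', 'மி', 'ழ்'], 3 units
```

The gap the abstract describes is visible here: a word a reader perceives as three characters becomes fifteen atomic symbols under byte-level pre-tokenization, while grapheme clusters keep combining vowel signs and viramas attached to their base letters, giving scripts like Tamil, Sinhala, and Hindi units comparable to Latin-script characters.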