Egalitarian Language Representation in Language Models: It All Begins with Tokenizers

Menan Velayuthan, Kengatharaiyer Sarveswaran
arXiv - CS - Computation and Language, published 2024-09-17. DOI: arxiv-2409.11501 (https://doi.org/arxiv-2409.11501)

Abstract

Tokenizers act as a bridge between human language and the latent space of language models, influencing how language is represented in these models. Due to the immense popularity of English-Centric Large Language Models (LLMs), efforts are being made to adapt them for other languages. However, we demonstrate that, from a tokenization standpoint, not all tokenizers offer fair representation for complex script languages such as Tamil, Sinhala, and Hindi, primarily due to the choice of pre-tokenization methods. We go further to show that pre-tokenization plays a more critical role than the tokenization algorithm itself in achieving an egalitarian representation of these complex script languages. To address this, we introduce an improvement to the Byte Pair Encoding (BPE) algorithm by incorporating graphemes, which we term Grapheme Pair Encoding (GPE). Our experiments show that grapheme-based character extraction outperforms byte-level tokenizers for complex scripts. We validate this approach through experiments on Tamil, Sinhala, and Hindi.
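As a rough illustration of the idea behind GPE, the sketch below runs BPE-style pair merging over grapheme units rather than bytes. It is not the authors' implementation: the helper names `graphemes` and `train_merges` are hypothetical, and the grapheme splitter is a simplification that only attaches Unicode combining marks (general categories starting with "M") to the preceding character, instead of applying full Unicode grapheme cluster segmentation.

```python
import unicodedata
from collections import Counter


def graphemes(text):
    """Approximate grapheme clusters (assumption: a simplified rule).

    Each combining mark is attached to the character before it.
    Full Unicode segmentation (UAX #29) handles more cases.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def train_merges(words, num_merges):
    """Greedy BPE-style merging where the initial symbols are graphemes."""
    seqs = [graphemes(w) for w in words]
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across all words.
        pairs = Counter()
        for seq in seqs:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append((a, b))
        # Replace every occurrence of the chosen pair with one merged symbol.
        new_seqs = []
        for seq in seqs:
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                    out.append(a + b)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            new_seqs.append(out)
        seqs = new_seqs
    return merges, seqs


if __name__ == "__main__":
    word = "தமிழ்"  # "Tamil" written in Tamil script
    print("UTF-8 bytes:", len(word.encode("utf-8")))
    print("code points:", len(word))
    print("graphemes:  ", graphemes(word))
```

Running the script on a Tamil word shows how many more initial symbols a byte-level view produces than a grapheme-level one. That gap in starting units is the disparity the abstract attributes to the choice of pre-tokenization.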