{"title":"Storing Unicode data in TeX engines","authors":"Joseph Wright","doi":"10.47397/tb/44-1/tb136wright-unidata","DOIUrl":null,"url":null,"abstract":"Unicode has become established over the past three decades as the international standard for representing text in computer systems. By far the most common input encoding in use today is UTF-8, in which Unicode text is represented by a variable number of bytes: between one and four. Unicode deals with codepoints: a numerical representation for each character. There are in principle 1 114 112 codepoints available, although not all are currently assigned and some of these are reserved for ‘private use’ for ad hoc requirements. Each codepoint has many different properties. For example, depending on our application, we might need to know whether a codepoint is a (lower case) letter, how it should be treated at a line break, how its width is treated (for East Asian characters), etc. Unicode provides a range of data files which tabulate this information. These files are human-readable and are, in the main, purely ASCII text: they are therefore not tied to any particular programming language for usage. The full set of files is available from unicode.org/Public/UCD/latest/ucd/: the complete current set as a zip is around 6.7MiB. There are of course standard libraries for common programming languages such as C which both load this data and provide implementations of the algorithms which use this data: things like changing case, breaking text into lines and so on. However, these are not readily available to us as TEX programmers. Thus, if we want to be able to properly implement Unicode algorithms, we will need to look at how to load the relevant data and store it within TEX in an efficient manner. Here, I will focus on how the LTEX team is approaching the data storage challenge. 
I will show how the particular requirements of implementing in TEX mean we need to use a mix of approaches, depending on exactly which data we are looking at. The current implementation for loading this data in expl3 is available at github.com/latex3/latex3/ blob/main/l3kernel/l3unicode.dtx, and is read as part of the LTEX2ε format-building process.","PeriodicalId":93390,"journal":{"name":"TUGboat (Providence, R.I.)","volume":"1 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"TUGboat (Providence, R.I.)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.47397/tb/44-1/tb136wright-unidata","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Unicode has become established over the past three decades as the international standard for representing text in computer systems. By far the most common input encoding in use today is UTF-8, in which Unicode text is represented by a variable number of bytes: between one and four. Unicode deals with codepoints: a numerical representation for each character. There are in principle 1 114 112 codepoints available, although not all are currently assigned and some of these are reserved for ‘private use’ for ad hoc requirements. Each codepoint has many different properties. For example, depending on our application, we might need to know whether a codepoint is a (lower case) letter, how it should be treated at a line break, how its width is treated (for East Asian characters), etc. Unicode provides a range of data files which tabulate this information. These files are human-readable and are, in the main, purely ASCII text: they are therefore not tied to any particular programming language for usage. The full set of files is available from unicode.org/Public/UCD/latest/ucd/: the complete current set as a zip is around 6.7 MiB. There are of course standard libraries for common programming languages such as C which both load this data and provide implementations of the algorithms which use this data: things like changing case, breaking text into lines and so on. However, these are not readily available to us as TeX programmers. Thus, if we want to be able to properly implement Unicode algorithms, we will need to look at how to load the relevant data and store it within TeX in an efficient manner. Here, I will focus on how the LaTeX team is approaching the data storage challenge. I will show how the particular requirements of implementing in TeX mean we need to use a mix of approaches, depending on exactly which data we are looking at.
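The two facts the abstract leans on, that UTF-8 uses one to four bytes per codepoint and that each codepoint carries tabulated properties, can be illustrated with a short sketch. This uses Python's standard `unicodedata` module purely as a stand-in for the UCD data files discussed in the article; it is not part of the TeX-side implementation.

```python
import unicodedata

# UTF-8 encodes each codepoint in one to four bytes,
# depending on the codepoint's magnitude.
for ch in ["A", "é", "€", "𝕏"]:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}: {len(encoded)} byte(s)")

# Each codepoint has properties tabulated in the UCD, e.g. its
# general category ('Ll' = lowercase letter) and East Asian width.
print(unicodedata.category("a"))           # 'Ll'
print(unicodedata.east_asian_width("界"))  # 'W' (wide)
```

The four sample characters span exactly the one- to four-byte ranges of UTF-8 (Basic Latin, Latin-1, the BMP beyond U+07FF, and a supplementary-plane character).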
The current implementation for loading this data in expl3 is available at github.com/latex3/latex3/blob/main/l3kernel/l3unicode.dtx, and is read as part of the LaTeX2e format-building process.
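The UCD files that such a loader reads share a simple line format: ASCII text, semicolon-separated fields, `#` comments, with the first field giving a codepoint or a codepoint range. The following is a hypothetical parsing sketch in Python, not the actual expl3 code in l3unicode.dtx; the function name `parse_ucd_line` is invented for illustration.

```python
def parse_ucd_line(line: str):
    """Parse one UCD-style data line into (start, end, fields).

    Returns None for blank or comment-only lines. The first field
    is a hex codepoint, or a range written as '0041..005A'.
    """
    line = line.split("#", 1)[0].strip()  # drop trailing comment
    if not line:
        return None
    fields = [f.strip() for f in line.split(";")]
    cp = fields[0]
    if ".." in cp:
        start, end = (int(x, 16) for x in cp.split(".."))
    else:
        start = end = int(cp, 16)
    return start, end, fields[1:]

print(parse_ucd_line("0041..005A ; Lu # LATIN CAPITAL LETTER A..Z"))
# → (65, 90, ['Lu'])
```

A TeX-side loader faces the same parsing task but must then store the result in structures TeX can query efficiently, which is the storage question the article addresses.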