{"title":"Storing Unicode data in TeX engines","authors":"Joseph Wright","doi":"10.47397/tb/44-1/tb136wright-unidata","DOIUrl":null,"url":null,"abstract":"Unicode has become established over the past three decades as the international standard for representing text in computer systems. By far the most common input encoding in use today is UTF-8, in which Unicode text is represented by a variable number of bytes: between one and four. Unicode deals with codepoints: a numerical representation for each character. There are in principle 1 114 112 codepoints available, although not all are currently assigned and some of these are reserved for ‘private use’ for ad hoc requirements. Each codepoint has many different properties. For example, depending on our application, we might need to know whether a codepoint is a (lower case) letter, how it should be treated at a line break, how its width is treated (for East Asian characters), etc. Unicode provides a range of data files which tabulate this information. These files are human-readable and are, in the main, purely ASCII text: they are therefore not tied to any particular programming language for usage. The full set of files is available from unicode.org/Public/UCD/latest/ucd/: the complete current set as a zip is around 6.7MiB. There are of course standard libraries for common programming languages such as C which both load this data and provide implementations of the algorithms which use this data: things like changing case, breaking text into lines and so on. However, these are not readily available to us as TEX programmers. Thus, if we want to be able to properly implement Unicode algorithms, we will need to look at how to load the relevant data and store it within TEX in an efficient manner. Here, I will focus on how the LTEX team is approaching the data storage challenge. 
I will show how the particular requirements of implementing in TEX mean we need to use a mix of approaches, depending on exactly which data we are looking at. The current implementation for loading this data in expl3 is available at github.com/latex3/latex3/ blob/main/l3kernel/l3unicode.dtx, and is read as part of the LTEX2ε format-building process.","PeriodicalId":93390,"journal":{"name":"TUGboat (Providence, R.I.)","volume":"1 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"TUGboat (Providence, R.I.)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.47397/tb/44-1/tb136wright-unidata","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Unicode has become established over the past three decades as the international standard for representing text in computer systems. By far the most common input encoding in use today is UTF-8, in which Unicode text is represented by a variable number of bytes: between one and four. Unicode deals with codepoints: a numerical representation for each character. There are in principle 1 114 112 codepoints available, although not all are currently assigned and some of these are reserved for ‘private use’ for ad hoc requirements. Each codepoint has many different properties. For example, depending on our application, we might need to know whether a codepoint is a (lower case) letter, how it should be treated at a line break, how its width is treated (for East Asian characters), etc. Unicode provides a range of data files which tabulate this information. These files are human-readable and are, in the main, purely ASCII text: they are therefore not tied to any particular programming language for usage. The full set of files is available from unicode.org/Public/UCD/latest/ucd/: the complete current set as a zip is around 6.7 MiB. There are of course standard libraries for common programming languages such as C which both load this data and provide implementations of the algorithms which use this data: things like changing case, breaking text into lines and so on. However, these are not readily available to us as TeX programmers. Thus, if we want to be able to properly implement Unicode algorithms, we will need to look at how to load the relevant data and store it within TeX in an efficient manner. Here, I will focus on how the LaTeX team is approaching the data storage challenge. I will show how the particular requirements of implementing in TeX mean we need to use a mix of approaches, depending on exactly which data we are looking at.
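The two facts the abstract leans on, that UTF-8 uses one to four bytes per codepoint and that each codepoint carries tabulated properties, can be illustrated with a short sketch. This uses Python's standard `unicodedata` module purely as a stand-in for the UCD data files discussed in the article; it is not part of the TeX-side implementation.

```python
import unicodedata

# UTF-8 encodes each codepoint in one to four bytes,
# depending on the codepoint's magnitude.
for ch in ["A", "é", "€", "𝕏"]:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}: {len(encoded)} byte(s)")

# Each codepoint has properties tabulated in the UCD, e.g. its
# general category ('Ll' = lowercase letter) and East Asian width.
print(unicodedata.category("a"))           # 'Ll'
print(unicodedata.east_asian_width("界"))  # 'W' (wide)
```

The four sample characters span exactly the one- to four-byte ranges of UTF-8 (Basic Latin, Latin-1, the BMP beyond U+07FF, and a supplementary-plane character).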
The current implementation for loading this data in expl3 is available at github.com/latex3/latex3/blob/main/l3kernel/l3unicode.dtx, and is read as part of the LaTeX2e format-building process.
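The UCD files that such a loader reads share a simple line format: ASCII text, semicolon-separated fields, `#` comments, with the first field giving a codepoint or a codepoint range. The following is a hypothetical parsing sketch in Python, not the actual expl3 code in l3unicode.dtx; the function name `parse_ucd_line` is invented for illustration.

```python
def parse_ucd_line(line: str):
    """Parse one UCD-style data line into (start, end, fields).

    Returns None for blank or comment-only lines. The first field
    is a hex codepoint, or a range written as '0041..005A'.
    """
    line = line.split("#", 1)[0].strip()  # drop trailing comment
    if not line:
        return None
    fields = [f.strip() for f in line.split(";")]
    cp = fields[0]
    if ".." in cp:
        start, end = (int(x, 16) for x in cp.split(".."))
    else:
        start = end = int(cp, 16)
    return start, end, fields[1:]

print(parse_ucd_line("0041..005A ; Lu # LATIN CAPITAL LETTER A..Z"))
# → (65, 90, ['Lu'])
```

A TeX-side loader faces the same parsing task but must then store the result in structures TeX can query efficiently, which is the storage question the article addresses.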