A New Strategy for Arabic OCR: Archigraphemes, Letter Blocks, Script Grammar, and shape synthesis
Authors: Thomas Milo, A. Martínez
Published in: Proceedings of the 3rd International Conference on Digital Access to Textual Cultural Heritage, 2019-05-08
DOI: 10.1145/3322905.3322928 (https://doi.org/10.1145/3322905.3322928)
Citations: 5
Abstract
Current OCR has limited capability for Arabic because the underlying script models lack a scientific basis. We propose a new OCR strategy for Arabic, based on (1) Islamic script grammar, including extended shaping, and (2) treating Arabic script as a multi-layered writing system. We analyse Arabic script as an allographic rendering of graphemic abstractions. Grapheme is a term adapted from phonology; it is analogous to the term phoneme. In phonology, the smallest functional unit of sound is the phoneme. This is not heard, but perceived. What one hears are contextually conditioned allophones. In Arabic orthography, the smallest functional unit of spelling is the grapheme. This is not seen, but perceived. What one sees are contextually conditioned allographs. In our analysis, the letter block is the minimum unit of Arabic script formation and therefore of script grammar. A letter block is a single allograph or a group of fused allographs surrounded by graphic space. The analogy with phonology can be pushed further: the archiphoneme is the bundle of features shared by two or more phonemes, minus their distinctive features. Likewise, the archigrapheme is the bundle of features shared by two or more graphemes, minus their distinctive features. An archigraphemic letter block consists of one or more reduced allographs between spaces. The letter block follows the base line. There can be ligatures between letter blocks. In our strategy the archigraphemic letter block also forms the minimum unit of OCR. First, we have implemented an algorithm that reduces any Unicode text in Arabic script to archigraphemes, and we used it to create a list in Unicode format of all unique archigraphemic letter blocks attested on the internet. Second, with this list, and by applying extended Islamic script grammar, we can synthesize realistic images of all possible archigraphemic fusions in a given style.
These two developments make it possible to create an OCR system that recognizes synthetic Arabic under controlled conditions, for both basic and extended shaping in a given style. These two steps result in competence, after which the OCR system should be trained to tolerate the variation of performance found in real documents. To interpret the identified letter blocks linguistically, a technique for parsing archigraphemes must be developed. For example, the single sequence of three archigraphemic letter blocks EBD A LLH can be interpreted as several different surface texts, such as abda-n li llaahi, abdu l-laahi and inda l-laahi. To facilitate the linguistic phase of the process, the same list of unique archigraphemic letter blocks is designed to identify the language of the text under scrutiny. In this phase we can present:
• Islamic script synthesis
• Unicode conversion from plene orthography to archigraphemic transliteration
• the archigraphemic search algorithm
• the list of unique archigraphemic letter blocks
• samples of authentic shape generation
These are the first steps towards static OCR technology. The next step is to create or find matching AI software to teach the OCR to recognize any unmapped letter blocks, making the OCR dynamic.
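The reduction of Unicode text to archigraphemic letter blocks described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the letter table is deliberately partial, the Latin labels merely imitate the abstract's EBD A LLH example, and the rule that noon and yeh merge with the B skeleton except at the end of a letter block is an assumption of this sketch. The names GROUPS, reduce_block and letter_blocks are hypothetical.

```python
import unicodedata

# Hypothetical, partial table: archigrapheme label -> letters sharing that skeleton.
# The labels imitate the abstract's "EBD A LLH" example and are an assumption.
GROUPS = {
    "A": "\u0627",              # alef
    "B": "\u0628\u062A\u062B",  # beh, teh, theh
    "G": "\u062C\u062D\u062E",  # jeem, hah, khah
    "D": "\u062F\u0630",        # dal, thal
    "R": "\u0631\u0632",        # reh, zain
    "S": "\u0633\u0634",        # seen, sheen
    "E": "\u0639\u063A",        # ain, ghain
    "L": "\u0644",              # lam
    "M": "\u0645",              # meem
    "H": "\u0647",              # heh
    "W": "\u0648",              # waw
}
SKELETON = {ch: label for label, chars in GROUPS.items() for ch in chars}
NOON, YEH = "\u0646", "\u064A"
# Letters that never connect to the following letter, so they close a letter block.
NON_JOINING = frozenset("\u0627\u062F\u0630\u0631\u0632\u0648")


def reduce_block(block):
    """Map one letter block (a string of base letters) to skeleton labels."""
    out = []
    for i, ch in enumerate(block):
        is_last = i == len(block) - 1
        if ch == NOON:
            # Assumption: noon keeps its own shape only at the end of a block.
            out.append("N" if is_last else "B")
        elif ch == YEH:
            # Assumption: same positional rule for yeh.
            out.append("Y" if is_last else "B")
        elif ch in SKELETON:
            out.append(SKELETON[ch])
        else:
            raise ValueError(f"letter U+{ord(ch):04X} is outside this partial table")
    return "".join(out)


def letter_blocks(text):
    """Split text into letter blocks and reduce each one to its skeleton."""
    # NFD separates e.g. alef-with-hamza into alef + combining hamza, so that
    # dropping nonspacing marks (category Mn) removes vowel signs and hamza alike.
    decomposed = unicodedata.normalize("NFD", text)
    blocks, current = [], ""
    for ch in decomposed:
        if unicodedata.category(ch) == "Mn":
            continue
        if ch.isspace():
            if current:
                blocks.append(current)
                current = ""
            continue
        current += ch
        if ch in NON_JOINING:
            blocks.append(current)
            current = ""
    if current:
        blocks.append(current)
    return [reduce_block(b) for b in blocks]


# The three readings from the abstract, spelled without vowel marks.
PHRASES = (
    "\u0639\u0628\u062F\u0627 \u0644\u0644\u0647",  # abda-n li llaahi
    "\u0639\u0628\u062F \u0627\u0644\u0644\u0647",  # abdu l-laahi
    "\u0639\u0646\u062F \u0627\u0644\u0644\u0647",  # inda l-laahi
)
for phrase in PHRASES:
    print(letter_blocks(phrase))
```

Under these assumptions all three phrases print ['EBD', 'A', 'LLH'], which is why a separate linguistic parsing step is needed to choose among the surface readings. Word spaces and non-joining letters are both treated as block boundaries, so the first two phrases yield the same blocks even though the alef belongs to a different word in each.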