OBC306: A Large-Scale Oracle Bone Character Recognition Dataset

Shuangping Huang, Haobin Wang, Yong-ge Liu, Xiaosong Shi, Lianwen Jin
2019 International Conference on Document Analysis and Recognition (ICDAR), published 2019-09-01
DOI: 10.1109/ICDAR.2019.00114 (https://doi.org/10.1109/ICDAR.2019.00114)
Citations: 20

Abstract

The oracle bone script from ancient China is among the world's most famous ancient writing systems. Identifying and deciphering oracle bone scripts is one of the most important topics in oracle bone study and requires a deep familiarity with the culture of ancient China. This task remains very challenging for two reasons. First, it is carried out mainly by humans and requires a high level of experience, aptitude, and commitment. Second, the scarcity of domain-specific data hinders the advancement of automatic recognition research. A collection of well-labeled oracle-bone data is necessary to bridge the oracle bone and information processing fields; however, such a dataset has not yet been presented. Hence, in this paper, we construct a new large-scale dataset of oracle bone characters called OBC306. We also present a standard deep convolutional neural network-based evaluation on this dataset to serve as a benchmark. Through statistical and visual analyses, we describe the inherent difficulties of oracle bone recognition and propose future challenges for and extensions of oracle bone study using information processing. The dataset contains more than 300,000 character-level samples cropped from oracle-bone rubbings or images. It covers 306 glyph classes and is, to the best of our knowledge, the largest existing raw oracle-bone character set. We anticipate that the publication of this dataset will facilitate the development of oracle bone research and lead to optimal algorithmic solutions.