An unsupervised automatic organization method for Professor Shirakawa’s hand-notated documents of oracle bone inscriptions

IF 1.8 · Zone 4 (Computer Science) · JCR Q3 (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE)
Xuebin Yue, Ziming Wang, Ryuto Ishibashi, Hayata Kaneko, Lin Meng
International Journal on Document Analysis and Recognition · Journal Article · Published 2024-03-05 · DOI: https://doi.org/10.1007/s10032-024-00463-0
Citations: 0

Abstract

As one of the most influential researchers of Chinese culture in the second half of the twentieth century, Professor Shirakawa was active in the study of ancient Chinese characters. He left behind many valuable research documents, most notably his hand-notated documents of oracle bone inscriptions (OBIs). OBIs are among the world’s oldest written characters and were used in the Shang Dynasty, about 3600 years ago, for divination and for recording events. Organizing these OBIs not only helps in understanding Prof. Shirakawa’s research but also supports further study of OBIs in general and of their importance in ancient Chinese history. This paper proposes an unsupervised automatic organization method to organize Prof. Shirakawa’s OBIs and to construct a handwritten OBI dataset for neural network learning. First, a suite of noise reduction methods is proposed to remove strangely shaped noise while reducing the loss of OBI data. Second, a novel segmentation method based on the supervised classification of OBI regions is proposed to reduce adverse effects between characters and so segment OBIs more accurately. Third, a unique unsupervised clustering method is proposed to group the segmented characters. Finally, all instances of the same character in the hand-notated OBI documents are organized together. The evaluation results show that the proposed noise reduction removes noise, including number information and closed-loop-like edges in the dataset, with an accuracy of 97.85%. In addition, our model achieves an accuracy of 85.50% in the supervised classification of OBI regions, which is higher than eight state-of-the-art deep learning models, and a dedicated preprocessing method we propose improves the classification accuracy by nearly 11.50%. OBI clustering based on the supervised classification achieves an accuracy of 74.91%. These results demonstrate the effectiveness of the proposed unsupervised automatic organization of Prof. Shirakawa’s hand-notated OBI documents.
The code and datasets are available at http://www.ihpc.se.ritsumei.ac.jp/obidataset.html.
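
The first two stages named in the abstract (removing noise, then separating characters) can be illustrated with a generic connected-component pass. The following is a minimal sketch and not the authors' method: it assumes a binarized page stored as a list of 0/1 rows, treats any 4-connected ink blob smaller than a hypothetical `min_pixels` threshold as noise, and returns one bounding box per surviving blob. The function names, the threshold, and the toy page are all illustrative assumptions.

```python
from collections import deque


def connected_components(grid):
    """Return a list of components; each is a list of (row, col) ink pixels."""
    rows = len(grid)
    cols = len(grid[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill over 4-connected neighbours.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and grid[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            components.append(pixels)
    return components


def remove_noise_and_segment(grid, min_pixels=3):
    """Drop blobs smaller than min_pixels (assumed noise); return the
    (top, left, bottom, right) bounding box of each remaining blob."""
    boxes = []
    for pixels in connected_components(grid):
        if len(pixels) < min_pixels:
            continue  # hypothetical size rule: tiny blobs count as noise
        ys = [y for y, _ in pixels]
        xs = [x for _, x in pixels]
        boxes.append((min(ys), min(xs), max(ys), max(xs)))
    return boxes


# Toy binarized page: two ink blobs plus one isolated speck at (1, 3).
page = [
    [1, 1, 0, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 1, 1],
]
boxes = remove_noise_and_segment(page, min_pixels=3)
print(boxes)
```

A size threshold alone would not handle the number annotations and closed-loop-like edges that the abstract mentions, and a real character usually consists of several disconnected strokes, which is presumably why the paper relies on region classification rather than raw connected components. The clustering stage is not shown here.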

Source journal
International Journal on Document Analysis and Recognition (Engineering & Technology, Computer Science: Artificial Intelligence)
CiteScore: 6.20
Self-citation rate: 4.30%
Articles published: 30
Review time: 7.5 months
Journal description: The large number of existing documents and the production of a multitude of new ones every year raise important issues in the efficient handling, retrieval and storage of these documents and the information they contain. This has led to the emergence of new research domains dealing with the recognition by computers of the constituent elements of documents, including characters, symbols, text, lines, graphics, images, handwriting, signatures, etc. In addition, these new domains deal with automatic analyses of the overall physical and logical structures of documents, with the ultimate objective of a high-level understanding of their semantic content. We have also seen renewed interest in optical character recognition (OCR) and handwriting recognition during the last decade. Document analysis and recognition are obviously the next stage. Automatic, intelligent processing of documents is at the intersection of many fields of research, especially computer vision, image analysis, pattern recognition and artificial intelligence, as well as studies on reading, handwriting and linguistics. Although quality document-related publications continue to appear in journals dedicated to these domains, the community will benefit from having this journal as a focal point for archival literature dedicated to document analysis and recognition.