Recognizing modern Japanese magazines by combining Deep Learning with language models

2021 13th International Conference on Knowledge and Systems Engineering (KSE) Pub Date : 2021-11-10 DOI:10.1109/KSE53942.2021.9648643

Nhan Thien Nguyen, Dang Minh Nguyen, A. D. Le, T. Quan

Abstract

Japan, one of the most culturally rich countries in the world, also has a rich history of magazines. In modern Japanese magazines, published during the 19th and 20th centuries, Japanese usage is similar to the style of the contemporary language. However, most of these documents have not been digitized and are stored only as images. Because of their importance to Japanese culture, history, and other socio-scientific topics, the problem of recognizing these image-based modern magazines by computer has been widely studied using a range of Deep Learning methods. These methods still fall short of strong performance when recognizing handwriting images, especially uncommon Kanji characters. In this research, we address this problem by developing a deep learning-based language model and integrating it into the current OCR system for modern Japanese magazine documents. We also propose a strategy for combining the current Japanese OCR tool with our language model. The strategy learns where the system should rely on OCR (e.g., Hiragana and common Kanji characters, which OCR recognizes correctly) and where it should rely on the language model (e.g., uncommon Kanji characters, which OCR frequently recognizes incorrectly). In experiments on real data, our method shows a visible improvement.
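
The combination strategy described in the abstract can be illustrated with a minimal sketch. The abstract says the strategy is learned, but gives no details, so the fixed confidence threshold below is a stand-in assumption and not the authors' method. The names `CharCandidate`, `combine`, and `threshold`, as well as the example confidences and probabilities, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CharCandidate:
    """One character position in a recognized line (hypothetical structure)."""
    ocr_char: str                     # character proposed by the OCR tool
    ocr_conf: float                   # OCR confidence in [0, 1]
    lm_probs: Dict[str, float] = field(default_factory=dict)  # LM distribution


def combine(candidates: List[CharCandidate], threshold: float = 0.8) -> str:
    """Keep the OCR output where it is confident; otherwise use the LM's top choice.

    Assumption: a fixed threshold replaces the learned decision rule.
    """
    result = []
    for cand in candidates:
        if cand.ocr_conf >= threshold or not cand.lm_probs:
            # Trust OCR (e.g., Hiragana and common Kanji), or no LM alternative exists.
            result.append(cand.ocr_char)
        else:
            # Low OCR confidence (e.g., uncommon Kanji): take the LM's most likely character.
            result.append(max(cand.lm_probs, key=cand.lm_probs.get))
    return "".join(result)


# Illustrative input: OCR is confident about the first character and
# unsure about the second, which it misreads as a similar-looking Kanji.
line = [
    CharCandidate("雑", 0.95),
    CharCandidate("読", 0.40, {"誌": 0.9, "読": 0.1}),
]
corrected = combine(line)
```

In this sketch the first character is taken from OCR and the second from the language model. A learned strategy, as the abstract describes, would replace the threshold test with a model that decides per character which source to trust.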
