Nhan Thien Nguyen, Dang Minh Nguyen, A. D. Le, T. Quan
2021 13th International Conference on Knowledge and Systems Engineering (KSE), published 2021-11-10. DOI: 10.1109/KSE53942.2021.9648643 (https://doi.org/10.1109/KSE53942.2021.9648643)
Recognizing modern Japanese magazines by combining Deep Learning with language models
Japan, one of the most culturally rich countries in the world, has a long history of magazine publishing. In modern Japanese magazines, published during the 19th and 20th centuries, the written language is close to contemporary Japanese. However, most of these documents have not been digitized as text and exist only as images. Because of their importance to the study of Japanese culture, history, and other social-science topics, the problem of automatically recognizing the text in these image-based magazines has been widely studied, with a variety of Deep Learning methods proposed. However, these methods still struggle to achieve strong performance on handwriting images, especially for uncommon Kanji characters. In this research, we address this problem by developing a deep learning-based language model and integrating it into an existing OCR system for modern Japanese magazine documents. We also propose a strategy for combining the existing Japanese OCR tool with our language model. The strategy learns where the system should rely on the OCR output (e.g., Hiragana and common Kanji characters, which OCR usually recognizes correctly) and where it should rely on the language model (e.g., uncommon Kanji characters, which OCR frequently misrecognizes). In experiments on real data, our method shows a visible improvement.
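
The idea of choosing, character by character, between the OCR output and a language model can be illustrated with a minimal Python sketch. This is not the authors' method: the abstract describes a learned combination strategy, whereas the sketch substitutes a fixed rule (trust OCR for Hiragana and a given set of common Kanji when the OCR confidence is high, otherwise ask the language model). The names `OcrChar`, `combine`, `toy_lm`, the `confidence` field, the `common_kanji` set, and the 0.9 threshold are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass
class OcrChar:
    char: str          # character proposed by the OCR tool
    confidence: float  # OCR confidence in [0, 1] (hypothetical field)


def is_hiragana(ch: str) -> bool:
    # Unicode Hiragana block: U+3040..U+309F
    return "\u3040" <= ch <= "\u309f"


def combine(
    ocr_line: List[OcrChar],
    lm_predict: Callable[[str, str], str],
    common_kanji: Set[str],
    threshold: float = 0.9,  # assumed cut-off, not taken from the paper
) -> str:
    """Keep the OCR character when it is trusted, else defer to the LM."""
    out: List[str] = []
    for item in ocr_line:
        trusted_class = is_hiragana(item.char) or item.char in common_kanji
        if trusted_class and item.confidence >= threshold:
            out.append(item.char)
        else:
            # The LM sees the text decided so far plus the OCR guess.
            out.append(lm_predict("".join(out), item.char))
    return "".join(out)


# Toy stand-in for a language model: a lookup keyed on
# (previous character, OCR guess). A real model would score candidates.
CORRECTIONS = {("の", "維"): "雑"}


def toy_lm(left_context: str, ocr_guess: str) -> str:
    return CORRECTIONS.get((left_context[-1:], ocr_guess), ocr_guess)


# OCR misreads the fourth character as "維" with low confidence.
ocr_line = [
    OcrChar("日", 0.99),
    OcrChar("本", 0.98),
    OcrChar("の", 0.99),
    OcrChar("維", 0.40),
    OcrChar("誌", 0.95),
]
result = combine(ocr_line, toy_lm, common_kanji={"日", "本", "誌"})
print(result)
```

In this toy run, the three high-confidence characters and the final Kanji are taken from OCR, and only the low-confidence, out-of-set character is passed to the language model stand-in. Replacing the fixed rule with a trained classifier would bring the sketch closer in spirit to the learned strategy the abstract describes.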