Image quality determination of palm leaf heritage documents using integrated discrete cosine transform features with vision transformer

IF 2.5 4区计算机科学 Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

International Journal on Document Analysis and Recognition Pub Date : 2024-07-17 DOI:10.1007/s10032-024-00490-x

Remya Sivan, Peeta Basa Pati, Made Windu Antara Kesiman

{"title":"Image quality determination of palm leaf heritage documents using integrated discrete cosine transform features with vision transformer","authors":"Remya Sivan, Peeta Basa Pati, Made Windu Antara Kesiman","doi":"10.1007/s10032-024-00490-x","DOIUrl":null,"url":null,"abstract":"<p>Classification of Palm leaf images into various quality categories is an important step towards the digitization of these heritage documents. Manual inspection and categorization is not only laborious, time-consuming and costly but also subject to inspector’s biases and errors. This study aims to automate the classification of palm leaf document images into three different visual quality categories. A comparative analysis between various structural and statistical features and classifiers against deep neural networks is performed. VGG16, VGG19 and ResNet152v2 architectures along with a custom CNN model are used, while Discrete Cosine Transform (DCT), Grey Level Co-occurrence Matrix (GLCM), Tamura, and Histogram of Gradient (HOG) are chosen from the traditional methods. Based on these extracted features, various classifiers, namely, k-Nearest Neighbors (k-NN), multi-layer perceptron (MLP), Support Vector Machines (SVM), Decision Tree (DT) and Logistic Regression (LR) are trained and evaluated. Accuracy, precision, recall, and F1 scores are used as performance metrics for the evaluation of various algorithms. Results demonstrate that CNN embeddings and DCT features have emerged as superior features. Based on these findings, we integrated DCT with a Vision Transformer (ViT) for the document classification task. The result illustrates that this incorporation of DCT with ViT outperforms all other methods with 96% train F1 score and a test F1 score of 90%.</p>","PeriodicalId":50277,"journal":{"name":"International Journal on Document Analysis and Recognition","volume":"49 1","pages":""},"PeriodicalIF":2.5000,"publicationDate":"2024-07-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Journal on Document Analysis and Recognition","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s10032-024-00490-x","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

Abstract

Classification of Palm leaf images into various quality categories is an important step towards the digitization of these heritage documents. Manual inspection and categorization is not only laborious, time-consuming and costly but also subject to inspector’s biases and errors. This study aims to automate the classification of palm leaf document images into three different visual quality categories. A comparative analysis between various structural and statistical features and classifiers against deep neural networks is performed. VGG16, VGG19 and ResNet152v2 architectures along with a custom CNN model are used, while Discrete Cosine Transform (DCT), Grey Level Co-occurrence Matrix (GLCM), Tamura, and Histogram of Gradient (HOG) are chosen from the traditional methods. Based on these extracted features, various classifiers, namely, k-Nearest Neighbors (k-NN), multi-layer perceptron (MLP), Support Vector Machines (SVM), Decision Tree (DT) and Logistic Regression (LR) are trained and evaluated. Accuracy, precision, recall, and F1 scores are used as performance metrics for the evaluation of various algorithms. Results demonstrate that CNN embeddings and DCT features have emerged as superior features. Based on these findings, we integrated DCT with a Vision Transformer (ViT) for the document classification task. The result illustrates that this incorporation of DCT with ViT outperforms all other methods with 96% train F1 score and a test F1 score of 90%.

Abstract Image

查看原文本刊更多论文

利用视觉变换器综合离散余弦变换特征确定棕榈叶遗产文件的图像质量

将棕榈叶图像分为不同的质量类别是实现这些遗产文件数字化的重要一步。人工检查和分类不仅费力、费时、费钱，而且会受到检查人员偏见和错误的影响。本研究旨在将棕榈叶文献图像自动分类为三种不同的视觉质量类别。本研究对各种结构和统计特征以及分类器与深度神经网络进行了比较分析。使用了 VGG16、VGG19 和 ResNet152v2 架构以及自定义 CNN 模型，并从传统方法中选择了离散余弦变换 (DCT)、灰度共现矩阵 (GLCM)、Tamura 和梯度直方图 (HOG)。根据这些提取的特征，对各种分类器，即 k-Nearest Neighbors (k-NN)、multi-layer perceptron (MLP)、Support Vector Machines (SVM)、Decision Tree (DT) 和 Logistic Regression (LR) 进行了训练和评估。准确率、精确度、召回率和 F1 分数被用作评估各种算法的性能指标。结果表明，CNN 嵌入和 DCT 特征是最优秀的特征。基于这些发现，我们将 DCT 与视觉变换器 (ViT) 集成到文档分类任务中。结果表明，DCT 与 ViT 的结合优于所有其他方法，训练 F1 得分为 96%，测试 F1 得分为 90%。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

International Journal on Document Analysis and Recognition 工程技术-计算机：人工智能

CiteScore

6.20

自引率

4.30%

发文量

审稿时长

7.5 months

期刊介绍： The large number of existing documents and the production of a multitude of new ones every year raise important issues in efficient handling, retrieval and storage of these documents and the information which they contain. This has led to the emergence of new research domains dealing with the recognition by computers of the constituent elements of documents - including characters, symbols, text, lines, graphics, images, handwriting, signatures, etc. In addition, these new domains deal with automatic analyses of the overall physical and logical structures of documents, with the ultimate objective of a high-level understanding of their semantic content. We have also seen renewed interest in optical character recognition (OCR) and handwriting recognition during the last decade. Document analysis and recognition are obviously the next stage. Automatic, intelligent processing of documents is at the intersections of many fields of research, especially of computer vision, image analysis, pattern recognition and artificial intelligence, as well as studies on reading, handwriting and linguistics. Although quality document related publications continue to appear in journals dedicated to these domains, the community will benefit from having this journal as a focal point for archival literature dedicated to document analysis and recognition.