A. M. Abbas, M. S. Hameed, S. Balakrishnan, K. Anandh
2022 International Conference on Automation, Computing and Renewable Systems (ICACRS), published 2022-12-13. DOI: https://doi.org/10.1109/ICACRS55517.2022.10029142
Intelligent Document Finding using Optical Character Recognition and Tagging
In the era of digitalization, organizing and searching large volumes of documents is becoming increasingly important for enterprises seeking to improve their output and practices. Optical Character Recognition (OCR) is the process of identifying text in scanned (image-based) documents. This paper aims to deliver seamless searching of documents in file systems using OCR and Natural Language Processing (NLP). The approach consists of the following phases: Text Identification (for text files), Image Capturing, Image Enhancement, Image Identification, OCR, Data Extraction, and Quality Assurance. For text files, data extraction is completed in the first phase. The document management system processes both structured document images (those with a standard format) and unstructured document images (those without one). In the tagging phase, the document is divided into segments, and tags for each segment are generated using NLP.
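The tagging and lookup steps described in the abstract can be illustrated with a minimal sketch. The abstract does not say which NLP technique generates the tags, so the code below substitutes a simple word-frequency heuristic, splits segments on blank lines, and assumes the text has already been extracted (by OCR or directly from a text file). All function names, the stopword list, and the sample documents are hypothetical and are not taken from the paper.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; a real system would use a fuller one).
STOPWORDS = frozenset({
    "a", "an", "and", "are", "as", "at", "by", "for", "from",
    "in", "is", "it", "of", "on", "or", "the", "to", "with",
})


def split_segments(text):
    """Split extracted text into segments on blank lines (one possible segmentation)."""
    return [s.strip() for s in re.split(r"\n\s*\n", text) if s.strip()]


def tag_segment(segment, max_tags=3):
    """Return the most frequent non-stopword terms of a segment as its tags."""
    words = re.findall(r"[a-z]+", segment.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    # Sort by descending count, then alphabetically, so the output is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:max_tags]]


def build_tag_index(documents):
    """Map each tag to the set of (document name, segment number) pairs that carry it."""
    index = {}
    for name, text in documents.items():
        for number, segment in enumerate(split_segments(text)):
            for tag in tag_segment(segment):
                index.setdefault(tag, set()).add((name, number))
    return index


def find_documents(index, query):
    """Return the sorted names of documents with a segment tagged by the query term."""
    return sorted({name for name, _ in index.get(query.lower(), set())})


# Hypothetical extracted text standing in for OCR output.
documents = {
    "invoice.txt": (
        "Invoice total due. Invoice number and invoice date.\n\n"
        "Payment terms: payment within thirty days."
    ),
    "memo.txt": "Meeting agenda for the project meeting.",
}
index = build_tag_index(documents)
print(find_documents(index, "invoice"))
print(find_documents(index, "meeting"))
```

Indexing tags per segment rather than per document lets a search point to the part of a file that matched. A production system would likely replace the frequency heuristic with a proper keyword-extraction or classification model.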