Aarush Gupta, Akhil Chawla, K. S. Shushrutha, Mohana
2022 3rd International Conference on Smart Electronics and Communication (ICOSEC), published 2022-10-20
DOI: 10.1109/ICOSEC54921.2022.9951912
https://doi.org/10.1109/ICOSEC54921.2022.9951912
Intelligent Information Retrieval: Handling Variability in Document Structure
Every corporation’s day-to-day activities involve handling a wide range of document types, such as work orders, techno’s, and maintenance papers, many of which are selectable or scanned PDFs. Extracting the necessary data from these documents for further processing and analysis demands several hours of human labor and imposes a significant financial cost on these corporations. There is therefore considerable potential for a digital solution built on a sophisticated OCR implementation that automates the entire information extraction process. This paper examines the information extraction process with the aim of delivering a high-quality, complete, functional solution. The proposed solution incorporates the preprocessing that is critical for accurate information extraction, uses Faster R-CNN for document layout analysis, and applies a range of extraction approaches chosen according to data type. The multistage document analysis and information extraction tool also supports template definition, so that templates can be reused to batch process large volumes of unstructured data.
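The template idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes layout detection (Faster R-CNN in the paper) and OCR have already run and produced words with bounding boxes, and every name in it (Word, Field, extract, the three data types, the sample coordinates) is hypothetical. It only shows how a template of named regions, each tagged with a data type, could be defined once and reused across a batch of pages.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in page pixels


@dataclass(frozen=True)
class Word:
    """One recognized word and its box, as an upstream OCR step might return it."""
    text: str
    box: Box


@dataclass(frozen=True)
class Field:
    """A template entry: a named page region and the kind of data expected there."""
    name: str
    region: Box
    data_type: str  # "text", "number" or "date" in this sketch


def inside(box: Box, region: Box) -> bool:
    """Return True if the centre of box lies within region."""
    cx = (box[0] + box[2]) / 2
    cy = (box[1] + box[3]) / 2
    return region[0] <= cx <= region[2] and region[1] <= cy <= region[3]


def parse_text(raw: str):
    return raw.strip()


def parse_number(raw: str):
    # Drop thousands separators, then take the first decimal number found.
    match = re.search(r"-?\d+(?:\.\d+)?", raw.replace(",", ""))
    return float(match.group()) if match else None


def parse_date(raw: str):
    # Assumes ISO dates (YYYY-MM-DD); real documents would need more formats.
    match = re.search(r"\d{4}-\d{2}-\d{2}", raw)
    return match.group() if match else None


# Extraction strategy chosen per data type.
PARSERS: Dict[str, Callable[[str], object]] = {
    "text": parse_text,
    "number": parse_number,
    "date": parse_date,
}


def extract(words: List[Word], template: List[Field]) -> Dict[str, object]:
    """Apply one template to one page of recognized words."""
    result: Dict[str, object] = {}
    for field in template:
        hits = [w for w in words if inside(w.box, field.region)]
        # Naive reading order: by top edge, then left edge.
        hits.sort(key=lambda w: (w.box[1], w.box[0]))
        raw = " ".join(w.text for w in hits)
        result[field.name] = PARSERS[field.data_type](raw)
    return result


# Hypothetical template and page, for illustration only.
template = [
    Field("order_id", (0, 0, 300, 40), "text"),
    Field("date", (300, 0, 600, 40), "date"),
    Field("total", (300, 700, 600, 760), "number"),
]
page = [
    Word("WO-1042", (20, 10, 120, 30)),
    Word("2022-10-20", (320, 10, 450, 30)),
    Word("Total:", (310, 710, 380, 740)),
    Word("1,250.50", (390, 710, 480, 740)),
]

# The same template is reused for every page in a batch.
batch_results = [extract(p, template) for p in [page, page]]
```

Keeping the regions and data types in the template, separate from the parsing functions, is what would let the same definition be applied to many documents that share a layout.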