A Study on Information Extraction: Application to Administrative Document Images
Huu Thang Nguyen, Cong Linh Le, Hoai-Nam Tran, T. A. Tran
2022 9th NAFOSTED Conference on Information and Computer Science (NICS), published 2022-10-31
DOI: 10.1109/NICS56915.2022.10013381 (https://doi.org/10.1109/NICS56915.2022.10013381)
Abstract
This paper studies the problem of information extraction and applies it to building an information extraction system for administrative documents. The proposed end-to-end system comprises three main modules: text detection (TD), optical character recognition (OCR), and information extraction (IE). We developed the IE module ourselves, building on two graph neural network models, GraphSAGE and graph attention networks (GATs), and made several changes, such as redesigning the graph modeling and node representation to fit the goals and problems at hand. While we studied how to build the complete system, we examined the IE module in depth rather than treating all modules equally. We also built and evaluated our own dataset, Vietnamese Administrative Documents Images (VADI2021).
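
To make the idea of graph modeling over OCR output concrete, the following is a minimal sketch in plain Python, not the authors' implementation. The abstract does not say how the graph is built or what the node representation contains, so everything here is an assumption for illustration: the `TextBox` type, the k-nearest-neighbour edge rule in `knn_edges`, the toy features in `node_features`, and the hand-picked weights. The layer follows the general GraphSAGE pattern with mean aggregation (combine a node's own features with the mean of its neighbours' features, then apply ReLU) and omits neighbour sampling and normalisation.

```python
import math
from dataclasses import dataclass


@dataclass
class TextBox:
    """Hypothetical container for one detected and recognised text region."""
    text: str   # OCR output for the region
    x: float    # box centre, horizontal (normalised to 0..1)
    y: float    # box centre, vertical (normalised to 0..1)


def knn_edges(boxes, k=2):
    """Assumed edge rule: link each box to its k nearest boxes by centre distance."""
    neighbours = {}
    for i, a in enumerate(boxes):
        dists = sorted(
            (math.hypot(a.x - b.x, a.y - b.y), j)
            for j, b in enumerate(boxes)
            if j != i
        )
        neighbours[i] = [j for _, j in dists[:k]]
    return neighbours


def node_features(box):
    """Toy node representation: position plus two crude text statistics."""
    digits = sum(ch.isdigit() for ch in box.text)
    return [box.x, box.y, float(len(box.text)), float(digits)]


def sage_mean_layer(features, neighbours, w_self, w_neigh):
    """One GraphSAGE-style layer: ReLU(W_self * h_v + W_neigh * mean(h_u))."""
    out = []
    for i, h in enumerate(features):
        nbrs = neighbours[i]
        if nbrs:
            mean = [sum(features[j][d] for j in nbrs) / len(nbrs)
                    for d in range(len(h))]
        else:
            mean = [0.0] * len(h)
        row = []
        for ws, wn in zip(w_self, w_neigh):
            z = sum(a * b for a, b in zip(ws, h))
            z += sum(a * b for a, b in zip(wn, mean))
            row.append(max(0.0, z))  # ReLU
        out.append(row)
    return out


# Placeholder text regions, as they might come out of the TD and OCR modules.
boxes = [
    TextBox("No. 123/QD", 0.1, 0.1),
    TextBox("Ha Noi", 0.8, 0.1),
    TextBox("DECISION", 0.5, 0.3),
]
graph = knn_edges(boxes, k=2)
feats = [node_features(b) for b in boxes]

# Hand-picked weights for a 2-unit layer; a trained model would learn these.
w_self = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
w_neigh = [[0.0, 0.0, 0.1, 0.0], [0.0, 0.0, 0.0, 0.1]]
embeddings = sage_mean_layer(feats, graph, w_self, w_neigh)
print(embeddings)
```

In a full IE module, several such layers would typically be stacked and followed by a per-node classifier that assigns each text region a field label. A GAT-style layer would replace the plain mean with a learned, attention-weighted average over the neighbours.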