DeepParse: A Trainable Postal Address Parser
N. Abid, A. Ul-Hasan, F. Shafait
2018 Digital Image Computing: Techniques and Applications (DICTA), December 2018
DOI: 10.1109/DICTA.2018.8615844
Citations: 13
Abstract
Postal applications are among the first beneficiaries of advancements in document image processing techniques due to their economic significance. Automating postal services requires integrating contributions from a wide range of image processing domains, from image acquisition and preprocessing to interpretation through symbol, character and word recognition. Lately, machine learning approaches have been deployed for postal address processing. The parsing problem has been explored using different techniques, such as regular expressions, Conditional Random Fields (CRFs), Hidden Markov Models (HMMs), Decision Trees and Support Vector Machines (SVMs). These traditional techniques are designed on the assumption that the data is free from OCR errors, which decreases the adaptability of the architecture in real-world scenarios. Furthermore, their performance degrades in the presence of non-standardized addresses, resulting in intermixing of similar classes. In this paper, we present DeepParse, the first trainable neural-network-based robust architecture for postal address parsing; it tackles these issues and can be applied to any Named Entity Recognition (NER) problem. The architecture takes input at different granularity levels — characters, character trigrams and words — to extract and learn features and classify the addresses. The model was trained on a synthetically generated dataset and tested on real-world addresses. DeepParse has also been evaluated on a standard NER dataset, CoNLL-2003, where it achieved 90.44%, on par with the state-of-the-art technique.
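The multi-granularity input described in the abstract — character, character-trigram and word views of the same address — can be illustrated with a small tokenization sketch. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and in DeepParse these views would feed embedding layers of a neural network rather than be used directly.

```python
def char_tokens(text: str) -> list[str]:
    # Character-level view: every character (including spaces) is a token.
    return list(text)

def char_trigrams(text: str) -> list[str]:
    # Trigram view: overlapping windows of three consecutive characters.
    return [text[i:i + 3] for i in range(len(text) - 2)]

def word_tokens(text: str) -> list[str]:
    # Word-level view: whitespace-separated tokens.
    return text.split()

address = "12 Main St Springfield"
views = {
    "chars": char_tokens(address),
    "trigrams": char_trigrams(address),
    "words": word_tokens(address),
}
# Each view captures different cues: characters tolerate OCR noise,
# trigrams capture subword patterns, words carry lexical identity.
```

Combining all three views is one plausible way to make a parser robust to the OCR errors and non-standardized addresses the abstract highlights, since a corrupted word can still be partially matched at the character or trigram level.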