A robust technique for character string extraction from complex document images
Yen-Lin Chen
2008 International Symposium on Information Technology, 2008-09-26
DOI: 10.1109/ITSIM.2008.4632015 (https://doi.org/10.1109/ITSIM.2008.4632015)
This study proposes a new technique for segmenting and extracting character strings from real-life complex document images. The technique first decomposes the document image into distinct object planes, separating homogeneous objects: textual regions of interest, non-text objects such as graphics and pictures, and background textures. A text extraction procedure is then applied to each resulting plane to extract the character strings whose characteristics match that plane. Because the document image is processed regionally and adaptively according to its local features, the detailed characteristics of the extracted textual objects are well preserved, especially small characters with thin strokes. Experimental results and comparisons with an existing technique demonstrate the effectiveness and advantages of the proposed approach in extracting character strings of various illuminations, sizes, and font styles from various types of complex document images.
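The abstract gives no algorithmic details, so the following is only a minimal sketch of the general idea it describes (region-wise adaptive decomposition into planes, then string extraction within a plane), not the author's method. All function names, the block size, the contrast threshold, and the gap tolerance are assumptions chosen for illustration; the per-block midpoint threshold and the 4-connected component grouping are stand-in techniques.

```python
from collections import deque

def decompose_into_planes(image, block=4, min_contrast=50):
    """Label each pixel 0 (flat region), 1 (locally dark), or 2 (locally light)."""
    h, w = len(image), len(image[0])
    labels = [[0] * w for _ in range(h)]
    for by in range(0, h, block):
        for bx in range(0, w, block):
            ys = range(by, min(by + block, h))
            xs = range(bx, min(bx + block, w))
            values = [image[y][x] for y in ys for x in xs]
            lo, hi = min(values), max(values)
            if hi - lo < min_contrast:
                continue  # low-contrast block: treat as flat background
            threshold = (lo + hi) / 2  # chosen from this block's values only
            for y in ys:
                for x in xs:
                    labels[y][x] = 1 if image[y][x] < threshold else 2
    return labels

def connected_boxes(labels, plane):
    """Bounding boxes (x0, y0, x1, y1) of 4-connected components in one plane."""
    h, w = len(labels), len(labels[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for sy in range(h):
        for sx in range(w):
            if labels[sy][sx] != plane or seen[sy][sx]:
                continue
            seen[sy][sx] = True
            queue = deque([(sx, sy)])
            x0 = x1 = sx
            y0 = y1 = sy
            while queue:
                x, y = queue.popleft()
                x0, x1 = min(x0, x), max(x1, x)
                y0, y1 = min(y0, y), max(y1, y)
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    if (0 <= nx < w and 0 <= ny < h
                            and not seen[ny][nx] and labels[ny][nx] == plane):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((x0, y0, x1, y1))
    return boxes

def group_into_strings(boxes, max_gap=3):
    """Chain boxes left to right when they overlap vertically and sit close together."""
    strings = []
    for box in sorted(boxes):
        for line in strings:
            last = line[-1]
            overlaps = box[1] <= last[3] and last[1] <= box[3]
            if overlaps and box[0] - last[2] <= max_gap:
                line.append(box)
                break
        else:
            strings.append([box])
    return strings

# Toy 8x16 grayscale page: light background, three close strokes and one distant stroke.
image = [[200] * 16 for _ in range(8)]
for x in (1, 4, 7, 14):
    for y in range(1, 6):
        image[y][x] = 30

labels = decompose_into_planes(image)
strings = group_into_strings(connected_boxes(labels, plane=1))
print([len(s) for s in strings])  # prints [3, 1]
```

Thresholding each block from its own minimum and maximum is one simple way to keep thin strokes from being washed out by a global threshold, which is the kind of local adaptivity the abstract emphasizes. A realistic implementation would need overlapping regions, color handling, and filtering of non-text components.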