Text Extraction and Categorization from Watermark Scientific Document in Bulk
Wai Chong Chia, P. Teh, C. M. Gill
2018 3rd International Conference on Computational Intelligence and Applications (ICCIA), published 2018-07-01
DOI: 10.1109/ICCIA.2018.00017 (https://doi.org/10.1109/ICCIA.2018.00017)
Extracting information from a large number of scientific documents in Portable Document Format (PDF) is time-consuming without the help of an automated system. However, the lack of structural information in PDF files can cause many problems during extraction. Watermarks are one kind of object that interferes with this process. When a PDF extraction tool is applied to a watermarked PDF, the watermark can disturb the order of the text and is often extracted as part of the text. If the extracted text is later used for analysis, the watermark may reduce the accuracy of the results, since it should not be taken into consideration. This paper proposes an approach to overcome this issue. The proposed approach uses both direct text extraction from the PDF and optical character recognition (OCR) to produce two versions of the digital text, which can be combined for better accuracy. The results show that the proposed approach is capable of extracting text from PDFs with different watermark patterns.
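The idea of combining two text versions can be illustrated with a minimal sketch. This is not the authors' method, whose details the abstract does not give. It assumes, purely for illustration, that the OCR pass does not pick up the watermark while direct extraction does, so that a directly extracted line with no close counterpart in the OCR output is treated as watermark text. The function names, the sample lines, and the 0.8 similarity threshold are all hypothetical. In practice the two line lists would come from a PDF text extractor and an OCR engine.

```python
from difflib import SequenceMatcher


def best_similarity(line, candidates):
    """Return the highest similarity ratio between line and any candidate."""
    return max(
        (SequenceMatcher(None, line.lower(), c.lower()).ratio() for c in candidates),
        default=0.0,
    )


def combine_extractions(direct_lines, ocr_lines, threshold=0.8):
    """Keep directly extracted lines that the OCR pass also recovered.

    Assumption (not from the paper): watermark text shows up in the direct
    extraction but not in the OCR output, so unmatched lines are dropped.
    The direct version of each kept line is preferred because it is free
    of OCR misspellings.
    """
    kept = []
    for line in direct_lines:
        text = line.strip()
        if text and best_similarity(text, ocr_lines) >= threshold:
            kept.append(text)
    return kept


# Hypothetical sample data: "DRAFT COPY" stands in for a watermark string.
direct = [
    "Extracting information from PDF files",
    "DRAFT COPY",
    "is a time-consuming process.",
]
ocr = [
    "Extracting informaton from PDF files",  # simulated OCR misspelling
    "is a time-consuming process.",
]

cleaned = combine_extractions(direct, ocr)
print(cleaned)
```

Fuzzy matching is used rather than exact comparison so that small OCR errors do not cause genuine lines to be discarded. The threshold would need tuning on real documents.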