Text Extraction and Categorization from Watermark Scientific Document in Bulk

2018 3rd International Conference on Computational Intelligence and Applications (ICCIA) Pub Date : 2018-07-01 DOI:10.1109/ICCIA.2018.00017

Wai Chong Chia, P. Teh, C. M. Gill


Citations: 0

Abstract



Extracting information from a large number of scientific documents in Portable Document Format (PDF) is time-consuming without the help of an automated system. However, the lack of structural information in PDF can cause many problems during extraction. Watermarks are one kind of object that can interfere with it. When a PDF extraction tool is applied to a watermarked PDF, the watermark can disturb the order of the text and is often extracted as part of the text. If the text is later used for analysis, the watermark may reduce the accuracy of the results, since it should not be taken into consideration. This paper proposes an approach to overcome this issue. The approach uses direct text recognition from the PDF and optical character recognition (OCR) to produce two versions of the digital text, which can be combined for better accuracy. The results show that the proposed approach is capable of extracting text from PDFs with different watermark patterns.
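
The abstract does not say how the two text versions are combined, so the following is only a minimal sketch of one possible combination step, not the authors' method. It assumes (hypothetically) that the OCR pass does not pick up the watermark, so that words present only in the direct extraction can be treated as watermark residue. The function name `merge_extractions` and the sample strings are illustrative.

```python
from difflib import SequenceMatcher


def merge_extractions(direct_text: str, ocr_text: str) -> str:
    """Keep only the words of direct_text that align with ocr_text.

    Assumption (not from the paper): the OCR pass misses the watermark,
    so words found only in the direct extraction are dropped as residue.
    """
    direct_words = direct_text.split()
    ocr_words = ocr_text.split()
    # Align the two word sequences case-insensitively.
    matcher = SequenceMatcher(
        a=[w.lower() for w in direct_words],
        b=[w.lower() for w in ocr_words],
        autojunk=False,
    )
    kept = []
    # Each matching block is a run of words shared by both versions, in order.
    for block in matcher.get_matching_blocks():
        kept.extend(direct_words[block.a : block.a + block.size])
    return " ".join(kept)


# Hypothetical example: "DRAFT" stands in for watermark text.
direct = "Extracting information DRAFT from scientific documents DRAFT is slow"
ocr = "Extracting information from scientific documents is slow"
print(merge_extractions(direct, ocr))
```

Under this assumption, a word that OCR misreads would also be dropped from the direct text, so a practical merge would need a tolerance for near matches; that refinement is left out here.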
