A Comparative Evaluation of PDF-to-HTML Conversion Tools
Pramodya Pathirana, Asini Silva, Thenuka Lawrence, T. Weerasinghe, Roshan Abeyweera
2023 International Research Conference on Smart Computing and Systems Engineering (SCSE), published 2023-06-29
DOI: 10.1109/SCSE59836.2023.10214989 (https://doi.org/10.1109/SCSE59836.2023.10214989)
Abstract
PDF (Portable Document Format) is a popular file format for sharing and storing documents across platforms. However, the content of a PDF document sometimes needs to be repurposed for online use, and PDF-to-HTML conversion is a common way to do so. This paper presents a comparative evaluation of existing PDF-to-HTML conversion tools, assessing their suitability for extracting text and images. The tools were tested on Sri Lankan school textbooks, which contain complex text formatting and non-textual elements. The evaluation criteria included the accuracy of the output and the handling of complex text formatting and non-textual elements, and the tools were compared on their performance against each criterion. The study offers useful insights for individuals and organizations looking to repurpose PDF content as HTML for online use, particularly in the education sector.