Authors: Deliang Jiang, Xiaohu Yang
Published in: Proceedings of the 2nd International Conference on Interaction Sciences: Information Technology, Culture and Human
Publication date: 2009-11-24
DOI: 10.1145/1655925.1656103 (https://doi.org/10.1145/1655925.1656103)
Converting PDF to HTML approach based on text detection
Converting a PDF document to an HTML document with the same layout is an important and interesting research problem. After conversion, the document can be browsed online easily and information can be extracted from it. Building on the text extraction results of the open-source tool PDFBox, the paper describes a method that detects the layout information of a PDF document and converts the document into an HTML page effectively.
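
The abstract does not spell out the detection algorithm, so the following is only a minimal sketch of the general idea and not the authors' method. It assumes a text extractor has already produced positioned text runs; `Fragment`, `group_into_lines`, and `to_html` are hypothetical names introduced for illustration, and nothing here calls PDFBox. Runs with nearly equal vertical positions are grouped into lines, and each line is emitted as an absolutely positioned HTML element so the page layout is roughly preserved.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class Fragment:
    """Hypothetical positioned text run (not a PDFBox type)."""
    text: str
    x: float          # left edge, in points
    y: float          # top edge, measured from the top of the page, in points
    font_size: float  # in points


def group_into_lines(fragments, tolerance=2.0):
    """Group fragments whose top edges lie within `tolerance` points."""
    lines = []
    for frag in sorted(fragments, key=lambda f: (f.y, f.x)):
        # Compare against the first fragment of the most recent line.
        if lines and abs(lines[-1][0].y - frag.y) <= tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    # Within each line, read left to right.
    for line in lines:
        line.sort(key=lambda f: f.x)
    return lines


def to_html(fragments, page_width=612, page_height=792):
    """Emit one absolutely positioned <div> per detected text line."""
    parts = [
        f'<div style="position:relative;width:{page_width}pt;'
        f'height:{page_height}pt">'
    ]
    for line in group_into_lines(fragments):
        first = line[0]
        # Escape text so characters such as "<" do not break the markup.
        text = " ".join(escape(f.text) for f in line)
        parts.append(
            f'<div style="position:absolute;left:{first.x}pt;top:{first.y}pt;'
            f'font-size:{first.font_size}pt;white-space:nowrap">{text}</div>'
        )
    parts.append("</div>")
    return "\n".join(parts)


if __name__ == "__main__":
    sample = [
        Fragment("world", 60.0, 101.0, 12.0),
        Fragment("Hello", 20.0, 100.0, 12.0),
        Fragment("Next <line>", 20.0, 120.0, 12.0),
    ]
    print(to_html(sample))
```

Absolute positioning is the simplest way to keep the original layout, at the cost of HTML that does not reflow. A fuller converter would also need to detect paragraphs, columns, and tables, which this sketch does not attempt.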