Generating summary documents for a variable-quality PDF document collection

Proceedings of the ACM Symposium on Document Engineering. Pub Date: 2014-09-16. DOI: 10.1145/2644866.2644892

Jacob Hughes, D. Brailsford, S. Bagley, C. Adams

Pages: 49-52

Citations: 5

Abstract

The Cochrane Schizophrenia Group's Register of studies details all aspects of the effects of treating people with schizophrenia. It has been gathered over the last 20 years and consists of around 20,000 documents, overwhelmingly in PDF. Document collections of this sort -- on a given theme but gathered from a wide range of sources -- will generally have huge variability in the quality of the PDF, particularly with respect to the key property of text searchability.

Summarising the results from the best of these papers, to allow evidence-based health care decision making, has so far been done by manually creating a summary document, starting from a visual inspection of the relevant PDF file. This labour-intensive process has resulted, to date, in only 4,000 of the papers being summarised -- with enormous duplication of effort and with many issues around the validity and reliability of the data extraction.

This paper describes a pilot project to provide a computer-assisted framework in which any of the PDF documents could be searched for the occurrence of some 8,000 keywords and key phrases. Once keyword tagging has been completed, the framework assists in the generation of a standard summary document, thereby greatly speeding up the production of these summaries. Early examples of the framework are described and its capabilities illustrated.
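The keyword-tagging step described in the abstract can be illustrated with a minimal sketch. This is not the authors' framework: it assumes the text has already been extracted from a PDF by some other tool, and the function names (`build_matcher`, `tag_text`, `summary_lines`), the sample keywords, and the summary layout are all hypothetical, chosen only to show how a large list of keywords and key phrases might be matched in a single pass.

```python
import re
from collections import Counter


def build_matcher(phrases):
    """Compile one case-insensitive regex that matches any keyword or key phrase.

    Hypothetical helper: the paper does not describe its matching method.
    """
    # Longest first so that, where two phrases start at the same position,
    # the longer phrase wins over its shorter prefix.
    ordered = sorted({p.lower() for p in phrases}, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(p) for p in ordered) + r")\b"
    return re.compile(pattern, re.IGNORECASE)


def tag_text(text, matcher):
    """Return a Counter mapping each matched keyword to its number of occurrences."""
    # Collapse whitespace so a phrase broken across extracted lines still matches.
    normalised = re.sub(r"\s+", " ", text)
    return Counter(m.group(0).lower() for m in matcher.finditer(normalised))


def summary_lines(doc_id, tags):
    """Build a plain-text summary skeleton (assumed layout, not the paper's)."""
    lines = [f"Document: {doc_id}", "Keywords found:"]
    # Most frequent first; ties broken alphabetically for stable output.
    for keyword, count in sorted(tags.items(), key=lambda kv: (-kv[1], kv[0])):
        lines.append(f"  {keyword}: {count}")
    return lines


if __name__ == "__main__":
    # Illustrative keywords and text only; not taken from the Register.
    keywords = ["haloperidol", "placebo", "tardive dyskinesia", "dyskinesia"]
    extracted = (
        "Patients received haloperidol or placebo. Tardive\n"
        "dyskinesia was recorded; haloperidol doses varied."
    )
    tags = tag_text(extracted, build_matcher(keywords))
    print("\n".join(summary_lines("example-001", tags)))
```

Compiling all phrases into one alternation keeps tagging to a single scan per document, which matters when the keyword list runs to thousands of entries. Matches do not overlap, so a short keyword inside an already matched longer phrase is not counted separately. A PDF with no usable text layer would yield little or no extracted text and so few or no tags, which is where the variability in searchability noted above would show up.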

