CCpdf: Building a High Quality Corpus for Visually Rich Documents from Web Crawl Data

M. Turski, Tomasz Stanisławek, Karol Kaczmarek, Paweł Dyda, Filip Graliński
{"title":"CCpdf: Building a High Quality Corpus for Visually Rich Documents from Web Crawl Data","authors":"M. Turski, Tomasz Stanislawek, Karol Kaczmarek, Pawel Dyda, Filip Grali'nski","doi":"10.48550/arXiv.2304.14953","DOIUrl":null,"url":null,"abstract":"In recent years, the field of document understanding has progressed a lot. A significant part of this progress has been possible thanks to the use of language models pretrained on large amounts of documents. However, pretraining corpora used in the domain of document understanding are single domain, monolingual, or nonpublic. Our goal in this paper is to propose an efficient pipeline for creating a big-scale, diverse, multilingual corpus of PDF files from all over the Internet using Common Crawl, as PDF files are the most canonical types of documents as considered in document understanding. We analysed extensively all of the steps of the pipeline and proposed a solution which is a trade-off between data quality and processing time. We also share a CCpdf corpus in a form or an index of PDF files along with a script for downloading them, which produces a collection useful for language model pretraining. The dataset and tools published with this paper offer researchers the opportunity to develop even better multilingual language models.","PeriodicalId":294655,"journal":{"name":"IEEE International Conference on Document Analysis and Recognition","volume":"9 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE International Conference on Document Analysis and Recognition","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.48550/arXiv.2304.14953","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1

Abstract

In recent years, the field of document understanding has progressed substantially. A significant part of this progress has been possible thanks to language models pretrained on large collections of documents. However, the pretraining corpora used in the domain of document understanding are single-domain, monolingual, or not publicly available. Our goal in this paper is to propose an efficient pipeline for creating a large-scale, diverse, multilingual corpus of PDF files from all over the Internet using Common Crawl, as PDF files are the most canonical type of document considered in document understanding. We extensively analysed all of the steps of the pipeline and propose a solution that is a trade-off between data quality and processing time. We also share the CCpdf corpus in the form of an index of PDF files, along with a script for downloading them, which produces a collection useful for language model pretraining. The dataset and tools published with this paper offer researchers the opportunity to develop even better multilingual language models.
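The abstract describes a Common Crawl based pipeline and a script for downloading the indexed PDF files. As a rough illustration of that general idea (not the authors' actual pipeline or released script), the sketch below queries a Common Crawl CDX index for captures whose detected MIME type is application/pdf and fetches each record's bytes with an HTTP Range request. The crawl label, the URL pattern, and the filter syntax are assumptions chosen for illustration; the code requires the requests and warcio packages.

```python
"""Minimal sketch: locate and download PDF captures from Common Crawl.

Illustrative only -- this is not the CCpdf pipeline. The crawl snapshot
(CC-MAIN-2023-14), the example URL pattern, and the filter syntax are
assumptions. Requires: pip install requests warcio
"""
import io
import json

import requests
from warcio.archiveiterator import ArchiveIterator

CDX_API = "https://index.commoncrawl.org/CC-MAIN-2023-14-index"  # assumed snapshot
DATA_PREFIX = "https://data.commoncrawl.org/"


def find_pdf_records(url_pattern="*.example.com", limit=10):
    """Query the Common Crawl CDX index for captures detected as PDFs."""
    params = {
        "url": url_pattern,
        "output": "json",
        "filter": "mime-detected:application/pdf",  # assumed filter syntax
        "limit": str(limit),
    }
    resp = requests.get(CDX_API, params=params, timeout=60)
    resp.raise_for_status()
    # The CDX server returns one JSON object per line.
    return [json.loads(line) for line in resp.text.splitlines() if line]


def download_pdf(record):
    """Fetch one WARC record by byte range and return the PDF payload bytes."""
    offset = int(record["offset"])
    length = int(record["length"])
    warc_url = DATA_PREFIX + record["filename"]
    headers = {"Range": f"bytes={offset}-{offset + length - 1}"}
    resp = requests.get(warc_url, headers=headers, timeout=120)
    resp.raise_for_status()
    # The ranged download is a self-contained gzipped WARC record.
    for warc_record in ArchiveIterator(io.BytesIO(resp.content)):
        if warc_record.rec_type == "response":
            return warc_record.content_stream().read()
    return None


if __name__ == "__main__":
    for rec in find_pdf_records():
        pdf_bytes = download_pdf(rec)
        if pdf_bytes and pdf_bytes.startswith(b"%PDF-"):
            print(rec["url"], len(pdf_bytes), "bytes")
```

A realistic corpus-building run would iterate over the full columnar index of a crawl rather than a per-domain CDX query, and would add deduplication, language identification, and quality filtering, which are the kinds of trade-offs between data quality and processing time that the paper analyses.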