ScanBank: A Benchmark Dataset for Figure Extraction from Scanned Electronic Theses and Dissertations

S. Kahu, William A. Ingram, E. Fox, Jian Wu
{"title":"ScanBank: A Benchmark Dataset for Figure Extraction from Scanned Electronic Theses and Dissertations","authors":"S. Kahu, William A. Ingram, E. Fox, Jian Wu","doi":"10.1109/JCDL52503.2021.00030","DOIUrl":null,"url":null,"abstract":"We focus on electronic theses and dissertations (ETDs), aiming to improve access and expand their utility, since more than 6 million are publicly available, and they constitute an important corpus to aid research and education across disciplines. The corpus is growing as new born-digital documents are included, and since millions of older theses and dissertations have been converted to digital form to be disseminated electronically in institutional repositories. In ETDs, as with other scholarly works, figures and tables can communicate a large amount of information in a concise way. Although methods have been proposed for extracting figures and tables from born-digital PDFs, they do not work well with scanned ETDs. Considering this problem, our assessment of state-of-the-art figure extraction systems is that the reason they do not function well on scanned PDFs is that they have only been trained on born-digital documents. To address this limitation, we present ScanBank, a new dataset containing 10 thousand scanned page images, manually labeled by humans as to the presence of the 3.3 thousand figures or tables found therein. We use this dataset to train a deep neural network model based on YOLOv5 to accurately extract figures and tables from scanned ETDs. We pose and answer important research questions aimed at finding better methods for figure extraction from scanned documents. One of those concerns the value for training, of data augmentation techniques applied to born-digital documents which are used to train models better suited for figure extraction from scanned documents. To the best of our knowledge, ScanBank is the first manually annotated dataset for figure and table extraction for scanned ETDs. A YOLOv5-based model, trained on ScanBank, outperforms existing comparable open-source and freely available baseline methods by a considerable margin.","PeriodicalId":112400,"journal":{"name":"2021 ACM/IEEE Joint Conference on Digital Libraries (JCDL)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"10","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 ACM/IEEE Joint Conference on Digital Libraries (JCDL)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/JCDL52503.2021.00030","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 10

Abstract

We focus on electronic theses and dissertations (ETDs), aiming to improve access and expand their utility, since more than 6 million are publicly available and they constitute an important corpus to aid research and education across disciplines. The corpus is growing as new born-digital documents are added and as millions of older theses and dissertations are converted to digital form for electronic dissemination in institutional repositories. In ETDs, as with other scholarly works, figures and tables can communicate a large amount of information in a concise way. Although methods have been proposed for extracting figures and tables from born-digital PDFs, they do not work well with scanned ETDs. Our assessment of state-of-the-art figure extraction systems is that they do not function well on scanned PDFs because they have only been trained on born-digital documents. To address this limitation, we present ScanBank, a new dataset containing 10,000 scanned page images, manually labeled for the presence of the 3,300 figures and tables found therein. We use this dataset to train a deep neural network model based on YOLOv5 to accurately extract figures and tables from scanned ETDs. We pose and answer important research questions aimed at finding better methods for figure extraction from scanned documents. One of these concerns the training value of data augmentation techniques applied to born-digital documents, used to build models better suited for figure extraction from scanned documents. To the best of our knowledge, ScanBank is the first manually annotated dataset for figure and table extraction from scanned ETDs. A YOLOv5-based model, trained on ScanBank, outperforms existing comparable open-source and freely available baseline methods by a considerable margin.
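The abstract mentions augmenting born-digital pages so they better resemble scanned ones before training a YOLOv5-based detector. The sketch below is a minimal, purely illustrative example of such scan-like degradation; the specific transforms, parameter ranges, and the function name `simulate_scan` are assumptions for illustration, not the authors' actual pipeline.

```python
# Illustrative only: common scan-like degradations (skew, blur, noise, fading)
# applied to a born-digital page render. Not the augmentation used in the paper.
import numpy as np
from PIL import Image, ImageFilter, ImageEnhance

def simulate_scan(page: Image.Image, seed: int = 0) -> Image.Image:
    """Apply scan-like artifacts to a born-digital page image (hypothetical)."""
    rng = np.random.default_rng(seed)
    img = page.convert("L")                                   # scanned pages are often grayscale
    img = img.rotate(rng.uniform(-1.5, 1.5), fillcolor=255)   # slight skew from the scanner bed
    img = img.filter(ImageFilter.GaussianBlur(radius=rng.uniform(0.3, 1.0)))  # optical blur
    img = ImageEnhance.Contrast(img).enhance(rng.uniform(0.7, 1.0))           # faded toner
    arr = np.asarray(img, dtype=np.float32)
    arr += rng.normal(0.0, 8.0, arr.shape)                    # sensor / paper-grain noise
    return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))

# Usage: degraded = simulate_scan(Image.open("born_digital_page.png"))
```

Pages degraded this way, together with their existing figure and table bounding boxes, could then serve as additional training data for a detector intended to run on scanned documents.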