Data Curation in Practice: Extract Tabular Data from PDF Files Using a Data Analytics Tool

A. J. Choi, Xuying Xin
Journal of eScience Librarianship. Published 2021-08-11. DOI: https://doi.org/10.7191/jeslib.2021.1209

Abstract

Data curation is the process of managing data so that it is available for reuse and preservation and supports FAIR (findable, accessible, interoperable, reusable) uses. It is an important part of the research lifecycle because researchers are often required by funders, or at least encouraged, to preserve their datasets and make them discoverable and reusable. This has become especially important as Open Access (OA) policies are implemented at many institutions across the nation. An efficient data repository and its data curation play key roles in facilitating research data discovery and making the data easier to reuse. In this article, we briefly discuss the local institutional repository at Penn State University and the general data curation practices we adopt for deposited files and datasets, then focus on a data analytics tool that has recently been applied to extract tabular data from PDF files. This enhances the existing data curation practices by adding tabular data to deposits whose PDF files contain embedded tables that are not easily reused.