Data Curation in Practice: Extract Tabular Data from PDF Files Using a Data Analytics Tool

Journal of eScience Librarianship · Pub Date: 2021-08-11 · DOI: 10.7191/jeslib.2021.1209

A. J. Choi, Xuying Xin


Citations: 0

Abstract


Data Curation in Practice: Extract Tabular Data from PDF Files Using a Data Analytics Tool

Data curation is the process of managing data to make it available for reuse and preservation and to allow FAIR (findable, accessible, interoperable, reusable) uses. It is an important part of the research lifecycle, as researchers are often either required by funders or generally encouraged to preserve their datasets and make them discoverable and reusable. This has been especially important as Open Access (OA) policies are being implemented in many institutions across the nation. An efficient data repository and its data curation play key roles in facilitating the discovery of research data and making reuse easier. In this article, we briefly discuss the local institutional repository at Penn State University and the general data curation practices we adopt for deposited files and datasets; we then focus on a data analytics tool that has recently been applied to extract tabular data from PDF files. This enhances the existing data curation practices by adding tabular data to deposits containing PDF files, where tables are often embedded and not easily reused.


Source journal: Journal of eScience Librarianship

Self-citation rate: 0.00%

Review time: 16 weeks