Scholarly big data information extraction and integration in the CiteSeerχ digital library

2014 IEEE 30th International Conference on Data Engineering Workshops | Pub Date: 2014-05-19 | DOI: 10.1109/ICDEW.2014.6818305

Kyle Williams, Jian Wu, Sagnik Ray Choudhury, Madian Khabsa, C. Lee Giles


Citations: 56

Abstract

CiteSeerχ is a digital library that contains approximately 3.5 million scholarly documents and receives between 2 and 4 million requests per day. In addition to making documents available via a public Website, the data is also used to facilitate research in areas like citation analysis, co-author network analysis, scalability evaluation and information extraction. The papers in CiteSeerχ are gathered from the Web by means of continuous automatic focused crawling and go through a series of automatic processing steps as part of the ingestion process. Given the size of the collection, the fact that it is constantly expanding, and the multiple ways in which it is used both by the public to access scholarly documents and for research, there are several big data challenges. In this paper, we provide a case study description of how we address these challenges when it comes to information extraction, data integration and entity linking in CiteSeerχ. We describe how we: aggregate data from multiple sources on the Web; store and manage data; process data as part of an automatic ingestion pipeline that includes automatic metadata and information extraction; perform document and citation clustering; perform entity linking and name disambiguation; and make our data and source code available to enable research and collaboration.
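The abstract mentions document clustering as one step of the ingestion pipeline but does not say how it is done. As a rough illustration only, the sketch below groups records that appear to describe the same paper by a normalized key. The key scheme (lowercased title with punctuation removed, plus the first author's last name), the function names, and the record layout are all assumptions made for this example; they are not taken from the paper or from the CiteSeerχ code base.

```python
import re
from collections import defaultdict


def normalize_key(title, authors):
    """Build a hypothetical clustering key (an assumption, not the paper's method):
    lowercased title with punctuation collapsed to spaces, plus the first
    author's last name."""
    title_key = re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()
    first_author_last = authors[0].split()[-1].lower() if authors else ""
    return f"{title_key}|{first_author_last}"


def cluster_documents(records):
    """Group record ids that share the same normalized key."""
    clusters = defaultdict(list)
    for rec in records:
        key = normalize_key(rec["title"], rec["authors"])
        clusters[key].append(rec["id"])
    return dict(clusters)


# Illustrative records; the ids and field names are made up for this sketch.
example_records = [
    {"id": "d1", "title": "Scholarly Big Data: Information Extraction",
     "authors": ["Kyle Williams"]},
    {"id": "d2", "title": "scholarly big data information extraction.",
     "authors": ["K. Williams"]},
    {"id": "d3", "title": "Another Paper", "authors": ["Jian Wu"]},
]

if __name__ == "__main__":
    for key, ids in cluster_documents(example_records).items():
        print(key, ids)
```

Exact-key matching like this is cheap to run over millions of records, but it misses variants with typos or reordered authors, so a production system would likely need a fuzzier comparison on top of it.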

