What's there and what's not?: focused crawling for missing documents in digital libraries

Ziming Zhuang, R. Wagle, C. Lee Giles
DOI: 10.1145/1065385.1065455
Published in: Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL '05)
Publication date: 2005-06-07
Citations: 51

Abstract

Some large-scale topical digital libraries, such as CiteSeer, harvest online academic documents by crawling open-access archives, university and author homepages, and authors' self-submissions. While these approaches have so far built reasonably sized libraries, they can suffer from having only a portion of the documents from specific publishing venues. We propose to use alternative online resources and techniques that maximally exploit other resources to build the complete document collection of any given publication venue. We investigate the feasibility of using publication metadata to guide the crawler towards authors' homepages to harvest what is missing from a digital library collection. We collect a real-world dataset from two Computer Science publishing venues, involving a total of 593 unique authors over a time frame of 1998 to 2004. We then identify the missing papers that are not indexed by CiteSeer. Using a fully automatic heuristic-based system that can locate authors' homepages and then apply focused crawling to download the desired papers, we demonstrate that it is practical to use a focused crawler to harvest academic papers that are missing from our digital library. Our harvester achieves an average recall of 0.82 overall and 0.75 for the missing documents. Evaluation of the crawler's performance based on the harvest rate shows definite advantages over other crawling approaches, and our crawler consistently outperforms a defined baseline crawler on a number of measures.
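The core idea described in the abstract — using known publication metadata (e.g. paper titles) to steer a crawler through author homepages toward the missing documents — can be sketched as a priority-driven frontier that ranks outgoing links by how well their anchor text matches the target title. This is a minimal illustrative sketch, not the paper's actual system; the scoring function (token-overlap Jaccard similarity) and class names are assumptions for illustration.

```python
import heapq
import re

def tokenize(text):
    """Lowercase a string and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def link_score(anchor_text, target_title):
    """Score a hyperlink by the Jaccard overlap between its anchor text
    and the title of the paper we are trying to find (hypothetical metric)."""
    a, t = tokenize(anchor_text), tokenize(target_title)
    if not a or not t:
        return 0.0
    return len(a & t) / len(a | t)

class FocusedFrontier:
    """Crawl frontier: a max-priority queue so the best-scoring
    (most promising) links are fetched first."""
    def __init__(self):
        self._heap = []
        self._seen = set()

    def push(self, url, score):
        if url not in self._seen:          # avoid re-queuing visited URLs
            self._seen.add(url)
            heapq.heappush(self._heap, (-score, url))  # negate for max-heap

    def pop(self):
        neg_score, url = heapq.heappop(self._heap)
        return url, -neg_score

# Usage: links whose anchor text resembles the target title rank higher.
title = "Focused crawling for missing documents in digital libraries"
frontier = FocusedFrontier()
frontier.push("http://example.edu/~author/pubs.html",
              link_score("publications", title))
frontier.push("http://example.edu/~author/crawling-paper.pdf",
              link_score("focused crawling for missing documents", title))
best_url, best_score = frontier.pop()  # the PDF link is popped first
```

In a full harvester, the popped URL would be fetched, its links extracted and scored the same way, and any downloaded PDFs matched back against the venue's metadata to confirm which missing papers were recovered.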