{"title":"审查网络链接","authors":"Na Dai, Brian D. Davison","doi":"10.1145/1645953.1646220","DOIUrl":null,"url":null,"abstract":"Many web links mislead human surfers and automated crawlers because they point to changed content, out-of-date information, or invalid URLs. It is a particular problem for large, well-known directories such as the dmoz Open Directory Project, which maintains links to representative and authoritative external web pages within their various topics. Therefore, such sites involve many editors to manually revisit and revise links that have become out-of-date. To remedy this situation, we propose the novel web mining task of identifying outdated links on the web. We build a general classification model, primarily using local and global temporal features extracted from historical content, topic, link and time-focused changes over time. We evaluate our system via five-fold cross-validation on more than fifteen thousand ODP external links selected from thirteen top-level categories. Our system can predict the actions of ODP editors more than 75% of the time. Our models and predictions could be useful for various applications that depend on analysis of web links, including ranking and crawling.","PeriodicalId":286251,"journal":{"name":"Proceedings of the 18th ACM conference on Information and knowledge management","volume":"37 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2009-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"Vetting the links of the web\",\"authors\":\"Na Dai, Brian D. Davison\",\"doi\":\"10.1145/1645953.1646220\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Many web links mislead human surfers and automated crawlers because they point to changed content, out-of-date information, or invalid URLs. It is a particular problem for large, well-known directories such as the dmoz Open Directory Project, which maintains links to representative and authoritative external web pages within their various topics. Therefore, such sites involve many editors to manually revisit and revise links that have become out-of-date. To remedy this situation, we propose the novel web mining task of identifying outdated links on the web. We build a general classification model, primarily using local and global temporal features extracted from historical content, topic, link and time-focused changes over time. We evaluate our system via five-fold cross-validation on more than fifteen thousand ODP external links selected from thirteen top-level categories. Our system can predict the actions of ODP editors more than 75% of the time. 
Our models and predictions could be useful for various applications that depend on analysis of web links, including ranking and crawling.\",\"PeriodicalId\":286251,\"journal\":{\"name\":\"Proceedings of the 18th ACM conference on Information and knowledge management\",\"volume\":\"37 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2009-11-02\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 18th ACM conference on Information and knowledge management\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/1645953.1646220\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 18th ACM conference on Information and knowledge management","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/1645953.1646220","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 2
Abstract
Many web links mislead human surfers and automated crawlers because they point to changed content, out-of-date information, or invalid URLs. This is a particular problem for large, well-known directories such as the dmoz Open Directory Project, which maintains links to representative and authoritative external web pages within its various topics. As a result, such sites rely on many editors to manually revisit and revise links that have become out-of-date. To remedy this situation, we propose the novel web mining task of identifying outdated links on the web. We build a general classification model, primarily using local and global temporal features extracted from historical content, topic, link, and time-focused changes over time. We evaluate our system via five-fold cross-validation on more than fifteen thousand ODP external links selected from thirteen top-level categories. Our system can predict the actions of ODP editors more than 75% of the time. Our models and predictions could be useful for various applications that depend on analysis of web links, including ranking and crawling.
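To make the evaluation protocol concrete, the sketch below shows what five-fold cross-validation of a binary "outdated link" classifier over link-level temporal features might look like. The feature columns, the synthetic labels, and the choice of logistic regression are illustrative assumptions made here; the abstract describes the paper's actual features and model only at a high level.

```python
# Hedged sketch: five-fold cross-validation of a binary "outdated link" classifier.
# The features, synthetic data, and logistic regression model are illustrative
# assumptions, not the paper's actual implementation.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Each row is one ODP external link; columns stand in for hypothetical local and
# global temporal features (e.g., content-change rate, topic drift, link churn).
X = rng.random((15000, 4))
# Label: 1 if an editor revised or removed the link, 0 if it was kept.
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.2, 15000) > 0.8).astype(int)

clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
print(f"Mean 5-fold accuracy: {scores.mean():.3f}")
```

With real editor-action labels in place of the synthetic ones, the reported figure corresponds to such a cross-validated accuracy exceeding 75%.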