Searching for Historical Events on a Large-Scale Web Archive

Lian'en Huang, Wu Lin, Xiaoming Li
Published in: 2010 Sixth International Conference on Semantics, Knowledge and Grids
Publication date: 2010-11-01
DOI: 10.1109/SKG.2010.37 (https://doi.org/10.1109/SKG.2010.37)
Citations: 0

Abstract

Finding knowledge on the Web has long been an active research topic. Today the Web is a popular medium for publishing news and opinion articles, which are important carriers of human knowledge, especially social knowledge. Techniques for automatically collecting and analysing these articles on a large scale are therefore desirable. In this paper we propose techniques for searching for events on the Web, and we have tested them on a large-scale web archive. Given an event, or a news topic that many people care about, the goal is to find nearly all news stories related to it. First, we propose a novel domain-independent approach to extracting news stories from web pages; it is based on anchor text and is applicable to most websites. Experiments show that our approach performs well and outperforms another approach we have found. Second, we propose a domain-based method of representing events, in which hundreds of keywords represent an event and compose the query expression. This retrieval setting differs from that of most search engines in that the number of keywords is large. We then propose several retrieval algorithms based on BM25 for this method. Evaluation shows that these algorithms perform better than unmodified BM25 in our setting, and the best one is chosen as the algorithm of our system. Finally, an experimental system has been built on a collection of 2 billion web pages, and its running performance is reported, which shows the effectiveness of our approaches.
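For orientation, the unmodified BM25 baseline mentioned in the abstract can be sketched roughly as follows. This is the standard Okapi BM25 formula applied to a long keyword query, not the modified algorithms proposed in the paper, whose details the abstract does not give. The function name `bm25_scores`, the parameter defaults `k1=1.2` and `b=0.75`, and the toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document against a keyword query with plain BM25.

    query_terms: list of keywords (may be long, e.g. hundreds of terms).
    docs: list of documents, each a list of tokens.
    k1, b: common default BM25 parameters (assumed, not taken from the paper).
    """
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    query = set(query_terms)  # each distinct keyword contributes once
    doc_sets = [set(d) for d in docs]
    # Document frequency: number of documents containing each keyword.
    df = {t: sum(1 for s in doc_sets if t in s) for t in query}

    scores = []
    for d in docs:
        tf = Counter(d)
        # Length normalisation term; treat an all-empty collection as neutral.
        length_ratio = len(d) / avg_len if avg_len > 0 else 1.0
        score = 0.0
        for t in query:
            if tf[t] == 0:
                continue
            # Non-negative idf variant: log(1 + (N - df + 0.5) / (df + 0.5)).
            idf = math.log(1 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
            denom = tf[t] + k1 * (1 - b + b * length_ratio)
            score += idf * tf[t] * (k1 + 1) / denom
        scores.append(score)
    return scores


# Hypothetical toy collection and event query.
docs = [
    ["earthquake", "rescue", "sichuan"],
    ["football", "match", "score"],
    ["earthquake", "aftershock", "relief", "rescue"],
]
query = ["earthquake", "rescue", "relief", "aftershock"]
scores = bm25_scores(query, docs)
```

In this toy run the third document matches more of the event keywords than the first and so receives the higher score, while the unrelated second document scores zero. With hundreds of keywords, the sum runs over far more terms than in a typical search-engine query, which is the setting the abstract says motivates modifying BM25.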