The Link Database: fast access to graphs of the Web

Proceedings DCC 2002. Data Compression Conference. Pub Date: 2002-04-02. DOI: 10.1109/DCC.2002.999950

K. H. Randall, Raymie Stata, J. Wiener, Rajiv Wickremesinghe


Citations: 146

Abstract

The Connectivity Server is a special-purpose database whose schema models the Web as a graph: a set of nodes (URLs) connected by directed edges (hyperlinks). The Link Database provides fast access to the hyperlinks. To support easy implementation of a wide range of graph algorithms, we have found it important to fit the Link Database into RAM. In the first version of the Link Database, we achieved this fit by using machines with lots of memory (8 GB) and storing each hyperlink in 32 bits. However, this approach was limited to roughly 100 million Web pages. This paper presents techniques to compress the links to accommodate larger graphs. Our techniques combine well-known compression methods with methods that depend on the properties of the Web graph. The first compression technique takes advantage of the fact that most hyperlinks on most Web pages point to other pages on the same host as the page itself. The second technique takes advantage of the fact that many pages on the same host share hyperlinks, that is, they tend to point to a common set of pages. Together, these techniques reduce space requirements to under 6 bits per link. While (de)compression adds latency to the hyperlink access time, we can still compute the strongly connected components of a 6-billion-edge graph in 22 minutes and run applications such as Kleinberg's HITS in real time. This paper describes our techniques for compressing the Link Database and provides performance numbers for compression ratios and decompression speed.
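
The abstract does not spell out the encoding, so the following is only a minimal sketch of how the first observation (most links stay on the same host) could be exploited. It assumes, hypothetically, that page IDs are assigned in sorted URL order, so that pages on one host receive nearby IDs and the gaps between a page's ID and its link targets tend to be small. The function names `encode_links` and `decode_links` and the byte-aligned variable-length code are illustrative choices, not the authors' method.

```python
from typing import List


def zigzag(n: int) -> int:
    """Map a signed integer to an unsigned one so small magnitudes stay small."""
    return 2 * n if n >= 0 else -2 * n - 1


def unzigzag(n: int) -> int:
    """Inverse of zigzag."""
    return n // 2 if n % 2 == 0 else -((n + 1) // 2)


def varint(n: int) -> bytes:
    """Encode an unsigned integer in 7-bit groups, lowest group first."""
    out = bytearray()
    while True:
        group = n & 0x7F
        n >>= 7
        if n:
            out.append(group | 0x80)  # high bit set: more groups follow
        else:
            out.append(group)
            return bytes(out)


def encode_links(source: int, dests: List[int]) -> bytes:
    """Gap-encode one page's outlinks (hypothetical layout).

    The first gap is taken relative to the source page ID and may be
    negative; later gaps are relative to the previous destination and
    are non-negative because the destinations are sorted.
    """
    out = bytearray()
    prev = source
    for i, dest in enumerate(sorted(dests)):
        gap = dest - prev
        out += varint(zigzag(gap) if i == 0 else gap)
        prev = dest
    return bytes(out)


def decode_links(source: int, data: bytes) -> List[int]:
    """Invert encode_links, returning the destinations in sorted order."""
    dests: List[int] = []
    prev = source
    value = 0
    shift = 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
            continue
        gap = unzigzag(value) if not dests else value
        prev += gap
        dests.append(prev)
        value = 0
        shift = 0
    return dests


# Four nearby (same-host) targets and one distant target.
links = [998, 1001, 1003, 1010, 250000]
blob = encode_links(1000, links)
bits_per_link = 8 * len(blob) / len(links)
```

In this example the four nearby targets cost one byte each and the distant one costs three, so the five links take 7 bytes rather than the 20 bytes needed at 32 bits per link. A byte-aligned code cannot go below 8 bits per link, so reaching the reported figure of under 6 bits would need a finer-grained code and the second technique, sharing link lists among similar pages on the same host, which this sketch omits. As a rough check on the motivation, and assuming all 8 GB were devoted to links, 32 bits per link allows about 2 billion links, while 6 bits per link allows more than 10 billion in the same memory.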
