The Graph Structure in the Web - Analyzed on Different Aggregation Levels

J. Web Sci. | Pub Date: 2015-07-13 | DOI: 10.1561/106.00000003
R. Meusel, S. Vigna, O. Lehmberg, Christian Bizer
Citations: 95

Abstract

Knowledge about the general graph structure of the World Wide Web is important for understanding the social mechanisms that govern its growth, for designing ranking methods, for devising better crawling algorithms, and for creating accurate models of its structure. In this paper, we analyze a large web graph. The graph was extracted from a large publicly accessible web crawl that was gathered by the Common Crawl Foundation in 2012. The graph covers over 3.5 billion web pages and 128.7 billion hyperlinks. We analyze and compare, among other features, degree distributions, connectivity, average distances, and the structure of weakly/strongly connected components. We conduct our analysis on three different levels of aggregation: page, host, and pay-level domain (PLD) (one "dot level" above public suffixes). Our analysis shows that, as evidenced by previous research (Serrano et al., 2007), some of the features previously observed by Broder et al. (2000) are very dependent on artifacts of the crawling process, whereas others appear to be more structural. We confirm the existence of a giant strongly connected component; however, we find, as observed by other researchers (Donato et al., 2005; Boldi et al., 2002; Baeza-Yates and Poblete, 2003), very different proportions of nodes that can reach or that can be reached from the giant component, suggesting that the "bow-tie structure" as described by Broder et al. is strongly dependent on the crawling process, and to the best of our current knowledge is not a structural property of the Web. More importantly, statistical testing and visual inspection of size-rank plots show that the distributions of indegree, outdegree, and sizes of strongly connected components of the page and host graphs are not power laws, contrary to what was previously reported for much smaller crawls, although they might be heavy-tailed. If we aggregate at the pay-level domain level, however, a power law emerges.
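To make the three aggregation levels concrete, the following is a minimal sketch, not the authors' pipeline, of how page-level links might be collapsed into host-level and PLD-level edges. The names `PUBLIC_SUFFIXES`, `host_of`, `pld_of`, and `aggregate` are hypothetical, the suffix set is a tiny stand-in for the real public suffix list, and merging parallel edges while dropping self-loops is an assumed convention.

```python
from urllib.parse import urlsplit

# Hypothetical, tiny stand-in for the public suffix list (assumption).
PUBLIC_SUFFIXES = {"com", "org", "uk", "co.uk"}


def host_of(url: str) -> str:
    """Return the lowercase host name of a URL."""
    return (urlsplit(url).hostname or "").lower()


def pld_of(host: str) -> str:
    """Return the longest matching public suffix plus one more label."""
    labels = host.split(".")
    # Smaller i means a longer candidate suffix, so the longest match wins.
    for i in range(1, len(labels)):
        if ".".join(labels[i:]) in PUBLIC_SUFFIXES:
            return ".".join(labels[i - 1:])
    return host  # no known suffix: fall back to the host itself


def aggregate(page_edges, key):
    """Collapse page-level edges into edges between key(url) values.

    Parallel edges are merged and self-loops are dropped (assumed convention).
    """
    edges = set()
    for src, dst in page_edges:
        a, b = key(src), key(dst)
        if a != b:
            edges.add((a, b))
    return edges


page_edges = [
    ("http://blog.example.co.uk/a", "http://www.example.co.uk/b"),
    ("http://www.example.co.uk/b", "http://news.site.org/x"),
    ("http://news.site.org/x", "http://news.site.org/y"),
]

host_edges = aggregate(page_edges, host_of)
pld_edges = aggregate(page_edges, lambda u: pld_of(host_of(u)))
```

In this toy input, the two `example.co.uk` hosts stay distinct at host level but merge into a single node at PLD level, which is why degree and component statistics can differ between levels.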
We also provide, for the first time, accurate measurements of distance-based features, using recently introduced algorithms that scale to the size of our crawl (Boldi and Vigna, 2013).
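As an illustration of what a distance-based feature is, the sketch below computes the exact distribution of shortest-path lengths on a toy directed graph by running a breadth-first search from every node. This is a brute-force baseline that would not scale to billions of pages, and it is not the algorithm of Boldi and Vigna (2013); the function names and the toy graph are assumptions made for the example.

```python
from collections import Counter, deque


def distance_distribution(adj):
    """Count ordered reachable pairs (u, v), u != v, by shortest-path length."""
    counts = Counter()
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj.get(u, ()):
                if v not in dist:  # first visit gives the shortest distance
                    dist[v] = dist[u] + 1
                    counts[dist[v]] += 1
                    queue.append(v)
    return counts


def average_distance(counts):
    """Mean shortest-path length over reachable ordered pairs only."""
    pairs = sum(counts.values())
    if pairs == 0:
        return float("nan")
    return sum(d * n for d, n in counts.items()) / pairs


# Toy graph: a cycle a -> b -> c -> a with a dangling node d reachable from c.
adj = {"a": ["b"], "b": ["c"], "c": ["a", "d"], "d": []}
counts = distance_distribution(adj)
avg = average_distance(counts)
```

Unreachable pairs, such as any pair starting at `d`, are simply excluded from the average here, which is one of several possible conventions.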