Inconsistency among evaluation metrics in link prediction.

PNAS nexus · IF 2.2 · Q2, Multidisciplinary Sciences
Pub Date: 2024-11-06 · eCollection Date: 2024-11-01 · DOI: 10.1093/pnasnexus/pgae498
Yilin Bi, Xinshan Jiao, Yan-Li Lee, Tao Zhou
{"title":"Inconsistency among evaluation metrics in link prediction.","authors":"Yilin Bi, Xinshan Jiao, Yan-Li Lee, Tao Zhou","doi":"10.1093/pnasnexus/pgae498","DOIUrl":null,"url":null,"abstract":"<p><p>Link prediction is a paradigmatic and challenging problem in network science, which aims to predict missing links, future links, and temporal links based on known topology. Along with the increasing number of link prediction algorithms, a critical yet previously ignored risk is that the evaluation metrics for algorithm performance are usually chosen at will. This paper implements extensive experiments on hundreds of real networks and 26 well-known algorithms, revealing significant inconsistency among evaluation metrics, namely different metrics probably produce remarkably different rankings of algorithms. Therefore, we conclude that any single metric cannot comprehensively or credibly evaluate algorithm performance. In terms of information content, we suggest the usage of at least two metrics: one is the area under the receiver operating characteristic curve, and the other is one of the following three candidates, say the area under the precision-recall curve, the area under the precision curve, and the normalized discounted cumulative gain. When the data are imbalanced, say the number of negative samples significantly outweighs the number of positive samples, the area under the generalized Receiver Operating Characteristic curve should also be used. In addition, as we have proved the essential equivalence of threshold-dependent metrics, if in a link prediction task, some specific thresholds are meaningful, we can consider any one threshold-dependent metric with those thresholds. This work completes a missing part in the landscape of link prediction, and provides a starting point toward a well-accepted criterion or standard to select proper evaluation metrics for link prediction.</p>","PeriodicalId":74468,"journal":{"name":"PNAS nexus","volume":"3 11","pages":"pgae498"},"PeriodicalIF":2.2000,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11574622/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"PNAS nexus","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1093/pnasnexus/pgae498","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/11/1 0:00:00","PubModel":"eCollection","JCR":"Q2","JCRName":"MULTIDISCIPLINARY SCIENCES","Score":null,"Total":0}
Citations: 0

Abstract

Link prediction is a paradigmatic and challenging problem in network science that aims to predict missing links, future links, and temporal links from known topology. As the number of link prediction algorithms grows, a critical yet previously ignored risk is that the metrics used to evaluate algorithm performance are usually chosen at will. This paper conducts extensive experiments on hundreds of real networks and 26 well-known algorithms, revealing significant inconsistency among evaluation metrics: different metrics can produce remarkably different rankings of the same algorithms. We therefore conclude that no single metric can comprehensively or credibly evaluate algorithm performance. In terms of information content, we suggest using at least two metrics: the area under the receiver operating characteristic curve (AUC), together with one of three candidates, namely the area under the precision-recall curve, the area under the precision curve, or the normalized discounted cumulative gain. When the data are imbalanced, i.e. the number of negative samples significantly outweighs the number of positive samples, the area under the generalized receiver operating characteristic curve should also be used. In addition, as we have proved the essential equivalence of threshold-dependent metrics, if specific thresholds are meaningful in a link prediction task, any one threshold-dependent metric evaluated at those thresholds suffices. This work fills a missing part of the link prediction landscape and provides a starting point toward a well-accepted criterion or standard for selecting proper evaluation metrics for link prediction.
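The core claim, that different metrics can rank the same algorithms differently, is easy to reproduce on toy data. Below is a minimal, hypothetical sketch (not code from the paper; the labels and scores are invented) using scikit-learn: roc_auc_score for AUC, average_precision_score as the standard estimate of the area under the precision-recall curve, and ndcg_score for the normalized discounted cumulative gain. Algorithm A places one true link at the very top but buries the other two; algorithm B ranks all three true links moderately high.

```python
# Hypothetical sketch of metric inconsistency in link prediction scoring.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score, ndcg_score

# 1 = true (missing) link, 0 = non-link, for 10 candidate links (toy labels)
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])

# Algorithm A: one positive at the very top, the other two ranked low (ranks 1, 8, 9).
score_a = np.array([1.0, 0.3, 0.2, 0.9, 0.85, 0.8, 0.75, 0.7, 0.65, 0.1])
# Algorithm B: all three positives ranked moderately high (ranks 3, 4, 5).
score_b = np.array([0.8, 0.75, 0.7, 1.0, 0.9, 0.6, 0.55, 0.5, 0.45, 0.4])

for name, s in [("A", score_a), ("B", score_b)]:
    auc = roc_auc_score(y_true, s)                 # area under the ROC curve
    aupr = average_precision_score(y_true, s)      # estimates area under the PR curve
    ndcg = ndcg_score(y_true[None, :], s[None, :]) # NDCG needs 2D (n_samples, n_labels)
    print(f"algorithm {name}: AUC-ROC={auc:.3f}  AUPR={aupr:.3f}  NDCG={ndcg:.3f}")

# AUC-ROC prefers B (0.714 vs. 0.429), while AUPR prefers A (0.528 vs. 0.478):
# two metrics, two opposite rankings of the same pair of algorithms.
```

On this toy input the global-ranking metric (AUC) favors B while the top-of-list metric (AUPR) favors A, which matches the paper's point that no single metric can credibly decide between algorithms and that AUC should be paired with a precision- or ranking-oriented companion metric.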
