Random walks with variable restarts for negative-example-informed label propagation

IF 4.3 | Zone 3, Computer Science | Q2, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Data Mining and Knowledge Discovery Pub Date : 2024-08-13 DOI:10.1007/s10618-024-01065-4

Sean Maxwell, Mehmet Koyutürk



Abstract

Label propagation is frequently encountered in machine learning and data mining applications on graphs, either as a standalone problem or as part of node classification. Many label propagation algorithms utilize random walks (or network propagation), which provide limited ability to take into account negatively-labeled nodes (i.e., nodes that are known not to be associated with the label of interest). Specialized algorithms that incorporate negatively-labeled nodes generally focus on learning or readjusting the edge weights to drive walks away from negatively-labeled nodes and toward positively-labeled nodes. This approach has several disadvantages: it increases the number of parameters to be learned, and it does not necessarily drive the walk away from regions of the network that are rich in negatively-labeled nodes. We reformulate random walk with restarts and network propagation to enable "variable restarts", that is, an increased likelihood of restarting at a positively-labeled node when a negatively-labeled node is encountered. Based on this reformulation, we develop CusTaRd, an algorithm that effectively combines variable restart probabilities and edge re-weighting to avoid negatively-labeled nodes. To assess the performance of CusTaRd, we perform comprehensive experiments on network datasets commonly used in benchmarking label propagation and node classification algorithms. Our results show that CusTaRd consistently outperforms competing algorithms that learn edge weights or restart profiles, and that negatives close to positive examples are generally more informative than more distant negatives.
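To make the idea of "variable restarts" concrete, the following is a minimal sketch of a random walk with restarts in which the restart probability depends on the node the walker currently occupies. It is not CusTaRd as published: the abstract does not give the update rule, so the function name `rwr_variable_restart`, the parameters `alpha` and `alpha_neg`, and the choice of a single fixed higher restart probability at negative nodes are all assumptions made for illustration. Edge re-weighting is omitted, and the graph is assumed to have no isolated nodes.

```python
import numpy as np


def rwr_variable_restart(adj, positives, negatives,
                         alpha=0.15, alpha_neg=0.9,
                         tol=1e-10, max_iter=1000):
    """Illustrative random walk with node-dependent restart probability.

    Hypothetical sketch, not the authors' algorithm. From node v the walker
    restarts at a positive node with probability a[v], otherwise it follows
    an edge chosen in proportion to edge weight. a[v] is alpha everywhere
    except at negative nodes, where it is alpha_neg (assumed larger).
    """
    adj = np.asarray(adj, dtype=float)
    n = adj.shape[0]

    # Column-stochastic transition matrix: P[i, j] = Pr(move j -> i).
    col_sums = adj.sum(axis=0)
    P = adj / np.where(col_sums == 0, 1.0, col_sums)

    # Restart distribution: uniform over positively-labeled nodes.
    r = np.zeros(n)
    r[np.asarray(positives, dtype=int)] = 1.0 / len(positives)

    # Per-node restart probability, raised at negatively-labeled nodes.
    a = np.full(n, alpha)
    a[np.asarray(negatives, dtype=int)] = alpha_neg

    p = r.copy()
    for _ in range(max_iter):
        # Mass that keeps walking, plus all restarting mass sent to r.
        p_next = P @ ((1.0 - a) * p) + (a @ p) * r
        delta = np.abs(p_next - p).sum()
        p = p_next
        if delta < tol:
            break
    return p


# Path graph 0 - 1 - 2 - 3 - 4 with node 2 positive and node 3 negative.
path = np.zeros((5, 5))
for u in range(4):
    path[u, u + 1] = path[u + 1, u] = 1.0

uniform = rwr_variable_restart(path, positives=[2], negatives=[])
variable = rwr_variable_restart(path, positives=[2], negatives=[3])
print("uniform restarts: ", np.round(uniform, 4))
print("variable restarts:", np.round(variable, 4))
```

With uniform restarts the scores on this path are symmetric around the positive node. Raising the restart probability at node 3 lowers the score of node 3 and of node 4 behind it, which illustrates how a restart-based formulation can steer the walk away from a region that can only be reached through a negative node.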

