Title: Random walks with variable restarts for negative-example-informed label propagation
Authors: Sean Maxwell, Mehmet Koyutürk
Journal: Data Mining and Knowledge Discovery, 41(1)
DOI: 10.1007/s10618-024-01065-4 (https://doi.org/10.1007/s10618-024-01065-4)
Published: 2024-08-13 (Journal Article)
Impact factor: 2.8, JCR Q2 (Computer Science, Artificial Intelligence)
Citations: 0
Abstract
Label propagation is frequently encountered in machine learning and data mining applications on graphs, either as a standalone problem or as part of node classification. Many label propagation algorithms utilize random walks (or network propagation), which provide limited ability to take into account negatively-labeled nodes (i.e., nodes that are known not to be associated with the label of interest). Specialized algorithms that incorporate negatively-labeled nodes generally focus on learning or readjusting edge weights to drive walks away from negatively-labeled nodes and toward positively-labeled nodes. This approach has several disadvantages: it increases the number of parameters to be learned, and it does not necessarily drive the walk away from regions of the network that are rich in negatively-labeled nodes. We reformulate random walk with restarts and network propagation to enable "variable restarts", that is, an increased likelihood of restarting at a positively-labeled node when a negatively-labeled node is encountered. Based on this reformulation, we develop CusTaRd, an algorithm that effectively combines variable restart probabilities and edge re-weighting to avoid negatively-labeled nodes. To assess the performance of CusTaRd, we perform comprehensive experiments on network datasets commonly used in benchmarking label propagation and node classification algorithms. Our results show that CusTaRd consistently outperforms competing algorithms that learn edge weights or restart profiles, and that negatives close to positive examples are generally more informative than more distant negatives.
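The "variable restart" idea can be illustrated with a minimal Monte-Carlo sketch. This is not the paper's CusTaRd algorithm (which also incorporates edge re-weighting and a principled reformulation of network propagation); it is a hypothetical simulation, with parameter names of my choosing, showing how encountering a negatively-labeled node can boost the probability of restarting at a positively-labeled seed:

```python
import random
from collections import Counter

def rwr_variable_restart(adj, positives, negatives, alpha=0.15,
                         neg_restart=0.9, n_steps=50000, seed=0):
    """Monte-Carlo random walk with restarts in which landing on a
    negatively-labeled node raises the restart probability, pushing
    the walk back toward positively-labeled seeds.

    adj        : dict mapping each node to a list of its neighbors
    positives  : nodes known to carry the label of interest (restart targets)
    negatives  : nodes known NOT to carry the label
    alpha      : baseline restart probability (standard RWR parameter)
    neg_restart: boosted restart probability used on negative nodes
    """
    rng = random.Random(seed)
    positives = list(positives)
    visits = Counter()
    node = rng.choice(positives)          # walks start from a positive seed
    for _ in range(n_steps):
        # restart probability is boosted on negatively-labeled nodes
        restart_p = neg_restart if node in negatives else alpha
        if rng.random() < restart_p:
            node = rng.choice(positives)  # restart at a positive example
        else:
            node = rng.choice(adj[node])  # otherwise take a uniform step
        visits[node] += 1
    total = sum(visits.values())
    return {v: c / total for v, c in visits.items()}
```

On a small path graph with a positive seed at one end and a negative node at the other, the visit distribution concentrates near the positive side, since steps toward the negative node are likely to be cut short by a forced restart.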
Journal description:
Advances in data gathering, storage, and distribution have created a need for computational tools and techniques to aid in data analysis. Data Mining and Knowledge Discovery in Databases (KDD) is a rapidly growing area of research and application that builds on techniques and theories from many fields, including statistics, databases, pattern recognition and learning, data visualization, uncertainty modelling, data warehousing and OLAP, optimization, and high performance computing.