{"title":"Bayesian Graph Local Extrema Convolution with Long-Tail Strategy for Misinformation Detection","authors":"Guixian Zhang, Shichao Zhang, Guan Yuan","doi":"10.1145/3639408","DOIUrl":"https://doi.org/10.1145/3639408","url":null,"abstract":"<p>It has become a cardinal task to identify fake information (misinformation) on social media because it has significantly harmed the government and the public. There are many spam bots maliciously retweeting misinformation. This study proposes an efficient model for detecting misinformation with self-supervised contrastive learning. A <b>B</b>ayesian graph <b>L</b>ocal extrema <b>C</b>onvolution (BLC) is first proposed to aggregate node features in the graph structure. The BLC approach considers unreliable relationships and uncertainties in the propagation structure, and the differences between nodes and neighboring nodes are emphasized in the attributes. Then, a new long-tail strategy for matching long-tail users with the global social network is advocated to avoid over-concentration on high-degree nodes in graph neural networks. Finally, the proposed model is experimentally evaluated with two publicly Twitter datasets and demonstrates that the proposed long-tail strategy significantly improves the effectiveness of existing graph-based methods in terms of detecting misinformation. 
The robustness of BLC has also been examined on three graph datasets; it consistently outperforms traditional algorithms even when 15% of a dataset is perturbed.</p>","PeriodicalId":49249,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data","volume":"12 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139094847","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Finding Subgraphs with Maximum Total Density and Limited Overlap in Weighted Hypergraphs","authors":"Oana Balalau, Francesco Bonchi, T-H. Hubert Chan, Francesco Gullo, Mauro Sozio, Hao Xie","doi":"10.1145/3639410","DOIUrl":"https://doi.org/10.1145/3639410","url":null,"abstract":"<p>Finding dense subgraphs in large (hyper)graphs is a key primitive in a variety of real-world application domains, encompassing social network analytics, event detection, biology, and finance. In most such applications, one typically aims at finding several (possibly overlapping) dense subgraphs which might correspond to communities in social networks or interesting events. While a large amount of work is devoted to finding a single densest subgraph, perhaps surprisingly, the problem of finding several dense subgraphs in weighted hypergraphs with limited overlap has not been studied in a principled way, to the best of our knowledge. In this work we define and study a natural generalization of the densest subgraph problem in weighted hypergraphs, where the main goal is to find at most <i>k</i> subgraphs with maximum total aggregate density, while satisfying an upper bound on the pairwise weighted Jaccard coefficient, i.e., the ratio of weights of intersection divided by weights of union on two nodes sets of the subgraphs. After showing that such a problem is NP-Hard, we devise an efficient algorithm that comes with provable guarantees in some cases of interest, as well as, an efficient practical heuristic. 
Our extensive evaluation on large real-world hypergraphs confirms the efficiency and effectiveness of our algorithms.</p>","PeriodicalId":49249,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data","volume":"21 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139096492","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
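The overlap constraint in the record above is stated as the ratio of the intersection weight to the union weight of two subgraphs' node sets. A minimal sketch of one plausible reading of that coefficient (the function name and the global node-weight map are illustrative assumptions, not the paper's API):

```python
def weighted_jaccard(nodes_a, nodes_b, weight):
    """Weighted Jaccard coefficient of two node sets: total weight of the
    intersection divided by total weight of the union."""
    inter = sum(weight[n] for n in nodes_a & nodes_b)
    union = sum(weight[n] for n in nodes_a | nodes_b)
    return inter / union if union else 0.0
```

An upper bound on this coefficient, checked pairwise over the at most <i>k</i> returned subgraphs, limits how much any two of them may overlap.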
{"title":"Concept Drift Adaptation by Exploiting Drift Type","authors":"Jinpeng Li, Hang Yu, zhenyuzhang, Xiangfeng Luo, Shaorong Xie","doi":"10.1145/3638777","DOIUrl":"https://doi.org/10.1145/3638777","url":null,"abstract":"<p>Concept drift is a phenomenon where the distribution of data streams changes over time. When this happens, model predictions become less accurate. Hence, models built in the past need to be re-learned for the current data. Two design questions need to be addressed in designing a strategy to re-learn models: which type of concept drift has occurred, and how to utilize the drift type to improve re-learning performance. Existing drift detection methods are often good at determining when drift has occurred. However, few retrieve information about how the drift came to be present in the stream. Hence, determining the impact of the type of drift on adaptation is difficult. Filling this gap, we designed a framework based on a lazy strategy called Type-Driven Lazy Drift Adaptor (Type-LDA). Type-LDA first retrieves information about both how and when a drift has occurred, then it uses this information to re-learn the new model. To identify the type of drift, a drift type identifier is pre-trained on synthetic data of known drift types. Further, a drift point locator locates the optimal point of drift via a sharing loss. Hence, Type-LDA can select the optimal point, according to the drift type, to re-learn the new model. 
Experiments validate Type-LDA on both synthetic data and real-world data, and the results show that accurately identifying drift type can improve adaptation accuracy.</p>","PeriodicalId":49249,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data","volume":"23 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139373231","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ID-SR: Privacy-Preserving Social Recommendation based on Infinite Divisibility for Trustworthy AI","authors":"Jingyi Cui, Guangquan Xu, Jian Liu, Shicheng Feng, Jianli Wang, Hao Peng, Shihui Fu, Zhaohua Zheng, Xi Zheng, Shaoying Liu","doi":"10.1145/3639412","DOIUrl":"https://doi.org/10.1145/3639412","url":null,"abstract":"<p>Recommendation systems powered by AI are widely used to improve user experience. However, it inevitably raises privacy leakage and other security issues due to the utilization of extensive user data. Addressing these challenges can protect users’ personal information, benefit service providers, and foster service ecosystems. Presently, numerous techniques based on differential privacy have been proposed to solve this problem. However, existing solutions encounter issues such as inadequate data utilization and an tenuous trade-off between privacy protection and recommendation effectiveness. To enhance recommendation accuracy and protect users’ private data, we propose ID-SR, a novel privacy-preserving social recommendation scheme for trustworthy AI based on the infinite divisibility of Laplace distribution. We first introduce a novel recommendation method adopted in ID-SR, which is established based on matrix factorization with a newly designed social regularization term for improving recommendation effectiveness. Additionally, we propose a differential privacy preserving scheme tailored to the above method that leverages the Laplace distribution’s characteristics to safeguard user data. 
Theoretical analysis and experimental evaluation on two publicly available datasets demonstrate that our scheme achieves a superior balance between privacy protection and recommendation effectiveness, ultimately delivering an enhanced user experience.</p>","PeriodicalId":49249,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data","volume":"26 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139094844","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
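The infinite divisibility that the ID-SR record above relies on is a standard property of the Laplace distribution: a Laplace(0, b) variable can be written as a sum of n i.i.d. Gamma-difference components, which lets n parties each contribute one component so that the aggregate noise is exactly Laplace. A minimal sketch of that decomposition (the function name and parameters are illustrative, not ID-SR's actual noise mechanism):

```python
import random

def laplace_from_gamma(scale, n):
    """One Laplace(0, scale) draw via infinite divisibility:
    X = sum_{i=1}^{n} (G_i - G'_i), where G_i, G'_i ~ Gamma(shape=1/n, scale).
    Each of the n summands can be generated independently, e.g. by a
    different party, and the total is still Laplace-distributed."""
    return sum(
        random.gammavariate(1.0 / n, scale) - random.gammavariate(1.0 / n, scale)
        for _ in range(n)
    )
```

Sanity check: the sum of the n component variances is 2 * (1/n) * scale**2 * n = 2 * scale**2, the variance of Laplace(0, scale).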
{"title":"Utility-aware Privacy Perturbation for Training Data","authors":"Xinjiao Li, Guowei Wu, Lin Yao, Zhaolong Zheng, Shisong Geng","doi":"10.1145/3639411","DOIUrl":"https://doi.org/10.1145/3639411","url":null,"abstract":"<p>Data perturbation under differential privacy constraint is an important approach of protecting data privacy. However, as the data dimensions increase, the privacy budget allocated to each dimension decreases and thus the amount of noise added increases, which eventually leads to lower data utility in training tasks. To protect the privacy of training data while enhancing data utility, we propose an Utility-aware training data Privacy Perturbation scheme based on attribute Partition and budget Allocation (UPPPA). UPPPA includes three procedures, the quantification of attribute privacy and attribute importance, attribute partition, and budget allocation. The quantification of attribute privacy and attribute importance based on information entropy and attribute correlation provide an arithmetic basis for attribute partition and budget allocation. During the attribute partition, all attributes of training data are classified into high and low classes to achieve privacy amplification and utility enhancement. During the budget allocation, a <i>γ</i>-privacy model is proposed to balance data privacy and data utility so as to provide privacy constraint and guide budget allocation. Three comprehensive sets of real-world data are applied to evaluate the performance of UPPPA. 
Experiments and privacy analysis show that our scheme achieves a favorable trade-off between privacy and utility.</p>","PeriodicalId":49249,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data","volume":"2 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139096533","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
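The UPPPA record above describes its attribute-privacy and attribute-importance quantification as based on information entropy. As a minimal illustration of the entropy half of that basis (a generic Shannon-entropy helper with an assumed name, not UPPPA's exact measure, which also incorporates attribute correlation):

```python
import math
from collections import Counter

def attribute_entropy(column):
    """Shannon entropy (in bits) of one attribute's value distribution;
    higher entropy means the attribute's values are harder to predict,
    a common proxy for how much an attribute reveals."""
    total = len(column)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(column).values()
    )
```

A uniform attribute over four values yields 2 bits; a constant attribute yields 0.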
{"title":"A Semantics-enhanced Topic Modelling Technique: Semantic-LDA","authors":"Dakshi Kapugama Geeganage, Yue Xu, Yuefeng Li","doi":"10.1145/3639409","DOIUrl":"https://doi.org/10.1145/3639409","url":null,"abstract":"<p>Topic modelling is a beneficial technique used to discover latent topics in text collections. But to correctly understand the text content and generate a meaningful topic list, semantics are important. By ignoring semantics, that is, not attempting to grasp the meaning of the words, most of the existing topic modelling approaches can generate some meaningless topic words. Even existing semantic-based approaches usually interpret the meanings of words without considering the context and related words. In this paper, we introduce a semantic-based topic model called semantic-LDA which captures the semantics of words in a text collection using concepts from an external ontology. A new method is introduced to identify and quantify the concept–word relationships based on matching words from the input text collection with concepts from an ontology without using pre-calculated values from the ontology that quantify the relationships between the words and concepts. These pre-calculated values may not reflect the actual relationships between words and concepts for the input collection because they are derived from datasets used to build the ontology rather than from the input collection itself. Instead, quantifying the relationship based on the word distribution in the input collection is more realistic and beneficial in the semantic capture process. Furthermore, an ambiguity handling mechanism is introduced to interpret the unmatched words, that is, words for which there are no matching concepts in the ontology. Thus, this paper makes a significant contribution by introducing a semantic-based topic model which calculates the word–concept relationships directly from the input text collection. 
The proposed semantic-based topic model and an enhanced version with the disambiguation mechanism were evaluated against a set of state-of-the-art systems, and our approaches outperformed the baseline systems in both topic quality and information filtering evaluations.</p>","PeriodicalId":49249,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data","volume":"11 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139376284","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multiple-Instance Learning from Triplet Comparison Bags","authors":"Senlin Shu, Deng-Bao Wang, Suqin Yuan, Hongxin Wei, Jiuchuan Jiang, Lei Feng, Min-Ling Zhang","doi":"10.1145/3638776","DOIUrl":"https://doi.org/10.1145/3638776","url":null,"abstract":"<p><i>Multiple-instance learning</i> (MIL) solves the problem where training instances are grouped in bags, and a binary (positive or negative) label is provided for each bag. Most of the existing MIL studies need fully labeled bags for training an effective classifier, while it could be quite hard to collect such data in many real-world scenarios, due to the high cost of data labeling process. Fortunately, unlike fully labeled data, <i>triplet comparison data</i> can be collected in a more accurate and human-friendly way. Therefore, in this paper, we for the first time investigate MIL from <i>only triplet comparison bags</i>, where a triplet (<i>X<sub>a</sub></i>, <i>X<sub>b</sub></i>, <i>X<sub>c</sub></i>) contains the weak supervision information that bag <i>X<sub>a</sub></i> is more similar to <i>X<sub>b</sub></i> than to <i>X<sub>c</sub></i>. To solve this problem, we propose to train a bag-level classifier by the <i>empirical risk minimization</i> framework and theoretically provide a generalization error bound. We also show that a convex formulation can be obtained only when specific convex binary losses such as the square loss and the double hinge loss are used. 
Extensive experiments validate that our proposed method significantly outperforms other baselines.</p>","PeriodicalId":49249,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data","volume":"30 8","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139094801","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HITS based Propagation Paradigm for Graph Neural Networks","authors":"Mehak Khan, Gustavo B. M. Mello, Laurence Habib, Paal Engelstad, Anis Yazidi","doi":"10.1145/3638779","DOIUrl":"https://doi.org/10.1145/3638779","url":null,"abstract":"<p>In this paper, we present a new propagation paradigm based on the principle of Hyperlink-Induced Topic Search (HITS) algorithm. The HITS algorithm utilizes the concept of a ”self-reinforcing” relationship of authority-hub. Using HITS, the centrality of nodes is determined via repeated updates of authority-hub scores that converge to a stationary distribution. Unlike PageRank-based propagation methods, which rely solely on the idea of authorities (in-links), HITS considers the relevance of both authorities (in-links) and hubs (out-links), thereby allowing for a more informative graph learning process. To segregate node prediction and propagation, we use a Multilayer Perceptron (MLP) in combination with a HITS-based propagation approach and propose two models; HITS-GNN and HITS-GNN+. We provided additional validation of our models’ efficacy by performing an ablation study to assess the performance of authority-hub in independent models. Moreover, the effect of the main hyper-parameters and normalization is also analyzed to uncover how these techniques influence the performance of our models. 
Extensive experimental results indicate that the proposed approach improves on baseline methods by a decent margin on graph (citation network) benchmark datasets for semi-supervised node classification, which can aid in predicting the categories (labels) of scientific articles based not only on their content but also on the articles they cite.</p>","PeriodicalId":49249,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data","volume":"33 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2023-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139068613","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
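The authority-hub updates that the HITS-GNN record above builds on are the classic HITS power iteration: authority scores are refreshed from the hub scores of in-linking nodes, hub scores from the authority scores of out-linked nodes, and both are normalized until they converge. A minimal sketch of that iteration (the plain-dict graph representation and function name are illustrative; the paper couples these updates with an MLP rather than running them standalone):

```python
def hits(adj, iters=50):
    """Kleinberg-style HITS authority-hub iteration on a directed graph.
    adj: dict mapping each node to a list of its successors (out-links).
    Returns (authority, hub) score dicts, each L2-normalized."""
    nodes = set(adj) | {v for outs in adj.values() for v in outs}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iters):
        # authority of v: sum of hub scores of nodes linking to v
        auth = {n: 0.0 for n in nodes}
        for u, outs in adj.items():
            for v in outs:
                auth[v] += hub[u]
        norm = sum(x * x for x in auth.values()) ** 0.5 or 1.0
        auth = {n: x / norm for n, x in auth.items()}
        # hub of u: sum of authority scores of nodes u links to
        hub = {u: sum(auth[v] for v in adj.get(u, [])) for u in nodes}
        norm = sum(x * x for x in hub.values()) ** 0.5 or 1.0
        hub = {n: x / norm for n, x in hub.items()}
    return auth, hub
```

On a citation graph, a node cited by many good hubs gets a high authority score, and a node citing many good authorities gets a high hub score, which is the "self-reinforcing" relationship the abstract refers to.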
{"title":"X-distribution: Retraceable Power-Law Exponent of Complex Networks","authors":"Pradumn Kumar Pandey, Aikta Arya, Akrati Saxena","doi":"10.1145/3639413","DOIUrl":"https://doi.org/10.1145/3639413","url":null,"abstract":"<p>Network modeling has been explored extensively by means of theoretical analysis as well as numerical simulations for Network Reconstruction (NR). The network reconstruction problem requires the estimation of the power-law exponent (<i>γ</i>) of a given input network. Thus, the effectiveness of the NR solution depends on the accuracy of the calculation of <i>γ</i>. In this article, we re-examine the degree distribution-based estimation of <i>γ</i>, which is not very accurate due to approximations. We propose <b>X</b>-distribution, which is more accurate as compared to degree distribution. Various state-of-the-art network models, including CPM, NRM, RefOrCite2, BA, CDPAM, and DMS, are considered for simulation purposes, and simulated results support the proposed claim. Further, we apply <b>X</b>-distribution over several real-world networks to calculate their power-law exponents, which differ from those calculated using respective degree distributions. It is observed that <b>X</b>-distributions exhibit more linearity (straight line) on the log-log scale as compared to degree distributions. Thus, <b>X</b>-distribution is more suitable for the evaluation of power-law exponent using linear fitting (on the log-log scale). 
The MATLAB implementation of the power-law exponent (<i>γ</i>) calculation using the <b>X</b>-distribution for different network models, together with the real-world datasets used in our experiments, is available here: https://github.com/Aikta-Arya/X-distribution-Retraceable-Power-Law-Exponent-of-Complex-Networks.git</p>","PeriodicalId":49249,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data","volume":"6 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2023-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139068412","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
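The linear fitting on the log-log scale recommended in the record above can be sketched as an ordinary least-squares fit of log-frequency against log-value, with the power-law exponent recovered as the negated slope (a generic sketch in Python rather than the authors' MATLAB; the function name and inputs are illustrative, applicable to a degree distribution or an X-distribution alike):

```python
import math

def power_law_exponent(values, counts):
    """Estimate the power-law exponent gamma by a least-squares linear fit
    on the log-log scale: log p(k) ~ c - gamma * log k, so gamma = -slope."""
    pts = [(math.log(v), math.log(c))
           for v, c in zip(values, counts) if v > 0 and c > 0]
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    slope = (sum((x - mx) * (y - my) for x, y in pts)
             / sum((x - mx) ** 2 for x, _ in pts))
    return -slope
```

The straighter the distribution is on the log-log scale, the better this fit recovers <i>γ</i>, which is the abstract's argument for preferring the more linear X-distribution.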
{"title":"Diverse Structure-aware Relation Representation in Cross-Lingual Entity Alignment","authors":"Yuhong Zhang, Jianqing Wu, Kui Yu, Xindong Wu","doi":"10.1145/3638778","DOIUrl":"https://doi.org/10.1145/3638778","url":null,"abstract":"<p>Cross-lingual entity alignment (CLEA) aims to find equivalent entity pairs between knowledge graphs (KG) in different languages. It is an important way to connect heterogeneous KGs and facilitate knowledge completion. Existing methods have found that incorporating relations into entities can effectively improve KG representation and benefit entity alignment, and these methods learn relation representation depending on entities, which cannot capture the diverse structures of relations. However, multiple relations in KG form diverse structures, such as adjacency structure and ring structure. This diversity of relation structures makes the relation representation challenging. Therefore, we propose to construct the weighted line graphs to model the diverse structures of relations and learn relation representation independently from entities. Especially, owing to the diversity of adjacency structures and ring structures, we propose to construct adjacency line graph and ring line graph respectively to model the structures of relations and to further improve entity representation. In addition, to alleviate the hubness problem in alignment, we introduce the optimal transport into alignment and compute the distance matrix in a different way. From a global perspective, we calculate the optimal 1-to-1 alignment bi-directionally to improve the alignment accuracy. 
Experimental results on two benchmark datasets show that our proposed method significantly outperforms state-of-the-art CLEA methods in both supervised and unsupervised settings.</p>","PeriodicalId":49249,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data","volume":"1 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2023-12-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139068366","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}