Data Mining and Knowledge Discovery: Latest Articles, Page 4

Uplift modeling with quasi-loss-functions

IF 4.8, Zone 3, Computer Science

Data Mining and Knowledge Discovery Pub Date : 2024-06-04 DOI: 10.1007/s10618-024-01042-x

Jinping Hu, Evert de Haan, Bernd Skiera

Citations: 0

Modeling the impact of out-of-schema questions in task-oriented dialog systems

IF 4.8, Zone 3, Computer Science

Data Mining and Knowledge Discovery Pub Date : 2024-06-04 DOI: 10.1007/s10618-024-01039-6

Jannat Ara Meem, Muhammad Shihab Rashid, Vagelis Hristidis

Abstract: Existing work on task-oriented dialog systems generally assumes that the interaction of users with the system is restricted to the information stored in a closed data schema. However, in practice users may ask ‘out-of-schema’ questions, that is, questions that the system cannot answer, because the information does not exist in the schema. Failure to answer these questions may lead the users to drop out of the chat before reaching the success state (e.g. reserving a restaurant). A key challenge is that the number of these questions may be too high for a domain expert to answer them all. We formulate the problem of out-of-schema question detection and selection that identifies the most critical out-of-schema questions to answer, in order to maximize the expected success rate of the system. We propose a two-stage pipeline to solve the problem. In the first stage, we propose a novel in-context learning (ICL) approach to detect out-of-schema questions. In the second stage, we propose two algorithms for out-of-schema question selection (OQS): a naive approach that chooses a question based on its frequency in the dropped-out conversations, and a probabilistic approach that represents each conversation as a Markov chain and picks a question based on its overall benefit. We propose and publish two new datasets for the problem, as existing datasets do not contain out-of-schema questions or user drop-outs. Our quantitative and simulation-based experimental analyses on these datasets measure how our methods can effectively identify out-of-schema questions and positively impact the success rate of the system.
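
The naive selection strategy mentioned in the abstract above (choosing questions by their frequency in dropped-out conversations) can be illustrated with a minimal sketch. The conversation format, the field names `oos_questions` and `dropped_out`, the `budget` parameter, and the tie-breaking rule are all assumptions made for illustration; they are not taken from the paper.

```python
from collections import Counter


def select_questions_by_dropout_frequency(conversations, budget):
    """Rank out-of-schema questions by the number of dropped-out
    conversations in which they appear, and return the top `budget`.

    The input format is hypothetical: each conversation is a dict with
    a list of question strings and a boolean drop-out flag.
    """
    counts = Counter()
    for conv in conversations:
        if conv["dropped_out"]:
            # Count each question at most once per conversation.
            counts.update(set(conv["oos_questions"]))
    # Descending count, then alphabetical order for deterministic ties.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [question for question, _ in ranked[:budget]]


conversations = [
    {"oos_questions": ["parking?", "vegan menu?"], "dropped_out": True},
    {"oos_questions": ["parking?"], "dropped_out": True},
    {"oos_questions": ["vegan menu?"], "dropped_out": False},
    {"oos_questions": ["dress code?"], "dropped_out": True},
]
top = select_questions_by_dropout_frequency(conversations, budget=2)
print(top)
```

A frequency count like this ignores where in the dialog the question occurs, which is presumably what the Markov-chain variant described in the abstract is meant to address.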

Citations: 0

Improving graph-based recommendation with unraveled graph learning

IF 4.8, Zone 3, Computer Science

Data Mining and Knowledge Discovery Pub Date : 2024-06-02 DOI: 10.1007/s10618-024-01038-7

Chih-Chieh Chang, Diing-Ruey Tzeng, Chia-Hsun Lu, Ming-Yi Chang, Chih-Ya Shen

Abstract: Graph Collaborative Filtering (GraphCF) has emerged as a promising approach in recommendation systems, leveraging the inferential power of Graph Neural Networks. Furthermore, the integration of contrastive learning has enhanced the performance of GraphCF methods. Recent research has shifted from graph augmentation to noise perturbation in contrastive learning, leading to significant performance improvements. However, we contend that the primary factor in performance enhancement is not graph augmentation or noise perturbation, but rather the balance of the embedding from each layer in the output embedding. To substantiate our claim, we conducted preliminary experiments with multiple state-of-the-art GraphCF methods. Based on our observations and insights, we propose a novel approach named Unraveled Graph Contrastive Learning (UGCL), which includes a new propagation scheme to further enhance performance. To the best of our knowledge, this is the first approach that specifically addresses the balance factor in the output embedding for performance improvement. We have carried out extensive experiments on multiple large-scale benchmark datasets to evaluate the effectiveness of our proposed approach. The results indicate that UGCL significantly outperforms all other state-of-the-art baseline models, also showing superior performance in terms of fairness and debiasing capabilities compared to other baselines.
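
To make the notion of "balance of the embedding from each layer in the output embedding" concrete, here is a small sketch of a generic layer-weighted output embedding, as commonly used in graph collaborative filtering. It is not UGCL's propagation scheme, which the abstract does not describe. The mean-over-neighbours propagation, the function names, and the `weights` parameter are illustrative assumptions.

```python
def propagate(adj, emb):
    """One propagation step: each node takes the mean of its neighbours'
    embeddings (a simplification of normalised graph convolution)."""
    dim = len(emb[0])
    out = []
    for neighbours in adj:
        if not neighbours:
            out.append([0.0] * dim)
            continue
        out.append([
            sum(emb[n][d] for n in neighbours) / len(neighbours)
            for d in range(dim)
        ])
    return out


def combine_layers(adj, emb0, weights):
    """Output embedding as a weighted sum of per-layer embeddings.
    `weights[l]` controls how much layer l contributes (the 'balance')."""
    layers = [emb0]
    for _ in range(len(weights) - 1):
        layers.append(propagate(adj, layers[-1]))
    n, dim = len(emb0), len(emb0[0])
    return [
        [sum(w * layer[i][d] for w, layer in zip(weights, layers))
         for d in range(dim)]
        for i in range(n)
    ]


# Two nodes connected to each other, with 2-dimensional embeddings.
adj = [[1], [0]]
emb0 = [[1.0, 0.0], [0.0, 1.0]]
balanced = combine_layers(adj, emb0, weights=[0.5, 0.5])
layer0_only = combine_layers(adj, emb0, weights=[1.0, 0.0])
print(balanced, layer0_only)
```

Changing `weights` shifts the output between the raw embeddings and their propagated versions, which is the kind of balance the abstract argues drives performance.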

Citations: 0

A practical approach to novel class discovery in tabular data

IF 4.8, Zone 3, Computer Science

Data Mining and Knowledge Discovery Pub Date : 2024-05-31 DOI: 10.1007/s10618-024-01025-y

Troisemaine Colin, Reiffers-Masson Alexandre, Gosselin Stéphane, Lemaire Vincent, Vaton Sandrine

Abstract: The problem of novel class discovery (NCD) consists in extracting knowledge from a labeled set of known classes to accurately partition an unlabeled set of novel classes. While NCD has recently received a lot of attention from the community, it is often solved on computer vision problems and under unrealistic conditions. In particular, the number of novel classes is usually assumed to be known in advance, and their labels are sometimes used to tune hyperparameters. Methods that rely on these assumptions are not applicable in real-world scenarios. In this work, we focus on solving NCD in tabular data when no prior knowledge of the novel classes is available. To this end, we propose to tune the hyperparameters of NCD methods by adapting the k-fold cross-validation process and hiding some of the known classes in each fold. Since we have found that methods with too many hyperparameters are likely to overfit these hidden classes, we define a simple deep NCD model. This method is composed of only the essential elements necessary for the NCD problem and shows robust performance under realistic conditions. Furthermore, we find that the latent space of this method can be used to reliably estimate the number of novel classes. Additionally, we adapt two unsupervised clustering algorithms (k-means and Spectral Clustering) to leverage the knowledge of the known classes. Extensive experiments are conducted on 7 tabular datasets and demonstrate the effectiveness of the proposed method and hyperparameter tuning process, and show that the NCD problem can be solved without relying on knowledge from the novel classes.
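
The idea of adapting k-fold cross-validation by hiding some known classes in each fold can be sketched as follows. The abstract does not say how classes are assigned to folds, so the round-robin assignment after a seeded shuffle, the function names, and the return format are assumptions for illustration only.

```python
import random


def hidden_class_folds(known_classes, k, seed=0):
    """Yield (visible_classes, hidden_classes) pairs, one per fold.
    Each known class is hidden in exactly one fold, where it plays the
    role of a 'novel' class whose labels are withheld during tuning."""
    classes = sorted(known_classes)
    if k > len(classes):
        raise ValueError("k must not exceed the number of known classes")
    rng = random.Random(seed)
    rng.shuffle(classes)
    folds = [classes[i::k] for i in range(k)]
    for hidden in folds:
        visible = [c for c in classes if c not in hidden]
        yield sorted(visible), sorted(hidden)


def split_samples(labels, hidden):
    """Split sample indices into a labeled part (visible classes) and a
    pseudo-novel part (hidden classes, to be treated as unlabeled)."""
    hidden = set(hidden)
    labeled_idx = [i for i, y in enumerate(labels) if y not in hidden]
    pseudo_novel_idx = [i for i, y in enumerate(labels) if y in hidden]
    return labeled_idx, pseudo_novel_idx


folds = list(hidden_class_folds(range(6), k=3))
for visible, hidden in folds:
    print("visible:", visible, "hidden:", hidden)
```

A hyperparameter setting would then be scored by how well the NCD method clusters the pseudo-novel samples in each fold, using their true labels for evaluation only.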

Citations: 0

Bias-aware ranking from pairwise comparisons

IF 4.8, Zone 3, Computer Science

Data Mining and Knowledge Discovery Pub Date : 2024-05-31 DOI: 10.1007/s10618-024-01024-z

Antonio Ferrara, Francesco Bonchi, Francesco Fabbri, Fariba Karimi, Claudia Wagner

Abstract: Human feedback is often used, either directly or indirectly, as input to algorithmic decision making. However, humans are biased: if the algorithm that takes as input the human feedback does not control for potential biases, this might result in biased algorithmic decision making, which can have a tangible impact on people’s lives. In this paper, we study how to detect and correct for evaluators’ bias in the task of ranking people (or items) from pairwise comparisons. Specifically, we assume we are given pairwise comparisons of the items to be ranked produced by a set of evaluators. While the pairwise assessments of the evaluators should reflect to a certain extent the latent (unobservable) true quality scores of the items, they might be affected by each evaluator’s own bias against, or in favor of, some groups of items. By detecting and amending evaluators’ biases, we aim to produce a ranking of the items that is, as much as possible, in accordance with the ranking one would produce by having access to the latent quality scores. Our proposal is a novel method that extends the classic Bradley-Terry model by having a bias parameter for each evaluator which distorts the true quality score of each item, depending on the group the item belongs to. Thanks to the simplicity of the model, we are able to write explicitly its log-likelihood w.r.t. the parameters (i.e., items’ latent scores and evaluators’ bias) and optimize by means of the alternating approach. Our experiments on synthetic and real-world data confirm that our method is able to reconstruct the bias of each single evaluator extremely well and thus to outperform several non-trivial competitors in the task of producing a ranking which is as close as possible to the unbiased ranking.
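
As an illustration of how a Bradley-Terry model can be extended with a per-evaluator bias, the sketch below computes the log-likelihood of a set of comparisons. The abstract does not give the exact form of the distortion. This sketch assumes an additive one, where the evaluator's bias is added to the score of items in group 1 and comparisons follow the logistic form of Bradley-Terry. All names are hypothetical.

```python
import math


def win_probability(s_i, s_j, bias, g_i, g_j):
    """Probability that item i beats item j according to one evaluator.
    Assumed perceived score: latent score + bias * group indicator,
    where the group indicator is 0 or 1."""
    diff = (s_i + bias * g_i) - (s_j + bias * g_j)
    return 1.0 / (1.0 + math.exp(-diff))


def log_likelihood(comparisons, scores, biases, groups):
    """Sum of log-probabilities of the observed outcomes.
    `comparisons` holds (evaluator, winner, loser) triples."""
    total = 0.0
    for evaluator, winner, loser in comparisons:
        p = win_probability(
            scores[winner], scores[loser],
            biases[evaluator], groups[winner], groups[loser],
        )
        total += math.log(p)
    return total


scores = {"a": 0.0, "b": 0.0}   # equal latent quality
groups = {"a": 0, "b": 1}       # item b belongs to group 1
comparisons = [("e1", "a", "b"), ("e2", "b", "a")]
unbiased = log_likelihood(comparisons, scores, {"e1": 0.0, "e2": 0.0}, groups)
biased = log_likelihood(comparisons, scores, {"e1": -1.0, "e2": 1.0}, groups)
print(unbiased, biased)
```

In this toy data the two evaluators disagree about items of equal quality, so evaluator-specific biases explain the outcomes better than zero bias. An alternating optimisation would update scores with biases fixed, then biases with scores fixed.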

Citations: 0

LoCoMotif: discovering time-warped motifs in time series

IF 4.8, Zone 3, Computer Science

Data Mining and Knowledge Discovery Pub Date : 2024-05-30 DOI: 10.1007/s10618-024-01032-z

Daan Van Wesenbeeck, Aras Yurtman, Wannes Meert, Hendrik Blockeel

Citations: 0

On the impact of multi-dimensional local differential privacy on fairness

IF 4.8, Zone 3, Computer Science

Data Mining and Knowledge Discovery Pub Date : 2024-05-27 DOI: 10.1007/s10618-024-01031-0

Karima Makhlouf, Héber H. Arcolezi, Sami Zhioua, Ghassen Ben Brahim, Catuscia Palamidessi

Abstract: Automated decision systems are increasingly used to make consequential decisions in people’s lives. Due to the sensitivity of the manipulated data and the resulting decisions, several ethical concerns need to be addressed for the appropriate use of such technologies, particularly fairness and privacy. Unlike previous work, which focused on centralized differential privacy (DP) or on local DP (LDP) for a single sensitive attribute, in this paper, we examine the impact of LDP in the presence of several sensitive attributes (i.e., multi-dimensional data) on fairness. Detailed empirical analysis on synthetic and benchmark datasets revealed very relevant observations. In particular, (1) multi-dimensional LDP is an efficient approach to reduce disparity, (2) the variant of the multi-dimensional approach of LDP (we employ two variants) matters only at low privacy guarantees (high ε), and (3) the true decision distribution has an important effect on which group is more sensitive to the obfuscation. Last, we summarize our findings in the form of recommendations to guide practitioners in adopting effective privacy-preserving practices while maintaining fairness and utility in machine learning applications.
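
For readers unfamiliar with LDP on several attributes, the following sketch shows one standard construction: generalized randomized response applied to each attribute, with the privacy budget ε split evenly across attributes. The abstract does not name the two variants the paper employs, so this budget-splitting scheme and all function and parameter names are assumptions, shown only to make the setting concrete.

```python
import math
import random


def grr_keep_probability(epsilon, k):
    """Probability of reporting the true value under generalized
    randomized response over a domain of size k."""
    return math.exp(epsilon) / (math.exp(epsilon) + k - 1)


def grr(value, domain, epsilon, rng):
    """Report the true value with the keep probability, otherwise a
    uniformly chosen different value from the domain."""
    if rng.random() < grr_keep_probability(epsilon, len(domain)):
        return value
    others = [v for v in domain if v != value]
    return rng.choice(others)


def perturb_record(record, domains, epsilon, rng):
    """Obfuscate every sensitive attribute, giving each an equal share
    of the total budget (assumed splitting strategy)."""
    per_attribute = epsilon / len(record)
    return {
        attr: grr(value, domains[attr], per_attribute, rng)
        for attr, value in record.items()
    }


domains = {"gender": ["F", "M"], "age_band": ["<30", "30-60", ">60"]}
record = {"gender": "F", "age_band": "30-60"}
noisy = perturb_record(record, domains, epsilon=2.0, rng=random.Random(0))
print(noisy)
```

A higher ε raises the keep probability, so reports are closer to the truth and the privacy guarantee is weaker, which matches the abstract's reading of "low privacy guarantees (high ε)".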

Citations: 0

Effective interpretable learning for large-scale categorical data

IF 4.8, Zone 3, Computer Science

Data Mining and Knowledge Discovery Pub Date : 2024-05-27 DOI: 10.1007/s10618-024-01030-1

Yishuo Zhang, Nayyar Zaidi, Jiahui Zhou, Tao Wang, Gang Li

Abstract: Large-scale categorical datasets are ubiquitous in machine learning, and the success of most deployed machine learning models relies on how effectively the features are engineered. For large-scale datasets, parametric methods are generally used, among which three strategies for feature engineering are quite common. The first strategy focuses on managing the breadth (or width) of a network, e.g., generalized linear models (a.k.a. wide learning). The second strategy focuses on the depth of a network, e.g., artificial neural networks or ANNs (a.k.a. deep learning). The third strategy relies on factorizing the interaction terms, e.g., Factorization Machines (a.k.a. factorized learning). Each of these strategies brings its own advantages and disadvantages. Recently, it has been shown that for categorical data, a combination of various strategies leads to excellent results. For example, WD-Learning, xDeepFM, etc., lead to state-of-the-art results. Following the trend, in this work, we have proposed another learning framework, WBDF-Learning, based on the combination of wide, deep, factorization, and a newly introduced component named Broad Interaction Network (BIN). BIN is in the form of a Bayesian network classifier whose structure is learned a priori, and whose parameters are learned by optimizing a joint objective function along with the wide, deep and factorized parts. We denote the learning of BIN parameters as broad learning. Additionally, the parameters of BIN are constrained to be actual probabilities; therefore, it is extremely interpretable. Furthermore, one can sample or generate data from BIN, which can facilitate learning and provides a framework for knowledge-guided machine learning. We demonstrate that our proposed framework possesses the resilience to maintain excellent classification performance when confronted with biased datasets. We evaluate the efficacy of our framework in terms of classification performance on various benchmark large-scale categorical datasets and compare against state-of-the-art methods. It is shown that the WBDF framework (a) exhibits superior performance on classification tasks, (b) boasts outstanding interpretability and (c) demonstrates exceptional resilience and effectiveness in scenarios involving skewed distributions.

Citations: 0

WaveLSea: helping experts interactively explore pattern mining search spaces

IF 4.8, Zone 3, Computer Science

Data Mining and Knowledge Discovery Pub Date : 2024-05-26 DOI: 10.1007/s10618-024-01037-8

Etienne Lehembre, Bruno Cremilleux, Albrecht Zimmermann, Bertrand Cuissart, Abdelkader Ouali

Citations: 0

Active learning with biased non-response to label requests

IF 4.8, Zone 3, Computer Science

Data Mining and Knowledge Discovery Pub Date : 2024-05-25 DOI: 10.1007/s10618-024-01026-x

Thomas S. Robinson, Niek Tax, Richard Mudd, Ido Guy

Abstract: Active learning can improve the efficiency of training prediction models by identifying the most informative new labels to acquire. However, non-response to label requests can impact active learning’s effectiveness in real-world contexts. We conceptualise this degradation by considering the type of non-response present in the data, demonstrating that biased non-response is particularly detrimental to model performance. We argue that biased non-response is likely in contexts where the labelling process, by nature, relies on user interactions. To mitigate the impact of biased non-response, we propose a cost-based correction to the sampling strategy, the Upper Confidence Bound of the Expected Utility (UCB-EU), that can, plausibly, be applied to any active learning algorithm. Through experiments, we demonstrate that our method successfully reduces the harm from labelling non-response in many settings. However, we also characterise settings where the non-response bias in the annotations remains detrimental under UCB-EU for specific sampling methods and data generating processes. Finally, we evaluate our method on a real-world dataset from an e-commerce platform. We show that UCB-EU yields substantial performance improvements to conversion models that are trained on clicked impressions. Most generally, this research serves to both better conceptualise the interplay between types of non-response and model improvements via active learning, and to provide a practical, easy-to-implement correction that mitigates model degradation.

Citations: 0