{"title":"Uplift modeling with quasi-loss-functions","authors":"Jinping Hu, Evert de Haan, Bernd Skiera","doi":"10.1007/s10618-024-01042-x","DOIUrl":"https://doi.org/10.1007/s10618-024-01042-x","url":null,"abstract":"<p>Uplift modeling, also referred to as heterogeneous treatment effect estimation, is a machine learning technique utilized in marketing for estimating the incremental impact of treatment on the response of each customer. Uplift models face a fundamental challenge in causal inference because the variable of interest (i.e., the uplift itself) remains unobservable. As a result, popular uplift models (such as meta-learners and uplift trees) do not incorporate loss functions for uplifts in their algorithms. This article addresses that gap by proposing uplift models with quasi-loss functions (UpliftQL models), which separately use four specially designed quasi-loss functions for uplift estimation in algorithms. Using simulated data, our analysis reveals that, on average, 55% (34%) of the top five models from a set of 14 are UpliftQL models for binary (continuous) outcomes. Further empirical data analysis shows that over 60% of the top-performing models are consistently UpliftQL models.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"23 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141258154","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jannat Ara Meem, Muhammad Shihab Rashid, Vagelis Hristidis
{"title":"Modeling the impact of out-of-schema questions in task-oriented dialog systems","authors":"Jannat Ara Meem, Muhammad Shihab Rashid, Vagelis Hristidis","doi":"10.1007/s10618-024-01039-6","DOIUrl":"https://doi.org/10.1007/s10618-024-01039-6","url":null,"abstract":"<p>Existing work on task-oriented dialog systems generally assumes that the interaction of users with the system is restricted to the information stored in a closed data schema. However, in practice users may ask ‘out-of-schema’ questions, that is, questions that the system cannot answer, because the information does not exist in the schema. Failure to answer these questions may lead the users to drop out of the chat before reaching the success state (e.g. reserving a restaurant). A key challenge is that the number of these questions may be too high for a domain expert to answer them all. We formulate the problem of out-of-schema question detection and selection that identifies the most critical out-of-schema questions to answer, in order to maximize the expected success rate of the system. We propose a two-stage pipeline to solve the problem. In the first stage, we propose a novel in-context learning (ICL) approach to detect out-of-schema questions. In the second stage, we propose two algorithms for out-of-schema question selection (OQS): a naive approach that chooses a question based on its frequency in the dropped-out conversations, and a probabilistic approach that represents each conversation as a Markov chain and a question is picked based on its overall benefit. We propose and publish two new datasets for the problem, as existing datasets do not contain out-of-schema questions or user drop-outs. Our quantitative and simulation-based experimental analyses on these datasets measure how our methods can effectively identify out-of-schema questions and positively impact the success rate of the system.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"43 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141258149","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving graph-based recommendation with unraveled graph learning","authors":"Chih-Chieh Chang, Diing-Ruey Tzeng, Chia-Hsun Lu, Ming-Yi Chang, Chih-Ya Shen","doi":"10.1007/s10618-024-01038-7","DOIUrl":"https://doi.org/10.1007/s10618-024-01038-7","url":null,"abstract":"<p>Graph Collaborative Filtering (GraphCF) has emerged as a promising approach in recommendation systems, leveraging the inferential power of Graph Neural Networks. Furthermore, the integration of contrastive learning has enhanced the performance of GraphCF methods. Recent research has shifted from graph augmentation to noise perturbation in contrastive learning, leading to significant performance improvements. However, we contend that the primary factor in performance enhancement is not graph augmentation or noise perturbation, but rather the <i>balance of the embedding from each layer in the output embedding</i>. To substantiate our claim, we conducted preliminary experiments with multiple state-of-the-art GraphCF methods. Based on our observations and insights, we propose a novel approach named <i>Unraveled Graph Contrastive Learning (UGCL)</i>, which includes a new propagation scheme to further enhance performance. To the best of our knowledge, this is the first approach that specifically addresses the balance factor in the output embedding for performance improvement. We have carried out extensive experiments on multiple large-scale benchmark datasets to evaluate the effectiveness of our proposed approach. The results indicate that UGCL significantly outperforms all other state-of-the-art baseline models, also showing superior performance in terms of fairness and debiasing capabilities compared to other baselines.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"30 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-06-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141259415","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A practical approach to novel class discovery in tabular data","authors":"Troisemaine Colin, Reiffers-Masson Alexandre, Gosselin Stéphane, Lemaire Vincent, Vaton Sandrine","doi":"10.1007/s10618-024-01025-y","DOIUrl":"https://doi.org/10.1007/s10618-024-01025-y","url":null,"abstract":"<p>The problem of novel class discovery (NCD) consists in extracting knowledge from a labeled set of known classes to accurately partition an unlabeled set of novel classes. While NCD has recently received a lot of attention from the community, it is often solved on computer vision problems and under unrealistic conditions. In particular, the number of novel classes is usually assumed to be known in advance, and their labels are sometimes used to tune hyperparameters. Methods that rely on these assumptions are not applicable in real-world scenarios. In this work, we focus on solving NCD in tabular data when no prior knowledge of the novel classes is available. To this end, we propose to tune the hyperparameters of NCD methods by adapting the <i>k</i>-fold cross-validation process and hiding some of the known classes in each fold. Since we have found that methods with too many hyperparameters are likely to overfit these hidden classes, we define a simple deep NCD model. This method is composed of only the essential elements necessary for the NCD problem and shows robust performance under realistic conditions. Furthermore, we find that the latent space of this method can be used to reliably estimate the number of novel classes. Additionally, we adapt two unsupervised clustering algorithms (<i>k</i>-means and Spectral Clustering) to leverage the knowledge of the known classes. Extensive experiments are conducted on 7 tabular datasets and demonstrate the effectiveness of the proposed method and hyperparameter tuning process, and show that the NCD problem can be solved without relying on knowledge from the novel classes.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"123 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141197835","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Antonio Ferrara, Francesco Bonchi, Francesco Fabbri, Fariba Karimi, Claudia Wagner
{"title":"Bias-aware ranking from pairwise comparisons","authors":"Antonio Ferrara, Francesco Bonchi, Francesco Fabbri, Fariba Karimi, Claudia Wagner","doi":"10.1007/s10618-024-01024-z","DOIUrl":"https://doi.org/10.1007/s10618-024-01024-z","url":null,"abstract":"<p>Human feedback is often used, either directly or indirectly, as input to algorithmic decision making. However, humans are biased: if the algorithm that takes as input the human feedback does not control for potential biases, this might result in biased algorithmic decision making, which can have a tangible impact on people’s lives. In this paper, we study how to detect and correct for evaluators’ bias in the task of <i>ranking people (or items) from pairwise comparisons</i>. Specifically, we assume we are given pairwise comparisons of the items to be ranked produced by a set of evaluators. While the pairwise assessments of the evaluators should reflect to a certain extent the latent (unobservable) true quality scores of the items, they might be affected by each evaluator’s own bias against, or in favor, of some groups of items. By detecting and amending evaluators’ biases, we aim to produce a ranking of the items that is, as much as possible, in accordance with the ranking one would produce by having access to the latent quality scores. Our proposal is a novel method that extends the classic Bradley-Terry model by having a bias parameter for each evaluator which distorts the true quality score of each item, depending on the group the item belongs to. Thanks to the simplicity of the model, we are able to write explicitly its log-likelihood w.r.t. the parameters (i.e., items’ latent scores and evaluators’ bias) and optimize by means of the alternating approach. Our experiments on synthetic and real-world data confirm that our method is able to reconstruct the bias of each single evaluator extremely well and thus to outperform several non-trivial competitors in the task of producing a ranking which is as much as possible close to the unbiased ranking.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"5 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141197974","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Daan Van Wesenbeeck, Aras Yurtman, Wannes Meert, Hendrik Blockeel
{"title":"LoCoMotif: discovering time-warped motifs in time series","authors":"Daan Van Wesenbeeck, Aras Yurtman, Wannes Meert, Hendrik Blockeel","doi":"10.1007/s10618-024-01032-z","DOIUrl":"https://doi.org/10.1007/s10618-024-01032-z","url":null,"abstract":"<p>Time series motif discovery (TSMD) refers to the task of identifying patterns that occur multiple times (possibly with minor variations) in a time series. All existing methods for TSMD have one or more of the following limitations: they only look for the two most similar occurrences of a pattern; they only look for patterns of a pre-specified, fixed length; they cannot handle variability along the time axis; and they only handle univariate time series. In this paper, we present a new method, LoCoMotif, that has none of these limitations. The method is motivated by a concrete use case from physiotherapy. We demonstrate the value of the proposed method on this use case. We also introduce a new quantitative evaluation metric for motif discovery, and benchmark data for comparing TSMD methods. LoCoMotif substantially outperforms the existing methods, on top of being more broadly applicable.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"44 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141197846","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Karima Makhlouf, Héber H. Arcolezi, Sami Zhioua, Ghassen Ben Brahim, Catuscia Palamidessi
{"title":"On the impact of multi-dimensional local differential privacy on fairness","authors":"Karima Makhlouf, Héber H. Arcolezi, Sami Zhioua, Ghassen Ben Brahim, Catuscia Palamidessi","doi":"10.1007/s10618-024-01031-0","DOIUrl":"https://doi.org/10.1007/s10618-024-01031-0","url":null,"abstract":"<p>Automated decision systems are increasingly used to make consequential decisions in people’s lives. Due to the sensitivity of the manipulated data and the resulting decisions, several ethical concerns need to be addressed for the appropriate use of such technologies, particularly fairness and privacy. Unlike previous work, which focused on centralized differential privacy (DP) or on local DP (LDP) for a single sensitive attribute, in this paper, we examine the impact of LDP in the presence of several sensitive attributes (i.e., <i>multi-dimensional data</i>) on fairness. Detailed empirical analysis on synthetic and benchmark datasets revealed very relevant observations. In particular, (1) multi-dimensional LDP is an efficient approach to reduce disparity, (2) the variant of the multi-dimensional approach of LDP (we employ two variants) matters only at low privacy guarantees (high <span>(epsilon)</span>), and (3) the true decision distribution has an important effect on which group is more sensitive to the obfuscation. Last, we summarize our findings in the form of recommendations to guide practitioners in adopting effective privacy-preserving practices while maintaining fairness and utility in machine learning applications.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"37 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141172109","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yishuo Zhang, Nayyar Zaidi, Jiahui Zhou, Tao Wang, Gang Li
{"title":"Effective interpretable learning for large-scale categorical data","authors":"Yishuo Zhang, Nayyar Zaidi, Jiahui Zhou, Tao Wang, Gang Li","doi":"10.1007/s10618-024-01030-1","DOIUrl":"https://doi.org/10.1007/s10618-024-01030-1","url":null,"abstract":"<p>Large scale categorical datasets are ubiquitous in machine learning and the success of most deployed machine learning models rely on how effectively the features are engineered. For large-scale datasets, parametric methods are generally used, among which three strategies for feature engineering are quite common. The first strategy focuses on managing the breadth (or width) of a network, e.g., generalized linear models (aka. <span>wide learning</span>). The second strategy focuses on the depth of a network, e.g., Artificial Neural networks or <span>ANN</span> (aka. <span>deep learning</span>). The third strategy relies on factorizing the interaction terms, e.g., Factorization Machines (aka. <span>factorized learning</span>). Each of these strategies brings its own advantages and disadvantages. Recently, it has been shown that for categorical data, combination of various strategies leads to excellent results. For example, <span>WD</span>-Learning, <span>xdeepFM</span>, etc., leads to state-of-the-art results. Following the trend, in this work, we have proposed another learning framework—<span>WBDF</span>-Learning, based on the combination of <span>wide</span>, <span>deep</span>, <span>factorization</span>, and a newly introduced component named <span>Broad Interaction network</span> (<span>BIN</span>). <span>BIN</span> is in the form of a Bayesian network classifier whose structure is learned apriori, and parameters are learned by optimizing a joint objective function along with <span>wide</span>, <span>deep</span> and <span>factorized</span> parts. We denote the learning of <span>BIN</span> parameters as <span>broad learning</span>. Additionally, the parameters of <span>BIN</span> are constrained to be actual probabilities—therefore, it is extremely interpretable. Furthermore, one can sample or generate data from <span>BIN</span>, which can facilitate learning and provides a framework for <i>knowledge-guided machine learning</i>. We demonstrate that our proposed framework possesses the resilience to maintain excellent classification performance when confronted with biased datasets. We evaluate the efficacy of our framework in terms of classification performance on various benchmark large-scale categorical datasets and compare against state-of-the-art methods. It is shown that, <span>WBDF</span> framework (a) exhibits superior performance on classification tasks, (b) boasts outstanding interpretability and (c) demonstrates exceptional resilience and effectiveness in scenarios involving skewed distributions.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"22 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141172239","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Etienne Lehembre, Bruno Cremilleux, Albrecht Zimmermann, Bertrand Cuissart, Abdelkader Ouali
{"title":"WaveLSea: helping experts interactively explore pattern mining search spaces","authors":"Etienne Lehembre, Bruno Cremilleux, Albrecht Zimmermann, Bertrand Cuissart, Abdelkader Ouali","doi":"10.1007/s10618-024-01037-8","DOIUrl":"https://doi.org/10.1007/s10618-024-01037-8","url":null,"abstract":"<p>This article presents the method Wave Top-k Random-d Lineage Search (WaveLSea) which guides an expert through data mining results according to her interest. The method exploits expert feedback, combined with the relation between patterns to spread the expert’s interest. It avoids the typical feature definition step commonly used in interactive data mining which limits the flexibility of the discovery process. We empirically demonstrate that WaveLSea returns the most relevant results for the user’s subjective interest. Even with imperfect feedback, WaveLSea behavior remains robust as it primarily still delivers most interesting results during experiments on graph-structured data. In order to assess the robustness of the method we design novel oracles called soothsayers giving imperfect feedback. Finally, we complete our quantitative study with a qualitative study using a user interface to evaluate WaveLSea.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"98 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-05-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141150064","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Thomas S. Robinson, Niek Tax, Richard Mudd, Ido Guy
{"title":"Active learning with biased non-response to label requests","authors":"Thomas S. Robinson, Niek Tax, Richard Mudd, Ido Guy","doi":"10.1007/s10618-024-01026-x","DOIUrl":"https://doi.org/10.1007/s10618-024-01026-x","url":null,"abstract":"<p>Active learning can improve the efficiency of training prediction models by identifying the most informative new labels to acquire. However, non-response to label requests can impact active learning’s effectiveness in real-world contexts. We conceptualise this degradation by considering the type of non-response present in the data, demonstrating that biased non-response is particularly detrimental to model performance. We argue that biased non-response is likely in contexts where the labelling process, by nature, relies on user interactions. To mitigate the impact of biased non-response, we propose a cost-based correction to the sampling strategy–the <i>Upper Confidence Bound of the Expected Utility (UCB-EU)</i>–that can, plausibly, be applied to any active learning algorithm. Through experiments, we demonstrate that our method successfully reduces the harm from labelling non-response in many settings. However, we also characterise settings where the non-response bias in the annotations remains detrimental under UCB-EU for specific sampling methods and data generating processes. Finally, we evaluate our method on a real-world dataset from an e-commerce platform. We show that UCB-EU yields substantial performance improvements to conversion models that are trained on clicked impressions. Most generally, this research serves to both better conceptualise the interplay between types of non-response and model improvements via active learning, and to provide a practical, easy-to-implement correction that mitigates model degradation.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"36 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141150105","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}