Data Mining and Knowledge Discovery最新文献

Missing value replacement in strings and applications. 在字符串和应用程序中缺少值替换。

IF 2.8 3区计算机科学

Data Mining and Knowledge Discovery Pub Date : 2025-01-01 Epub Date: 2025-01-22 DOI: 10.1007/s10618-024-01074-3

Giulia Bernardini, Chang Liu, Grigorios Loukides, Alberto Marchetti-Spaccamela, Solon P Pissis, Leen Stougie, Michelle Sweering

{"title":"Missing value replacement in strings and applications.","authors":"Giulia Bernardini, Chang Liu, Grigorios Loukides, Alberto Marchetti-Spaccamela, Solon P Pissis, Leen Stougie, Michelle Sweering","doi":"10.1007/s10618-024-01074-3","DOIUrl":"10.1007/s10618-024-01074-3","url":null,"abstract":"Missing values arise routinely in real-world sequential (string) datasets due to: (1) imprecise data measurements; (2) flexible sequence modeling, such as binding profiles of molecular sequences; or (3) the existence of confidential information in a dataset which has been deleted deliberately for privacy protection. In order to analyze such datasets, it is often important to replace each missing value, with one or more valid letters, in an efficient and effective way. Here we formalize this task as a combinatorial optimization problem: the set of constraints includes the context of the missing value (i.e., its vicinity) as well as a finite set of user-defined forbidden patterns, modeling, for instance, implausible or confidential patterns; and the objective function seeks to minimize the number of new letters we introduce. Algorithmically, our problem translates to finding shortest paths in special graphs that contain forbidden edges representing the forbidden patterns. Our work makes the following contributions: (1) we design a linear-time algorithm to solve this problem for strings over constant-sized alphabets; (2) we show how our algorithm can be effortlessly applied to fully sanitize a private string in the presence of a set of fixed-length forbidden patterns [Bernardini et al. 2021a]; (3) we propose a methodology for sanitizing and clustering a collection of private strings that utilizes our algorithm and an effective and efficiently computable distance measure; and (4) we present extensive experimental results showing that our methodology can efficiently sanitize a collection of private strings while preserving clustering quality, outperforming the state of the art and baselines. To arrive at our theoretical results, we employ techniques from formal languages and combinatorial pattern matching.","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"39 2","pages":"12"},"PeriodicalIF":2.8,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11754389/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143048707","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

FRUITS: feature extraction using iterated sums for time series classification FRUITS：利用迭代和进行时间序列分类的特征提取

IF 4.8 3区计算机科学

Data Mining and Knowledge Discovery Pub Date : 2024-09-14 DOI: 10.1007/s10618-024-01068-1

Joscha Diehl, Richard Krieg

引用次数: 0

Bounding the family-wise error rate in local causal discovery using Rademacher averages 利用拉德马赫平均值限定局部因果发现中的族内误差率

IF 4.8 3区计算机科学

Data Mining and Knowledge Discovery Pub Date : 2024-09-09 DOI: 10.1007/s10618-024-01069-0

Dario Simionato, Fabio Vandin

{"title":"Bounding the family-wise error rate in local causal discovery using Rademacher averages","authors":"Dario Simionato, Fabio Vandin","doi":"10.1007/s10618-024-01069-0","DOIUrl":"https://doi.org/10.1007/s10618-024-01069-0","url":null,"abstract":"Many algorithms have been proposed to learn local graphical structures around target variables of interest from observational data, focusing on two sets of variables. The first one, called Parent–Children (PC) set, contains all the variables that are direct causes or consequences of the target while the second one, known as Markov boundary (MB), is the minimal set of variables with optimal prediction performances of the target. In this paper we introduce two novel algorithms for the PC and MB discovery tasks with rigorous guarantees on the Family-Wise Error Rate (FWER), that is, the probability of reporting any false positive in output. Our algorithms use Rademacher averages, a key concept from statistical learning theory, to properly account for the multiple-hypothesis testing problem arising in such tasks. Our evaluation on simulated data shows that our algorithms properly control for the FWER, while widely used algorithms do not provide guarantees on false discoveries even when correcting for multiple-hypothesis testing. Our experiments also show that our algorithms identify meaningful relations in real-world data.","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"10 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142192860","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Evaluating the disclosure risk of anonymized documents via a machine learning-based re-identification attack 通过基于机器学习的再识别攻击评估匿名文件的披露风险

IF 4.8 3区计算机科学

Data Mining and Knowledge Discovery Pub Date : 2024-09-03 DOI: 10.1007/s10618-024-01066-3

Benet Manzanares-Salor, David Sánchez, Pierre Lison

{"title":"Evaluating the disclosure risk of anonymized documents via a machine learning-based re-identification attack","authors":"Benet Manzanares-Salor, David Sánchez, Pierre Lison","doi":"10.1007/s10618-024-01066-3","DOIUrl":"https://doi.org/10.1007/s10618-024-01066-3","url":null,"abstract":"The availability of textual data depicting human-centered features and behaviors is crucial for many data mining and machine learning tasks. However, data containing personal information should be anonymized prior making them available for secondary use. A variety of text anonymization methods have been proposed in the last years, which are standardly evaluated by comparing their outputs with human-based anonymizations. The residual disclosure risk is estimated with the recall metric, which quantifies the proportion of manually annotated re-identifying terms successfully detected by the anonymization algorithm. Nevertheless, recall is not a risk metric, which leads to several drawbacks. First, it requires a unique ground truth, and this does not hold for text anonymization, where several masking choices could be equally valid to prevent re-identification. Second, it relies on human judgements, which are inherently subjective and prone to errors. Finally, the recall metric weights terms uniformly, thereby ignoring the fact that the influence on the disclosure risk of some missed terms may be much larger than of others. To overcome these drawbacks, in this paper we propose a novel method to evaluate the disclosure risk of anonymized texts by means of an automated re-identification attack. We formalize the attack as a multi-class classification task and leverage state-of-the-art neural language models to aggregate the data sources that attackers may use to build the classifier. We illustrate the effectiveness of our method by assessing the disclosure risk of several methods for text anonymization under different attack configurations. Empirical results show substantial privacy risks for most existing anonymization methods.","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"8 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142192864","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Efficient learning with projected histograms 利用投影直方图进行高效学习

IF 4.8 3区计算机科学

Data Mining and Knowledge Discovery Pub Date : 2024-09-01 DOI: 10.1007/s10618-024-01063-6

Zhanliang Huang, Ata Kabán, Henry Reeve

{"title":"Efficient learning with projected histograms","authors":"Zhanliang Huang, Ata Kabán, Henry Reeve","doi":"10.1007/s10618-024-01063-6","DOIUrl":"https://doi.org/10.1007/s10618-024-01063-6","url":null,"abstract":"High dimensional learning is a perennial problem due to challenges posed by the “curse of dimensionality”; learning typically demands more computing resources as well as more training data. In differentially private (DP) settings, this is further exacerbated by noise that needs adding to each dimension to achieve the required privacy. In this paper, we present a surprisingly simple approach to address all of these concerns at once, based on histograms constructed on a low-dimensional random projection (RP) of the data. Our approach exploits RP to take advantage of hidden low-dimensional structures in the data, yielding both computational efficiency, and improved error convergence with respect to the sample size—whereby less training data suffice for learning. We also propose a variant for efficient differentially private (DP) classification that further exploits the data-oblivious nature of both the histogram construction and the RP based dimensionality reduction, resulting in an efficient management of the privacy budget. We present a detailed and rigorous theoretical analysis of generalisation of our algorithms in several settings, showing that our approach is able to exploit low-dimensional structure of the data, ameliorates the ill-effects of noise required for privacy, and has good generalisation under minimal conditions. We also corroborate our findings experimentally, and demonstrate that our algorithms achieve competitive classification accuracy in both non-private and private settings.","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"24 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142192861","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Opinion dynamics in social networks incorporating higher-order interactions 包含高阶互动的社交网络中的舆论动态

IF 4.8 3区计算机科学

Data Mining and Knowledge Discovery Pub Date : 2024-08-30 DOI: 10.1007/s10618-024-01064-5

Zuobai Zhang, Wanyue Xu, Zhongzhi Zhang, Guanrong Chen

{"title":"Opinion dynamics in social networks incorporating higher-order interactions","authors":"Zuobai Zhang, Wanyue Xu, Zhongzhi Zhang, Guanrong Chen","doi":"10.1007/s10618-024-01064-5","DOIUrl":"https://doi.org/10.1007/s10618-024-01064-5","url":null,"abstract":"The issue of opinion sharing and formation has received considerable attention in the academic literature, and a few models have been proposed to study this problem. However, existing models are limited to the interactions among nearest neighbors, with those second, third, and higher-order neighbors only considered indirectly, despite the fact that higher-order interactions occur frequently in real social networks. In this paper, we develop a new model for opinion dynamics by incorporating long-range interactions based on higher-order random walks that can explicitly tune the degree of influence of higher-order neighbor interactions. We prove that the model converges to a fixed opinion vector, which may differ greatly from those models without higher-order interactions. Since direct computation of the equilibrium opinion is computationally expensive, which involves the operations of huge-scale matrix multiplication and inversion, we design a theoretically convergence-guaranteed estimation algorithm that approximates the equilibrium opinion vector nearly linearly in both space and time with respect to the number of edges in the graph. We conduct extensive experiments on various social networks, demonstrating that the new algorithm is both highly efficient and effective.","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"138 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142192862","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Model-agnostic variable importance for predictive uncertainty: an entropy-based approach 预测不确定性的模型无关变量重要性：基于熵的方法

IF 4.8 3区计算机科学

Data Mining and Knowledge Discovery Pub Date : 2024-08-29 DOI: 10.1007/s10618-024-01070-7

Danny Wood, Theodore Papamarkou, Matt Benatan, Richard Allmendinger

{"title":"Model-agnostic variable importance for predictive uncertainty: an entropy-based approach","authors":"Danny Wood, Theodore Papamarkou, Matt Benatan, Richard Allmendinger","doi":"10.1007/s10618-024-01070-7","DOIUrl":"https://doi.org/10.1007/s10618-024-01070-7","url":null,"abstract":"In order to trust the predictions of a machine learning algorithm, it is necessary to understand the factors that contribute to those predictions. In the case of probabilistic and uncertainty-aware models, it is necessary to understand not only the reasons for the predictions themselves, but also the reasons for the model’s level of confidence in those predictions. In this paper, we show how existing methods in explainability can be extended to uncertainty-aware models and how such extensions can be used to understand the sources of uncertainty in a model’s predictive distribution. In particular, by adapting permutation feature importance, partial dependence plots, and individual conditional expectation plots, we demonstrate that novel insights into model behaviour may be obtained and that these methods can be used to measure the impact of features on both the entropy of the predictive distribution and the log-likelihood of the ground truth labels under that distribution. With experiments using both synthetic and real-world data, we demonstrate the utility of these approaches to understand both the sources of uncertainty and their impact on model performance.","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"50 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142192863","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Detach-ROCKET: sequential feature selection for time series classification with random convolutional kernels Detach-ROCKET：利用随机卷积核进行时间序列分类的顺序特征选择

IF 4.8 3区计算机科学

Data Mining and Knowledge Discovery Pub Date : 2024-08-20 DOI: 10.1007/s10618-024-01062-7

Gonzalo Uribarri, Federico Barone, Alessio Ansuini, Erik Fransén

{"title":"Detach-ROCKET: sequential feature selection for time series classification with random convolutional kernels","authors":"Gonzalo Uribarri, Federico Barone, Alessio Ansuini, Erik Fransén","doi":"10.1007/s10618-024-01062-7","DOIUrl":"https://doi.org/10.1007/s10618-024-01062-7","url":null,"abstract":"Time Series Classification (TSC) is essential in fields like medicine, environmental science, and finance, enabling tasks such as disease diagnosis, anomaly detection, and stock price analysis. While machine learning models like Recurrent Neural Networks and InceptionTime are successful in numerous applications, they can face scalability issues due to computational requirements. Recently, ROCKET has emerged as an efficient alternative, achieving state-of-the-art performance and simplifying training by utilizing a large number of randomly generated features from the time series data. However, many of these features are redundant or non-informative, increasing computational load and compromising generalization. Here we introduce Sequential Feature Detachment (SFD) to identify and prune non-essential features in ROCKET-based models, such as ROCKET, MiniRocket, and MultiRocket. SFD estimates feature importance using model coefficients and can handle large feature sets without complex hyperparameter tuning. Testing on the UCR archive shows that SFD can produce models with better test accuracy using only 10% of the original features. We named these pruned models Detach-ROCKET. We also present an end-to-end procedure for determining an optimal balance between the number of features and model accuracy. On the largest binary UCR dataset, Detach-ROCKET improves test accuracy by 0.6% while reducing features by 98.9%. By enabling a significant reduction in model size without sacrificing accuracy, our methodology improves computational efficiency and contributes to model interpretability. We believe that Detach-ROCKET will be a valuable tool for researchers and practitioners working with time series data, who can find a user-friendly implementation of the model at https://github.com/gon-uri/detach_rocket.","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"24 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142192865","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Bayesian network Motifs for reasoning over heterogeneous unlinked datasets 用于推理异构非链接数据集的贝叶斯网络动机

IF 4.8 3区计算机科学

Data Mining and Knowledge Discovery Pub Date : 2024-08-17 DOI: 10.1007/s10618-024-01054-7

Yi Sui, Alex Kwan, Alexander W. Olson, Scott Sanner, Daniel A. Silver

{"title":"Bayesian network Motifs for reasoning over heterogeneous unlinked datasets","authors":"Yi Sui, Alex Kwan, Alexander W. Olson, Scott Sanner, Daniel A. Silver","doi":"10.1007/s10618-024-01054-7","DOIUrl":"https://doi.org/10.1007/s10618-024-01054-7","url":null,"abstract":"Modern data-oriented applications often require integrating data from multiple heterogeneous sources. When these datasets share attributes, but are otherwise unlinked, there is no way to join them and reason at the individual level explicitly. However, as we show in this work, this does not prevent probabilistic reasoning over these heterogeneous datasets even when the data and shared attributes exhibit significant mismatches that are common in real-world data. Different datasets have different sample biases, disagree on category definitions and spatial representations, collect data at different temporal intervals, and mix aggregate-level with individual data. In this work, we demonstrate how a set of Bayesian network motifs allows all of these mismatches to be resolved in a composable framework that permits joint probabilistic reasoning over all datasets without manipulating, modifying, or imputing the original data, thus avoiding potentially harmful assumptions. We provide an open source Python tool that encapsulates our methodology and demonstrate this tool on a number of real-world use cases.","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"125 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-08-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142192866","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Regularization-based methods for ordinal quantification 基于正则化的序量化方法

IF 4.8 3区计算机科学

Data Mining and Knowledge Discovery Pub Date : 2024-08-15 DOI: 10.1007/s10618-024-01067-2

Mirko Bunse, Alejandro Moreo, Fabrizio Sebastiani, Martin Senz

{"title":"Regularization-based methods for ordinal quantification","authors":"Mirko Bunse, Alejandro Moreo, Fabrizio Sebastiani, Martin Senz","doi":"10.1007/s10618-024-01067-2","DOIUrl":"https://doi.org/10.1007/s10618-024-01067-2","url":null,"abstract":"Quantification, i.e., the task of predicting the class prevalence values in bags of unlabeled data items, has received increased attention in recent years. However, most quantification research has concentrated on developing algorithms for binary and multi-class problems in which the classes are not ordered. Here, we study the ordinal case, i.e., the case in which a total order is defined on the set of (n>2) classes. We give three main contributions to this field. First, we create and make available two datasets for ordinal quantification (OQ) research that overcome the inadequacies of the previously available ones. Second, we experimentally compare the most important OQ algorithms proposed in the literature so far. To this end, we bring together algorithms proposed by authors from very different research fields, such as data mining and astrophysics, who were unaware of each others’ developments. Third, we propose a novel class of regularized OQ algorithms, which outperforms existing algorithms in our experiments. The key to this gain in performance is that our regularization prevents ordinally implausible estimates, assuming that ordinal distributions tend to be smooth in practice. We informally verify this assumption for several real-world applications.","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"75 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142192892","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0