Statistical Analysis and Data Mining: Latest Articles, Page 10

Reduced Rank Ridge Regression and Its Kernel Extensions.

IF 1.3, Zone 4, Mathematics

Statistical Analysis and Data Mining Pub Date : 2011-12-01 Epub Date: 2011-10-07 DOI: 10.1002/sam.10138

Ashin Mukherjee, Ji Zhu

Citations: 53

Clustering Based on Periodicity in High-Throughput Time Course Data.

IF 1.3, Zone 4, Mathematics

Statistical Analysis and Data Mining Pub Date : 2011-12-01 DOI: 10.1002/sam.10137

Anna J Blackstock, Amita K Manatunga, Youngja Park, Dean P Jones, Tianwei Yu

Citations: 2

A Novel Support Vector Classifier for Longitudinal High-dimensional Data and Its Application to Neuroimaging Data.

IF 1.3, Zone 4, Mathematics

Statistical Analysis and Data Mining Pub Date : 2011-12-01 DOI: 10.1002/sam.10141

Shuo Chen, F DuBois Bowman

Abstract: Recent technological advances have made it possible for many studies to collect high dimensional data (HDD) longitudinally, for example images collected during different scanning sessions. Such studies may yield temporal changes of selected features that, when incorporated with machine learning methods, are able to predict disease status or responses to a therapeutic treatment. Support vector machine (SVM) techniques are robust and effective tools well-suited for the classification and prediction of HDD. However, current SVM methods for HDD analysis typically consider cross-sectional data collected during one time period or session (e.g. baseline). We propose a novel support vector classifier (SVC) for longitudinal HDD that allows simultaneous estimation of the SVM separating hyperplane parameters and temporal trend parameters, which determine the optimal means to combine the longitudinal data for classification and prediction. Our approach is based on an augmented reproducing kernel function and uses quadratic programming for optimization. We demonstrate the use and potential advantages of our proposed methodology using a simulation study and a data example from the Alzheimer's Disease Neuroimaging Initiative. The results indicate that our proposed method leverages the additional longitudinal information to achieve higher accuracy than methods using only cross-sectional data and methods that combine longitudinal data by naively expanding the feature space.

Volume 4, Issue 6, pp. 604-611. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4189187/pdf/nihms-629358.pdf

Citations: 0

Space-efficient tracking of persistent items in a massive data stream

IF 1.3, Zone 4, Mathematics

Statistical Analysis and Data Mining Pub Date : 2011-07-11 DOI: 10.1145/2002259.2002294

Bibudh Lahiri, S. Tirthapura, J. Chandrashekar

Citations: 19

Sequential Support Vector Regression with Embedded Entropy for SNP Selection and Disease Classification.

IF 2.1, Zone 4, Mathematics

Statistical Analysis and Data Mining Pub Date : 2011-06-01 DOI: 10.1002/sam.10110

Yulan Liang, Arpad Kelemen

Citations: 0

A Machine-Learning Approach to Detecting Unknown Bacterial Serovars.

IF 2.1, Zone 4, Mathematics

Statistical Analysis and Data Mining Pub Date : 2010-10-01 DOI: 10.1002/sam.10085

Ferit Akova, Murat Dundar, V Jo Davisson, E Daniel Hirleman, Arun K Bhunia, J Paul Robinson, Bartek Rajwa

Abstract: Technologies for rapid detection of bacterial pathogens are crucial for securing the food supply. A light-scattering sensor recently developed for real-time identification of multiple colonies has shown great promise for distinguishing bacteria cultures. The classification approach currently used with this system relies on supervised learning. For accurate classification of bacterial pathogens, the training library should be exhaustive, i.e., should consist of samples of all possible pathogens. Yet, the sheer number of existing bacterial serovars and more importantly the effect of their high mutation rate would not allow for a practical and manageable training. In this study, we propose a Bayesian approach to learning with a nonexhaustive training dataset for automated detection of unmatched bacterial serovars, i.e., serovars for which no samples exist in the training library. The main contribution of our work is the Wishart conjugate priors defined over class distributions. This allows us to employ the prior information obtained from known classes to make inferences about unknown classes as well. By this means, we identify new classes of informational value and dynamically update the training dataset with these classes to make it increasingly more representative of the sample population. This results in a classifier with improved predictive performance for future samples. We evaluated our approach on a 28-class bacteria dataset and also on the benchmark 26-class letter recognition dataset for further validation. The proposed approach is compared against state-of-the-art involving density-based approaches and support vector domain description, as well as a recently introduced Bayesian approach based on simulated classes.

Volume 3, Issue 5, pp. 289-301. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3230886/pdf/nihms242307.pdf

Citations: 0

Discriminative frequent subgraph mining with optimality guarantees

IF 1.3, Zone 4, Mathematics

Statistical Analysis and Data Mining Pub Date : 2010-10-01 DOI: 10.1002/SAM.V3:5

Marisa Thoma, Hong Cheng, A. Gretton, Jiawei Han, H. Kriegel, Alex Smola, Le Song, Philip S. Yu, Xifeng Yan, Karsten M. Borgwardt

Citations: 23

Model selection procedure for high-dimensional data.

IF 2.1, Zone 4, Mathematics

Statistical Analysis and Data Mining Pub Date : 2010-10-01 DOI: 10.1002/sam.10088

Yongli Zhang, Xiaotong Shen

Abstract: For high-dimensional regression, the number of predictors may greatly exceed the sample size but only a small fraction of them are related to the response. Therefore, variable selection is inevitable, where consistent model selection is the primary concern. However, conventional consistent model selection criteria like BIC may be inadequate due to their nonadaptivity to the model space and infeasibility of exhaustive search. To address these two issues, we establish a probability lower bound of selecting the smallest true model by an information criterion, based on which we propose a model selection criterion, what we call RIC(c), which adapts to the model space. Furthermore, we develop a computationally feasible method combining the computational power of least angle regression (LAR) with that of RIC(c). Both theoretical and simulation studies show that this method identifies the smallest true model with probability converging to one if the smallest true model is selected by LAR. The proposed method is applied to real data from the power market and outperforms the backward variable selection in terms of price forecasting accuracy.

Volume 3, Issue 5, pp. 350-358. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2992390/pdf/nihms-225711.pdf

Citations: 0

Large-scale regression-based pattern discovery: The example of screening the WHO global drug safety database

IF 1.3, Zone 4, Mathematics

Statistical Analysis and Data Mining Pub Date : 2010-08-01 DOI: 10.1002/SAM.V3:4

O. Caster, G. N. Norén, D. Madigan, A. Bate

Abstract: Most measures of interestingness for patterns of co-occurring events are based on data projections onto contingency tables for the events of primary interest. As an alternative, this article presents the first implementation of shrinkage logistic regression for large-scale pattern discovery, with an evaluation of its usefulness in real-world binary transaction data. Regression accounts for the impact of other covariates that may confound or otherwise distort associations. The application considered is international adverse drug reaction (ADR) surveillance, in which large collections of reports on suspected ADRs are screened for interesting reporting patterns worthy of clinical follow-up. Our results show that regression-based pattern discovery does offer practical advantages. Specifically it can eliminate false positives and false negatives due to other covariates. Furthermore, it identifies some established drug safety issues earlier than a measure based on contingency tables. While regression offers clear conceptual advantages, our results suggest that methods based on contingency tables will continue to play a key role in ADR surveillance, for two reasons: the failure of regression to identify some established drug safety concerns as early as the currently used measures, and the relative lack of transparency of the procedure to estimate the regression coefficients. This suggests shrinkage regression should be used in parallel to existing measures of interestingness in ADR surveillance and other large-scale pattern discovery applications. Copyright © 2010 Wiley Periodicals, Inc. Statistical Analysis and Data Mining 3: 197-208, 2010

Journal Article.

Citations: 35

Multicategory Composite Least Squares Classifiers.

IF 2.1, Zone 4, Mathematics

Statistical Analysis and Data Mining Pub Date : 2010-08-01 DOI: 10.1002/sam.10081

Seo Young Park, Yufeng Liu, Dacheng Liu, Paul Scholl

Abstract: Classification is a very useful statistical tool for information extraction. In particular, multicategory classification is commonly seen in various applications. Although binary classification problems are heavily studied, extensions to the multicategory case are much less so. In view of the increased complexity and volume of modern statistical problems, it is desirable to have multicategory classifiers that are able to handle problems with high dimensions and with a large number of classes. Moreover, it is necessary to have sound theoretical properties for the multicategory classifiers. In the literature, there exist several different versions of simultaneous multicategory Support Vector Machines (SVMs). However, the computation of the SVM can be difficult for large scale problems, especially for problems with large number of classes. Furthermore, the SVM cannot produce class probability estimation directly. In this article, we propose a novel efficient multicategory composite least squares classifier (CLS classifier), which utilizes a new composite squared loss function. The proposed CLS classifier has several important merits: efficient computation for problems with large number of classes, asymptotic consistency, ability to handle high dimensional data, and simple conditional class probability estimation. Our simulated and real examples demonstrate competitive performance of the proposed approach.

Volume 3, Issue 4, pp. 272-286. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3015392/pdf/nihms-225876.pdf

Citations: 0