Statistical Analysis and Data Mining: Latest Articles

Reduced Rank Ridge Regression and Its Kernel Extensions.
IF 1.3; Zone 4; Mathematics
Statistical Analysis and Data Mining; Pub Date: 2011-12-01; Epub Date: 2011-10-07; DOI: 10.1002/sam.10138
Ashin Mukherjee, Ji Zhu
Abstract: In multivariate linear regression, it is often assumed that the response matrix is intrinsically of lower rank. This could be because of the correlation structure among the prediction variables or the coefficient matrix being lower rank. To accommodate both, we propose a reduced rank ridge regression for multivariate linear regression. Specifically, we combine the ridge penalty with the reduced rank constraint on the coefficient matrix to come up with a computationally straightforward algorithm. Numerical studies indicate that the proposed method consistently outperforms relevant competitors. A novel extension of the proposed method to the reproducing kernel Hilbert space (RKHS) set-up is also developed.
Citations: 53
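The abstract above describes combining a ridge penalty with a rank constraint on the coefficient matrix. The following is a minimal sketch of one common way to do that, not necessarily the authors' exact algorithm: compute the ridge estimate from augmented data, then project it onto the leading right singular vectors of the augmented fitted values. The function name `reduced_rank_ridge` and the parameters `lam` and `rank` are hypothetical names chosen for illustration.

```python
import numpy as np

def reduced_rank_ridge(X, Y, lam, rank):
    """Ridge estimate projected onto a rank-`rank` subspace (illustrative sketch)."""
    n, p = X.shape
    q = Y.shape[1]
    # Augmenting the data turns the ridge problem into ordinary least squares.
    X_aug = np.vstack([X, np.sqrt(lam) * np.eye(p)])
    Y_aug = np.vstack([Y, np.zeros((p, q))])
    B_ridge, *_ = np.linalg.lstsq(X_aug, Y_aug, rcond=None)
    # Right singular vectors of the augmented fitted values define the projection.
    _, _, Vt = np.linalg.svd(X_aug @ B_ridge, full_matrices=False)
    V_r = Vt[:rank].T
    return B_ridge @ V_r @ V_r.T

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))
Y = rng.normal(size=(50, 4))
B = reduced_rank_ridge(X, Y, lam=1.0, rank=2)
print(B.shape, np.linalg.matrix_rank(B))
```

With `rank` equal to the number of responses, the projection is the identity and the sketch returns the plain ridge estimate, which gives a quick sanity check.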
Clustering Based on Periodicity in High-Throughput Time Course Data.
IF 1.3; Zone 4; Mathematics
Statistical Analysis and Data Mining; Pub Date: 2011-12-01; DOI: 10.1002/sam.10137
Anna J Blackstock, Amita K Manatunga, Youngja Park, Dean P Jones, Tianwei Yu
Abstract: Nuclear magnetic resonance (NMR) spectroscopy, traditionally used in analytical chemistry, has recently been introduced to studies of metabolite composition of biological fluids and tissues. Metabolite levels change over time, and providing a tool for better extraction of NMR peaks exhibiting periodic behavior is of interest. We propose a method in which NMR peaks are clustered based on periodic behavior. Periodic regression is used to obtain estimates of the parameter corresponding to period for individual NMR peaks. A mixture model is then used to develop clusters of peaks, taking into account the variability of the regression parameter estimates. Methods are applied to NMR data collected from human blood plasma over a 24-hour period. Simulation studies show that the extra variance component due to the estimation of the parameter estimate should be accounted for in the clustering procedure.
Citations: 2
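The periodic regression step mentioned in the abstract above can be illustrated with a small sketch. It assumes a single-harmonic model (intercept plus one cosine/sine pair) and a grid search over candidate periods; both are assumptions made for illustration, and the mixture-model clustering step is not shown. The function name `estimate_period` is hypothetical.

```python
import numpy as np

def estimate_period(t, y, candidate_periods):
    """Return the candidate period whose harmonic regression has the smallest RSS."""
    best_period, best_rss = None, np.inf
    for period in candidate_periods:
        omega = 2.0 * np.pi / period
        # Design matrix: intercept plus one cosine/sine pair at this period.
        design = np.column_stack(
            [np.ones_like(t), np.cos(omega * t), np.sin(omega * t)]
        )
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        rss = float(np.sum((y - design @ coef) ** 2))
        if rss < best_rss:
            best_period, best_rss = period, rss
    return best_period, best_rss

# Synthetic, noise-free signal sampled every half hour over 24 hours.
t = np.arange(0.0, 24.0, 0.5)
y = 2.0 + 3.0 * np.cos(2.0 * np.pi * t / 12.0 - 0.7)
period, rss = estimate_period(t, y, np.arange(4.0, 25.0, 1.0))
print(period)
```

On real data the residual sum of squares would be noisy, so the uncertainty of the estimated period matters; that is the variance component the abstract says the clustering step should account for.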
A Novel Support Vector Classifier for Longitudinal High-dimensional Data and Its Application to Neuroimaging Data.
IF 1.3; Zone 4; Mathematics
Statistical Analysis and Data Mining; Pub Date: 2011-12-01; DOI: 10.1002/sam.10141
Shuo Chen, F DuBois Bowman
Abstract: Recent technological advances have made it possible for many studies to collect high dimensional data (HDD) longitudinally, for example images collected during different scanning sessions. Such studies may yield temporal changes of selected features that, when incorporated with machine learning methods, are able to predict disease status or responses to a therapeutic treatment. Support vector machine (SVM) techniques are robust and effective tools well-suited for the classification and prediction of HDD. However, current SVM methods for HDD analysis typically consider cross-sectional data collected during one time period or session (e.g. baseline). We propose a novel support vector classifier (SVC) for longitudinal HDD that allows simultaneous estimation of the SVM separating hyperplane parameters and temporal trend parameters, which determine the optimal means to combine the longitudinal data for classification and prediction. Our approach is based on an augmented reproducing kernel function and uses quadratic programming for optimization. We demonstrate the use and potential advantages of our proposed methodology using a simulation study and a data example from the Alzheimer's Disease Neuroimaging Initiative. The results indicate that our proposed method leverages the additional longitudinal information to achieve higher accuracy than methods using only cross-sectional data and methods that combine longitudinal data by naively expanding the feature space.
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4189187/pdf/nihms-629358.pdf
Citations: 0
Space-efficient tracking of persistent items in a massive data stream
IF 1.3; Zone 4; Mathematics
Statistical Analysis and Data Mining; Pub Date: 2011-07-11; DOI: 10.1145/2002259.2002294
Bibudh Lahiri, S. Tirthapura, J. Chandrashekar
Abstract: Motivated by scenarios in network anomaly detection, we consider the problem of detecting persistent items in a data stream, which are items that occur "regularly" in the stream. In contrast with heavy-hitters, persistent items do not necessarily contribute significantly to the volume of a stream, and may escape detection by traditional volume-based anomaly detectors. We first show that any online algorithm that tracks persistent items exactly must necessarily use a large workspace, and is infeasible to run on a traffic monitoring node. In light of this lower bound, we introduce an approximate formulation of the problem and present a small-space algorithm to approximately track persistent items over a large data stream. Our experiments on a real traffic dataset show that in typical cases, the algorithm achieves a physical space compression of 5x-7x, while incurring very few false positives (< 1%) and false negatives (< 4%). To our knowledge, this is the first systematic study of the problem of detecting persistent items in a data stream, and our work can help detect anomalies that are temporal, rather than volume based.
Citations: 19
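To make the distinction between persistent items and heavy hitters in the abstract above concrete, here is a minimal sketch of the exact baseline, not the paper's small-space algorithm. It assumes the stream is divided into timeslots and treats an item as persistent when it appears in at least a fraction `alpha` of them; that definition, the function name `persistent_items`, and the parameters are assumptions for illustration, since the abstract only says such items occur "regularly".

```python
from collections import defaultdict

def persistent_items(stream, num_slots, alpha):
    """Return items seen in at least alpha * num_slots distinct timeslots.

    `stream` is an iterable of (item, slot) pairs. This exact approach stores
    a set of slots per item, which is the large workspace the abstract warns about.
    """
    slots_seen = defaultdict(set)
    for item, slot in stream:
        slots_seen[item].add(slot)
    threshold = alpha * num_slots
    return {item for item, slots in slots_seen.items() if len(slots) >= threshold}

# "b" dominates by volume but appears in one slot only; "a" appears in every slot.
stream = [("a", 0), ("b", 0), ("b", 0), ("b", 0), ("b", 0), ("a", 1), ("c", 1), ("a", 2)]
result = persistent_items(stream, num_slots=3, alpha=0.6)
print(result)
```

In this toy stream the highest-volume item is not persistent, which is the case a volume-based detector would miss.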
Sequential Support Vector Regression with Embedded Entropy for SNP Selection and Disease Classification.
IF 1.3; Zone 4; Mathematics
Statistical Analysis and Data Mining; Pub Date: 2011-06-01; DOI: 10.1002/sam.10110
Yulan Liang, Arpad Kelemen
Abstract: Comprehensive evaluation of common genetic variations through association of SNP structure with common diseases on the genome-wide scale is currently a hot area in human genome research. For less costly and faster diagnostics, advanced computational approaches are needed to select the minimum SNPs with the highest prediction accuracy for common complex diseases. In this paper, we present a sequential support vector regression model with embedded entropy algorithm to deal with the redundancy for the selection of the SNPs that have the best prediction performance for diseases. We implemented our proposed method for both SNP selection and disease classification, and applied it to simulation data sets and two real disease data sets. Results show that, on average, our proposed method outperforms the well-known methods of Support Vector Machine Recursive Feature Elimination, logistic regression, CART, and logic regression based SNP selections for disease classification.
Citations: 6
A Machine-Learning Approach to Detecting Unknown Bacterial Serovars.
IF 1.3; Zone 4; Mathematics
Statistical Analysis and Data Mining; Pub Date: 2010-10-01; DOI: 10.1002/sam.10085
Ferit Akova, Murat Dundar, V Jo Davisson, E Daniel Hirleman, Arun K Bhunia, J Paul Robinson, Bartek Rajwa
Abstract: Technologies for rapid detection of bacterial pathogens are crucial for securing the food supply. A light-scattering sensor recently developed for real-time identification of multiple colonies has shown great promise for distinguishing bacteria cultures. The classification approach currently used with this system relies on supervised learning. For accurate classification of bacterial pathogens, the training library should be exhaustive, i.e., should consist of samples of all possible pathogens. Yet, the sheer number of existing bacterial serovars and more importantly the effect of their high mutation rate would not allow for a practical and manageable training. In this study, we propose a Bayesian approach to learning with a nonexhaustive training dataset for automated detection of unknown bacterial serovars, i.e., serovars for which no samples exist in the training library. The main contribution of our work is the Wishart conjugate priors defined over class distributions. This allows us to employ the prior information obtained from known classes to make inferences about unknown classes as well. By this means, we identify new classes of informational value and dynamically update the training dataset with these classes to make it increasingly more representative of the sample population. This results in a classifier with improved predictive performance for future samples. We evaluated our approach on a 28-class bacteria dataset and also on the benchmark 26-class letter recognition dataset for further validation. The proposed approach is compared against the state of the art, involving density-based approaches and support vector domain description, as well as a recently introduced Bayesian approach based on simulated classes. Copyright © 2010 Wiley Periodicals, Inc. Statistical Analysis and Data Mining 3: 289-301, 2010
Citations: 19
Model selection procedure for high-dimensional data.
IF 1.3; Zone 4; Mathematics
Statistical Analysis and Data Mining; Pub Date: 2010-10-01; DOI: 10.1002/sam.10088
Yongli Zhang, Xiaotong Shen
Abstract: For high-dimensional regression, the number of predictors may greatly exceed the sample size but only a small fraction of them are related to the response. Therefore, variable selection is inevitable, where consistent model selection is the primary concern. However, conventional consistent model selection criteria like BIC may be inadequate due to their nonadaptivity to the model space and infeasibility of exhaustive search. To address these two issues, we establish a probability lower bound of selecting the smallest true model by an information criterion, based on which we propose a model selection criterion, which we call RIC(c), that adapts to the model space. Furthermore, we develop a computationally feasible method combining the computational power of least angle regression (LAR) with that of RIC(c). Both theoretical and simulation studies show that this method identifies the smallest true model with probability converging to one if the smallest true model is selected by LAR. The proposed method is applied to real data from the power market and outperforms the backward variable selection in terms of price forecasting accuracy.
Citations: 29
Discriminative frequent subgraph mining with optimality guarantees
IF 1.3; Zone 4; Mathematics
Statistical Analysis and Data Mining; Pub Date: 2010-10-01; DOI: 10.1002/SAM.V3:5
Marisa Thoma, Hong Cheng, A. Gretton, Jiawei Han, H. Kriegel, Alex Smola, Le Song, Philip S. Yu, Xifeng Yan, Karsten M. Borgwardt
Abstract: The goal of frequent subgraph mining is to detect subgraphs that frequently occur in a dataset of graphs. In classification settings, one is often interested in discovering discriminative frequent subgraphs, whose presence or absence is indicative of the class membership of a graph. In this article, we propose an approach to feature selection on frequent subgraphs, called CORK, that combines two central advantages. First, it optimizes a submodular quality criterion, which means that we can obtain a near-optimal solution using greedy feature selection. Second, our submodular quality criterion can be integrated into gSpan, the state-of-the-art tool for frequent subgraph mining, and help to prune the search space for discriminative frequent subgraphs even during frequent subgraph mining. Copyright © 2010 Wiley Periodicals, Inc. Statistical Analysis and Data Mining 3: 302-318, 2010
Citations: 23
Multicategory Composite Least Squares Classifiers.
IF 1.3; Zone 4; Mathematics
Statistical Analysis and Data Mining; Pub Date: 2010-08-01; DOI: 10.1002/sam.10081
Seo Young Park, Yufeng Liu, Dacheng Liu, Paul Scholl
Abstract: Classification is a very useful statistical tool for information extraction. In particular, multicategory classification is commonly seen in various applications. Although binary classification problems are heavily studied, extensions to the multicategory case are much less so. In view of the increased complexity and volume of modern statistical problems, it is desirable to have multicategory classifiers that are able to handle problems with high dimensions and with a large number of classes. Moreover, it is necessary to have sound theoretical properties for the multicategory classifiers. In the literature, there exist several different versions of simultaneous multicategory Support Vector Machines (SVMs). However, the computation of the SVM can be difficult for large scale problems, especially for problems with a large number of classes. Furthermore, the SVM cannot produce class probability estimation directly. In this article, we propose a novel efficient multicategory composite least squares classifier (CLS classifier), which utilizes a new composite squared loss function. The proposed CLS classifier has several important merits: efficient computation for problems with a large number of classes, asymptotic consistency, ability to handle high dimensional data, and simple conditional class probability estimation. Our simulated and real examples demonstrate competitive performance of the proposed approach.
Citations: 4
Large-scale regression-based pattern discovery: The example of screening the WHO global drug safety database
IF 1.3; Zone 4; Mathematics
Statistical Analysis and Data Mining; Pub Date: 2010-08-01; DOI: 10.1002/SAM.V3:4
O. Caster, G. N. Norén, D. Madigan, A. Bate
Abstract: Most measures of interestingness for patterns of co-occurring events are based on data projections onto contingency tables for the events of primary interest. As an alternative, this article presents the first implementation of shrinkage logistic regression for large-scale pattern discovery, with an evaluation of its usefulness in real-world binary transaction data. Regression accounts for the impact of other covariates that may confound or otherwise distort associations. The application considered is international adverse drug reaction (ADR) surveillance, in which large collections of reports on suspected ADRs are screened for interesting reporting patterns worthy of clinical follow-up. Our results show that regression-based pattern discovery does offer practical advantages. Specifically it can eliminate false positives and false negatives due to other covariates. Furthermore, it identifies some established drug safety issues earlier than a measure based on contingency tables. While regression offers clear conceptual advantages, our results suggest that methods based on contingency tables will continue to play a key role in ADR surveillance, for two reasons: the failure of regression to identify some established drug safety concerns as early as the currently used measures, and the relative lack of transparency of the procedure to estimate the regression coefficients. This suggests shrinkage regression should be used in parallel to existing measures of interestingness in ADR surveillance and other large-scale pattern discovery applications. Copyright © 2010 Wiley Periodicals, Inc. Statistical Analysis and Data Mining 3: 197-208, 2010
Citations: 35