{"title":"Semiparametric estimation of average treatment effects in observational studies","authors":"Jun Wang, Yujiao Guo","doi":"10.1002/sam.11688","DOIUrl":"https://doi.org/10.1002/sam.11688","url":null,"abstract":"We propose a semiparametric method to estimate average treatment effects in observational studies based on the assumption of unconfoundedness. We assume that the propensity score model and the outcome model are general single index models, which are estimated by the kernel method, with the unknown index parameter estimated via the linearized maximum rank correlation method. The proposed estimator is computationally tractable, allows for large‐dimensional covariates, and does not involve the approximation of link functions. We show that the proposed estimator is consistent and asymptotically normally distributed. In general, the proposed estimator is superior to existing methods when the model is incorrectly specified. We also provide an empirical analysis of the average treatment effect and the average treatment effect on the treated of 401(k) eligibility on net financial assets.","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"133 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-05-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141062785","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Prior effective sample size for exponential family distributions with multiple parameters","authors":"Ryota Tamanoi","doi":"10.1002/sam.11685","DOIUrl":"https://doi.org/10.1002/sam.11685","url":null,"abstract":"The setting of priors is an important issue in Bayesian analysis. In particular, when external information is applied, a prior with too much information can dominate the posterior inferences. To prevent this effect, the effective sample size (ESS) can be used. Various ESSs have been proposed recently; however, all have the problem of limiting the applicable prior distributions. For example, one ESS can only be used with a prior that can be approximated by a normal distribution, and another ESS cannot be applied when the parameters are multidimensional. We propose an ESS that can be applied to more prior distributions when the sampling model belongs to an exponential family (including the normal model and logistic regression models). This ESS is predictively consistent and can be used with multidimensional parameters. It is confirmed from normally distributed data with Student's‐t priors that this ESS behaves as well as an existing predictively consistent ESS for one‐parameter exponential families. As examples of multivariate parameters, ESSs for linear and logistic regression models are also discussed.","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"16 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-05-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140933289","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Density estimation via measure transport: Outlook for applications in the biological sciences","authors":"Vanessa López‐Marrero, Patrick R. Johnstone, Gilchan Park, Xihaier Luo","doi":"10.1002/sam.11687","DOIUrl":"https://doi.org/10.1002/sam.11687","url":null,"abstract":"One among several advantages of measure transport methods is that they allow for a unified framework for processing and analysis of data distributed according to a wide class of probability measures. Within this context, we present results from computational studies aimed at assessing the potential of measure transport techniques, specifically, the use of triangular transport maps, as part of a workflow intended to support research in the biological sciences. Scenarios characterized by the availability of a limited amount of sample data, which are common in domains such as radiation biology, are of particular interest. We find that when estimating a distribution density function given a limited amount of sample data, adaptive transport maps are advantageous. In particular, statistics gathered from computing a series of adaptive transport maps, trained on a series of randomly chosen subsets of the set of available data samples, lead to uncovering information hidden in the data. As a result, in the radiation biology application considered here, this approach provides a tool for generating hypotheses about gene relationships and their dynamics under radiation exposure.","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"10 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140839558","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Individualized image region detection with total variation","authors":"Sanyou Wu, Fuying Wang, Long Feng","doi":"10.1002/sam.11684","DOIUrl":"https://doi.org/10.1002/sam.11684","url":null,"abstract":"Medical image data have emerged as an indispensable component of modern medicine. Different from many general image problems that focus on outcome prediction or image recognition, medical image analysis pays more attention to model interpretation. For instance, given a list of medical images and corresponding labels of patients' health status, it is often of greater importance to identify the image regions that could differentiate the outcome status, compared to simply predicting labels of new images. Moreover, medical image data often demonstrate strong individual heterogeneity. In other words, the image regions associated with an outcome could be different across patients. As a consequence, the traditional one‐model‐fits‐all approach not only omits patient heterogeneity but also possibly leads to misleading or even wrong conclusions. In this article, we introduce a novel statistical framework to detect individualized regions that are associated with a binary outcome, that is, whether a patient has a certain disease or not. Moreover, we propose a total variation‐based penalization for individualized image region detection under a local label‐free scenario. Considering that local labeling is often difficult to obtain for medical image data, our approach may potentially have a wider range of applications in medical research. The effectiveness of our proposed approach is validated by two real histopathology databases: Colon Cancer and Camelyon16.","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"105 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140839503","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The analysis of association rules: Latent class analysis","authors":"Ron S. Kenett, Chris Gotwalt","doi":"10.1002/sam.11686","DOIUrl":"https://doi.org/10.1002/sam.11686","url":null,"abstract":"Association rules are used to extract information from transactional databases with a collection of items also called “tokens” or “words.” The aim of association rule analysis is to indicate what and how items go with what items in a set of transactions called “documents.” This approach is used in the analysis of text records, of blogs in social media and of shopping baskets. We present here an approach to analyze documents using latent class analysis (LCA) clustering of document term matrices. A document term matrix (DTM) consists of rows referring to documents and columns corresponding to items. In binary weights, “1” indicates the presence of a term in a document and “0” otherwise. The clustering of similar documents provides stratified data sets used to enhance the interpretability of measures of interest such as lift, odds ratios and relative linkage disequilibrium. The article demonstrates the approach with two case studies. A first example consists of comments recorded in a survey aimed at pet owners. A second, much larger example, is based on online reviews of Crocs sandals. Association rules describe combinations of terms in the pet survey and Crocs reviews. In Section 3, we compute, for these case studies, association rule measures of interest defined in Section 2. We first introduce the case studies to motivate the methods proposed here. In Section 4, we provide a new approach with enhanced interpretations of measures such as lift by comparing them across clusters derived from an LCA of the DTM. A key result is the application of clustered data in analyzing observational data. This enhances generalizability and interpretability of findings from text analytics. The article concludes with a discussion in Section 5.","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"104 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140839390","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bayesian relative composite quantile regression approach of ordinal latent regression model with L1/2 regularization","authors":"Tian Yu-Zhu, Wu Chun-Ho, Tai Ling-Nan, Mian Zhi-Bao, Tian Mao-Zai","doi":"10.1002/sam.11683","DOIUrl":"https://doi.org/10.1002/sam.11683","url":null,"abstract":"Ordinal data frequently occur in various fields such as knowledge level assessment, credit rating, clinical disease diagnosis, and psychological evaluation. The classic models including cumulative logistic regression or probit regression are often used to model such ordinal data. But these modeling approaches conditionally depict the mean characteristic of response variable on a cluster of predictive variables, which often results in non-robust estimation results. As a considerable alternative, composite quantile regression (CQR) approach is usually employed to gain more robust and relatively efficient results. In this paper, we propose a Bayesian CQR modeling approach for ordinal latent regression model. In order to overcome the recognizability problem of the considered model and obtain more robust estimation results, we advocate using the Bayesian relative CQR approach to estimate regression parameters. Additionally, in regression modeling, it is a highly desirable task to obtain a parsimonious model that retains only important covariates. We incorporate the Bayesian $L_{1/2}$ penalty into the ordinal latent CQR regression model to simultaneously conduct parameter estimation and variable selection. Finally, the proposed Bayesian relative CQR approach is illustrated by Monte Carlo simulations and a real data application. Simulation results and real data examples show that the suggested Bayesian relative CQR approach has good performance for the ordinal regression models.","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"207 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-04-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140593189","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A treeless absolutely random forest with closed‐form estimators of expected proximities","authors":"Eugene Laska, Ziqiang Lin, Carole Siegel, Charles Marmar","doi":"10.1002/sam.11678","DOIUrl":"https://doi.org/10.1002/sam.11678","url":null,"abstract":"We introduce a simple variant of a purely random forest, called an absolute random forest (ARF) used for clustering. At every node, splits of units are determined by a randomly chosen feature and a random threshold drawn from a uniform distribution whose support, the range of the selected feature in the root node, does not change. This enables closed‐form estimators of parameters, such as pairwise proximities, to be obtained without having to grow a forest. The probabilistic structure corresponding to an ARF is called a treeless absolute random forest (TARF). With high probability, the algorithm will split units whose feature vectors are far apart and keep together units whose feature vectors are similar. Thus, the underlying structure of the data drives the growth of the tree. The expected value of pairwise proximities is obtained for three pathway functions. One, a completely common pathway function, is an indicator of whether a pair of units follow the same path from the root to the leaf node. The properties of TARF‐based proximity estimators for clustering and classification are compared to other methods in eight real‐world datasets and in simulations. Results show substantial performance and computing efficiencies of particular value for large datasets.","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"37 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-04-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140593227","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Transfer learning under the Cox model with interval‐censored data","authors":"Mengqi Xie, Tao Hu, Jie Zhou","doi":"10.1002/sam.11680","DOIUrl":"https://doi.org/10.1002/sam.11680","url":null,"abstract":"Transfer learning, focusing on information borrowing to address limited sample size issues, has gained increasing attention in recent years. Our method aims to utilize data from other population groups as a complement to enhance risk factor discernment and failure time prediction among underrepresented subgroups. However, a literature gap exists in effective knowledge transfer from the source to the target for risk assessment with interval‐censored data while accommodating population incomparability and privacy constraints. Our objective is to bridge this gap by developing a transfer learning approach under the Cox proportional hazards model. We introduce the tuning‐free Trans‐Cox‐MIC algorithm, enabling adaptable information sharing in regression coefficients and baseline hazards, while ensuring computational efficiency. Our approach accommodates covariate distribution shifts, coefficient variations, and baseline hazard discrepancies. Extensive simulations showcase the method's accuracy, robustness, and efficiency. Application to the prostate cancer screening data demonstrates enhanced risk estimation precision and predictive performance in the African American population.","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"58 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-04-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140593188","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Randomized multiarm bandits: An improved adaptive data collection method","authors":"Zhigen Zhao, Tong Wang, Bo Ji","doi":"10.1002/sam.11681","DOIUrl":"https://doi.org/10.1002/sam.11681","url":null,"abstract":"In many scientific experiments, multiarmed bandits are used as an adaptive data collection method. However, this adaptive process can lead to a dependence that renders many commonly used statistical inference methods invalid. An example of this is the sample mean, which is a natural estimator of the mean parameter but can be biased. This can cause test statistics based on this estimator to have an inflated type I error rate, and the resulting confidence intervals may have significantly lower coverage probabilities than their nominal values. To address this issue, we propose an alternative approach called randomized multiarm bandits (rMAB). This combines a randomization step with a chosen MAB algorithm, and by selecting the randomization probability appropriately, optimal regret can be achieved asymptotically. Numerical evidence shows that the bias of the sample mean based on the rMAB is much smaller than that of other methods. The test statistic and confidence interval produced by this method also perform much better than its competitors.","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"207 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-04-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140571179","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}