{"title":"A Bayesian model for estimating Sustainable Development Goal indicator 4.1.2: School completion rates","authors":"Ameer Dharamshi, Bilal Barakat, Leontine Alkema, Manos Antoninis","doi":"10.1111/rssc.12595","DOIUrl":"10.1111/rssc.12595","url":null,"abstract":"<p>Estimating school completion is crucial for monitoring Sustainable Development Goal (SDG) 4 on education. The recently introduced SDG indicator 4.1.2, defined as the percentage of children aged 3–5 years above the expected completion age of a given level of education that have completed the respective level, differs from enrolment indicators in that it relies primarily on household surveys. This introduces a number of challenges including gaps between survey waves, conflicting estimates, age misreporting and delayed completion. We introduce the Adjusted Bayesian Completion Rates (ABCR) model to address these challenges and produce the first complete and consistent time series for SDG indicator 4.1.2, by school level and sex, for 164 countries. Validation exercises indicate that the model appears well-calibrated and offers a meaningful improvement over simpler approaches in predictive performance. The ABCR model is now used by the United Nations to monitor completion rates for all countries with available survey data.</p>","PeriodicalId":49981,"journal":{"name":"Journal of the Royal Statistical Society Series C-Applied Statistics","volume":"71 5","pages":"1822-1864"},"PeriodicalIF":1.6,"publicationDate":"2022-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://rss.onlinelibrary.wiley.com/doi/epdf/10.1111/rssc.12595","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72417219","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient estimation of the marginal mean of recurrent events","authors":"Giuliana Cortese, Thomas H. Scheike","doi":"10.1111/rssc.12586","DOIUrl":"10.1111/rssc.12586","url":null,"abstract":"<p>Recurrent events are often encountered in clinical and epidemiological studies where a terminal event is also observed. With recurrent events data it is of great interest to estimate the marginal mean of the cumulative number of recurrent events experienced prior to the terminal event. The standard nonparametric estimator was suggested in Cook and Lawless and further developed in Ghosh and Lin. We here investigate the efficiency of this estimator that, surprisingly, has not been studied before. We rewrite the standard estimator as an inverse probability of censoring weighted estimator. From this representation we derive an efficient augmented estimator using efficient estimation theory for right-censored data. We show that the standard estimator is efficient in settings with no heterogeneity. In other settings with different sources of heterogeneity, we show theoretically and by simulations that the efficiency can be greatly improved when an efficient augmented estimator based on dynamic predictions is employed, at no extra cost to robustness. The estimators are applied and compared to study the mean number of catheter-related bloodstream infections in heterogeneous patients with chronic intestinal failure who can possibly die, and the efficiency gain is highlighted in the resulting point-wise confidence intervals.</p>","PeriodicalId":49981,"journal":{"name":"Journal of the Royal Statistical Society Series C-Applied Statistics","volume":"71 5","pages":"1787-1821"},"PeriodicalIF":1.6,"publicationDate":"2022-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://rss.onlinelibrary.wiley.com/doi/epdf/10.1111/rssc.12586","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79223958","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Contour models for physical boundaries enclosing star-shaped and approximately star-shaped polygons","authors":"Hannah M. Director, Adrian E. Raftery","doi":"10.1111/rssc.12592","DOIUrl":"10.1111/rssc.12592","url":null,"abstract":"<p>Boundaries on spatial fields divide regions with particular features from surrounding background areas. Methods to identify boundary lines from interpolated spatial fields are well established. Less attention has been paid to how to model sequences of connected spatial points. Such models are needed for physical boundaries. For example, in the Arctic ocean, large contiguous areas are covered by sea ice, or frozen ocean water. We define the ice edge contour as the ordered sequences of spatial points that connect to form a line around set(s) of contiguous grid boxes with sea ice present. Polar scientists need to describe how this contiguous area behaves in present and historical data and under future climate change scenarios. We introduce the Gaussian Star-shaped Contour Model (GSCM) for modelling boundaries represented as connected sequences of spatial points such as the sea ice edge. GSCMs generate sequences of spatial points via generating sets of distances in various directions from a fixed starting point. The GSCM can be applied to contours that enclose regions that are star-shaped polygons or approximately star-shaped polygons. Metrics are introduced to assess the extent to which a polygon deviates from star-shapedness. Simulation studies illustrate the performance of the GSCM in different situations.</p>","PeriodicalId":49981,"journal":{"name":"Journal of the Royal Statistical Society Series C-Applied Statistics","volume":"71 5","pages":"1688-1720"},"PeriodicalIF":1.6,"publicationDate":"2022-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89579451","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Sequential one-step estimator by sub-sampling for customer churn analysis with massive data sets","authors":"Feifei Wang, Danyang Huang, Tianchen Gao, Shuyuan Wu, Hansheng Wang","doi":"10.1111/rssc.12597","DOIUrl":"10.1111/rssc.12597","url":null,"abstract":"<p>Customer churn is one of the most important concerns for large companies. Currently, massive data are often encountered in customer churn analysis, which bring new challenges for model computation. To cope with these concerns, sub-sampling methods are often used to accomplish data analysis tasks of large scale. To cover more informative samples in one sampling round, classic sub-sampling methods need to compute <i>non-uniform</i> sampling probabilities for all data points. However, this method creates a huge computational burden for data sets of large scale and therefore, is not applicable in practice. In this study, we propose a sequential one-step (SOS) estimation method based on repeated sub-sampling data sets. In the SOS method, data points need to be sampled only with <i>uniform</i> probabilities, and the sampling step is conducted repeatedly. In each sampling step, a new estimate is computed via one-step updating based on the newly sampled data points. This leads to a sequence of estimates, of which the final SOS estimate is their average. We theoretically show that both the bias and the standard error of the SOS estimator can decrease with increasing sub-sampling sizes or sub-sampling times. The finite sample SOS performances are assessed through simulations. Finally, we apply this SOS method to analyse a real large-scale customer churn data set in a securities company. The results show that the SOS method has good interpretability and prediction power in this real application.</p>","PeriodicalId":49981,"journal":{"name":"Journal of the Royal Statistical Society Series C-Applied Statistics","volume":"71 5","pages":"1753-1786"},"PeriodicalIF":1.6,"publicationDate":"2022-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88578893","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The saturated pairwise interaction Gibbs point process as a joint species distribution model","authors":"Ian Flint, Nick Golding, Peter Vesk, Yan Wang, Aihua Xia","doi":"10.1111/rssc.12596","DOIUrl":"10.1111/rssc.12596","url":null,"abstract":"<p>In an effort to effectively model observed patterns in the spatial configuration of individuals of multiple species in nature, we introduce the saturated pairwise interaction Gibbs point process. Its main strength lies in its ability to model both attraction and repulsion within and between species, over different scales. As such, it is particularly well-suited to the study of associations in complex ecosystems. Based on the existing literature, we provide an easy to implement fitting procedure as well as a technique to make inference for the model parameters. We also prove that under certain hypotheses the point process is locally stable, which allows us to use the well-known ‘coupling from the past’ algorithm to draw samples from the model. Different numerical experiments show the robustness of the model. We study three different ecological data sets, demonstrating in each one that our model helps disentangle competing ecological effects on species' distribution.</p>","PeriodicalId":49981,"journal":{"name":"Journal of the Royal Statistical Society Series C-Applied Statistics","volume":"71 5","pages":"1721-1752"},"PeriodicalIF":1.6,"publicationDate":"2022-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://rss.onlinelibrary.wiley.com/doi/epdf/10.1111/rssc.12596","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89252881","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Score test for assessing the conditional dependence in latent class models and its application to record linkage","authors":"Huiping Xu, Xiaochun Li, Zuoyi Zhang, Shaun Grannis","doi":"10.1111/rssc.12590","DOIUrl":"10.1111/rssc.12590","url":null,"abstract":"<p>The Fellegi–Sunter model has been widely used in probabilistic record linkage despite its often invalid conditional independence assumption. Prior research has demonstrated that conditional dependence latent class models yield improved match performance when using the correct conditional dependence structure. With a misspecified conditional dependence structure, these models can yield worse performance. It is, therefore, critically important to correctly identify the conditional dependence structure. Existing methods for identifying the conditional dependence structure include the correlation residual plot, the log-odds ratio check, and the bivariate residual, all of which have been shown to perform inadequately. Bootstrap bivariate residual approach and score test have also been proposed and found to have better performance, with the score test having greater power and lower computational burden. In this paper, we extend the score-test-based approach to account for different conditional dependence structures. Through a simulation study, we develop practical recommendations on the utilisation of the score test and assess the match performance with conditional dependence identified by the proposed method. Performance of the proposed method is further evaluated using a real-world record linkage example. Findings show that the proposed method leads to improved matching accuracy relative to the Fellegi–Sunter model.</p>","PeriodicalId":49981,"journal":{"name":"Journal of the Royal Statistical Society Series C-Applied Statistics","volume":"71 5","pages":"1663-1687"},"PeriodicalIF":1.6,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82870632","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Leveraging network structure to improve pooled testing efficiency","authors":"Daniel K. Sewell","doi":"10.1111/rssc.12594","DOIUrl":"10.1111/rssc.12594","url":null,"abstract":"<p>Screening is a powerful tool for infection control, allowing for infectious individuals, whether they be symptomatic or asymptomatic, to be identified and isolated. The resource burden of regular and comprehensive screening can often be prohibitive, however. One such measure to address this is pooled testing, whereby groups of individuals are each given a composite test; should a group receive a positive diagnostic test result, those comprising the group are then tested individually. Infectious disease is spread through a transmission network, and this paper shows how assigning individuals to pools based on this underlying network can improve the efficiency of the pooled testing strategy, thereby reducing the resource burden. We designed a simulated annealing algorithm to improve the pooled testing efficiency as measured by the ratio of the expected number of correct classifications to the expected number of tests performed. We then evaluated our approach using an agent-based model designed to simulate the spread of SARS-CoV-2 in a school setting. Our results suggest that our approach can decrease the number of tests required to regularly screen the student body, and that these reductions are quite robust to assigning pools based on partially observed or noisy versions of the network.</p>","PeriodicalId":49981,"journal":{"name":"Journal of the Royal Statistical Society Series C-Applied Statistics","volume":"71 5","pages":"1648-1662"},"PeriodicalIF":1.6,"publicationDate":"2022-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_pdf/0b/29/RSSC-71-1648.PMC9826453.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10257743","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Semi-parametric time-to-event modelling of lengths of hospital stays","authors":"Yang Li, Hao Liu, Xiaoshen Wang, Wanzhu Tu","doi":"10.1111/rssc.12593","DOIUrl":"10.1111/rssc.12593","url":null,"abstract":"<p>Length of stay (LOS) is an essential metric for the quality of hospital care. Published works on LOS analysis have primarily focused on skewed LOS distributions and the influences of patient diagnostic characteristics. Few authors have considered the events that terminate a hospital stay: Both successful discharge and death could end a hospital stay but with completely different implications. Modelling the time to the first occurrence of discharge or death obscures the true nature of LOS. In this research, we propose a structure that simultaneously models the probabilities of discharge and death. The model has a flexible formulation that accounts for both additive and multiplicative effects of factors influencing the occurrence of death and discharge. We present asymptotic properties of the parameter estimates so that valid inference can be performed for the parametric as well as nonparametric model components. Simulation studies confirmed the good finite-sample performance of the proposed method. As the research is motivated by practical issues encountered in LOS analysis, we analysed data from two real clinical studies to showcase the general applicability of the proposed model.</p>","PeriodicalId":49981,"journal":{"name":"Journal of the Royal Statistical Society Series C-Applied Statistics","volume":"71 5","pages":"1623-1647"},"PeriodicalIF":1.6,"publicationDate":"2022-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_pdf/e7/b9/RSSC-71-1623.PMC9826400.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10525190","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Utility-based Bayesian personalized treatment selection for advanced breast cancer","authors":"Juhee Lee, Peter F. Thall, Bora Lim, Pavlos Msaouel","doi":"10.1111/rssc.12582","DOIUrl":"10.1111/rssc.12582","url":null,"abstract":"<p>A Bayesian method is proposed for personalized treatment selection in settings where data are available from a randomized clinical trial with two or more outcomes. The motivating application is a randomized trial that compared letrozole plus bevacizumab to letrozole alone as first-line therapy for hormone receptor-positive advanced breast cancer. The combination treatment arm had larger median progression-free survival time, but also a higher rate of severe toxicities. This suggests that the risk-benefit trade-off between these two outcomes should play a central role in selecting each patient's treatment, particularly since older patients are less likely to tolerate severe toxicities. To quantify the desirability of each possible outcome combination for an individual patient, we elicited from breast cancer oncologists a utility function that varied with age. The utility was used as an explicit criterion for quantifying risk-benefit trade-offs when making personalized treatment selections. A Bayesian nonparametric multivariate regression model with a dependent Dirichlet process prior was fit to the trial data. Under the fitted model, a new patient's treatment can be selected based on the posterior predictive utility distribution. For the breast cancer trial dataset, the optimal treatment depends on the patient's age, with the combination preferable for patients 70 years or younger and the single agent preferable for patients older than 70.</p>","PeriodicalId":49981,"journal":{"name":"Journal of the Royal Statistical Society Series C-Applied Statistics","volume":"71 5","pages":"1605-1622"},"PeriodicalIF":1.6,"publicationDate":"2022-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10116488","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Measuring diachronic sense change: New models and Monte Carlo methods for Bayesian inference","authors":"Schyan Zafar, Geoff K. Nicholls","doi":"10.1111/rssc.12591","DOIUrl":"10.1111/rssc.12591","url":null,"abstract":"<p>In a bag-of-words model, the <i>senses</i> of a word with multiple meanings, for example ‘bank’ (used either in a river-bank or an institution sense), are represented as probability distributions over context words, and sense prevalence is represented as a probability distribution over senses. Both of these may change with time. Modelling and measuring this kind of sense change are challenging due to the typically high-dimensional parameter space and sparse datasets. A recently published corpus of ancient Greek texts contains expert-annotated sense labels for selected target words. Automatic sense-annotation for the word ‘kosmos’ (meaning decoration, order or world) has been used as a test case in recent work with related generative models and Monte Carlo methods. We adapt an existing generative sense change model to develop a simpler model for the main effects of sense and time, and give Markov Chain Monte Carlo methods for Bayesian inference on all these models that are more efficient than existing methods. We carry out automatic sense-annotation of snippets containing ‘kosmos’ using our model, and measure the time-evolution of its three senses and their prevalence. As far as we are aware, ours is the first analysis of this data, within the class of generative models we consider, that quantifies uncertainty and returns credible sets for evolving sense prevalence in good agreement with those given by expert annotation.</p>","PeriodicalId":49981,"journal":{"name":"Journal of the Royal Statistical Society Series C-Applied Statistics","volume":"71 5","pages":"1569-1604"},"PeriodicalIF":1.6,"publicationDate":"2022-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://rss.onlinelibrary.wiley.com/doi/epdf/10.1111/rssc.12591","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82774142","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}