{"title":"Role of survey response rates on valid inference: an application to HIV prevalence estimates.","authors":"Miguel Marino, Marcello Pagano","doi":"10.1186/s12982-018-0074-x","DOIUrl":"10.1186/s12982-018-0074-x","url":null,"abstract":"<p><strong>Background: </strong>Nationally-representative surveys suggest that females have a higher prevalence of HIV than males in most African countries. Unfortunately, these results are made on the basis of surveys with non-ignorable missing data. This study evaluates the impact that differential survey nonresponse rates between males and females can have on the point estimate of the HIV prevalence ratio of these two classifiers.</p><p><strong>Methods: </strong>We study 29 Demographic and Health Surveys (DHS) from 2001 to 2010. Instead of employing often-used multiple imputation models with a Missing at Random assumption that may not hold in this setting, we assess the effect of ignoring the information contained in the missing HIV information for males and females through three proposed statistical measures. These measures can be used in settings where the interest is comparing the prevalence of a disease between two groups. The proposed measures do not utilize parametric models and can be implemented by researchers of any level. They are: (1) an upper bound on the potential bias of the usual practice of using reported HIV prevalence estimates that ignore subjects who have missing HIV outcomes; (2) plausible range intervals to account for nonresponses, without any additional parametric modeling assumptions; and (3) prevalence ratio inflation factors to correct the point estimate of the HIV prevalence ratio, if estimates of nonresponders' HIV prevalences were known.</p><p><strong>Results: </strong>In 86% of countries, males have higher upper bounds of HIV prevalence than females; this is consonant with males possibly having higher infection rates than females. 
Additionally, 74% of surveys have a <i>plausible</i> range that crosses 1.0, suggesting a plausible equivalence between male and female HIV prevalences.</p><p><strong>Conclusions: </strong>It is quite reasonable to conclude that there is so much DHS nonresponse in evaluating the HIV status question, that existing data is plausibly generated by the situation where the virus is equally distributed between the sexes.</p>","PeriodicalId":39896,"journal":{"name":"Emerging Themes in Epidemiology","volume":"15 ","pages":"6"},"PeriodicalIF":3.6,"publicationDate":"2018-03-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5839032/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"35903247","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Modelling fertility in rural South Africa with combined nonlinear parametric and semi-parametric methods.","authors":"Robert W Eyre, Thomas House, F Xavier Gómez-Olivé, Frances E Griffiths","doi":"10.1186/s12982-018-0073-y","DOIUrl":"https://doi.org/10.1186/s12982-018-0073-y","url":null,"abstract":"<p><strong>Background: </strong>Central to the study of populations, and therefore to the analysis of the development of countries undergoing major transitions, is the calculation of fertility patterns and their dependence on different variables such as age, education, and socio-economic status. Most epidemiological research on these matters relies on the often unjustified assumption of (generalised) linearity, or alternatively makes a parametric assumption (e.g. for age-patterns).</p><p><strong>Methods: </strong>We consider nonlinearity of fertility in the covariates by combining an established nonlinear parametric model for fertility over age with nonlinear modelling of fertility over other covariates. For the latter, we use the semi-parametric method of Gaussian process regression, which is a popular methodology in many fields including machine learning, computer science, and systems biology. We applied the method to data from the Agincourt Health and Socio-Demographic Surveillance System, annual census rounds performed on a poor rural region of South Africa since 1992, to analyse fertility patterns over age and socio-economic status.</p><p><strong>Results: </strong>We capture a previously established age-pattern of fertility, whilst being able to more robustly model the relationship between fertility and socio-economic status without unjustified a priori assumptions of linearity. 
Peak fertility over age is shown to be increasing over time, as well as for adolescents but not for those later in life for whom fertility is generally decreasing over time.</p><p><strong>Conclusions: </strong>Combining Gaussian process regression with nonlinear parametric modelling of fertility over age allowed for the incorporation of further covariates into the analysis without needing to assume a linear relationship. This enabled us to provide further insights into the fertility patterns of the Agincourt study area, in particular the interaction between age and socio-economic status.</p>","PeriodicalId":39896,"journal":{"name":"Emerging Themes in Epidemiology","volume":"15 ","pages":"5"},"PeriodicalIF":2.3,"publicationDate":"2018-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1186/s12982-018-0073-y","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"35885842","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Novel metrics for growth model selection.","authors":"Matthew R Grigsby, Junrui Di, Andrew Leroux, Vadim Zipunnikov, Luo Xiao, Ciprian Crainiceanu, William Checkley","doi":"10.1186/s12982-018-0072-z","DOIUrl":"10.1186/s12982-018-0072-z","url":null,"abstract":"<p><strong>Background: </strong>Literature surrounding the statistical modeling of childhood growth data involves a diverse set of potential models from which investigators can choose. However, the lack of a comprehensive framework for comparing non-nested models leads to difficulty in assessing model performance. This paper proposes a framework for comparing non-nested growth models using novel metrics of predictive accuracy based on modifications of the mean squared error criteria.</p><p><strong>Methods: </strong>Three metrics were created: normalized, age-adjusted, and weighted mean squared error (MSE). Predictive performance metrics were used to compare linear mixed effects models and functional regression models. Prediction accuracy was assessed by partitioning the observed data into training and test datasets. This partitioning was constructed to assess prediction accuracy for backward (i.e., early growth), forward (i.e., late growth), in-range, and on new-individuals. Analyses were done with height measurements from 215 Peruvian children with data spanning from near birth to 2 years of age.</p><p><strong>Results: </strong>Functional models outperformed linear mixed effects models in all scenarios tested. In particular, prediction errors for functional concurrent regression (FCR) and functional principal component analysis models were approximately 6% lower when compared to linear mixed effects models. 
When we weighted subject-specific MSEs according to subject-specific growth rates during infancy, we found that FCR was the best performer in all scenarios.</p><p><strong>Conclusion: </strong>With this novel approach, we can quantitatively compare non-nested models and weight subgroups of interest to select the best performing growth model for a particular application or problem at hand.</p>","PeriodicalId":39896,"journal":{"name":"Emerging Themes in Epidemiology","volume":"15 ","pages":"4"},"PeriodicalIF":2.3,"publicationDate":"2018-02-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5824542/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"35865435","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Effect of correcting for gestational age at birth on population prevalence of early childhood undernutrition.","authors":"Nandita Perumal, Daniel E Roth, Johnna Perdrizet, Aluísio J D Barros, Iná S Santos, Alicia Matijasevich, Diego G Bassani","doi":"10.1186/s12982-018-0070-1","DOIUrl":"10.1186/s12982-018-0070-1","url":null,"abstract":"<p><strong>Background: </strong>Postmenstrual and/or gestational age-corrected age (CA) is required to apply child growth standards to children born preterm (< 37 weeks gestational age). Yet, CA is rarely used in epidemiologic studies in low- and middle-income countries (LMICs), which may bias population estimates of childhood undernutrition. To evaluate the effect of accounting for GA in the application of growth standards, we used GA-specific standards at birth (INTERGROWTH-21st newborn size standards) in conjunction with CA for preterm-born children in the application of World Health Organization Child Growth Standards postnatally (referred to as 'CA' strategy) versus postnatal age for all children, to estimate mean length-for-age (LAZ) and weight-for-age (WAZ) <i>z</i> scores at 0, 3, 12, 24, and 48-months of age in the 2004 Pelotas (Brazil) Birth Cohort.</p><p><strong>Results: </strong>At birth (n = 4066), mean LAZ was higher and the prevalence of stunting (LAZ < -2) was lower using CA versus postnatal age (mean ± SD): - 0.36 ± 1.19 versus - 0.67 ± 1.32; and 8.3 versus 11.6%, respectively. Odds ratio (OR) and population attributable risk (PAR) of stunting due to preterm birth were attenuated and changed inferences using CA versus postnatal age at birth [OR, 95% confidence interval (CI): 1.32 (95% CI 0.95, 1.82) vs 14.7 (95% CI 11.7, 18.4); PAR 3.1 vs 42.9%]; differences in inferences persisted at 3-months. At 12, 24, and 48-months, preterm birth was associated with stunting, but ORs/PARs remained attenuated using CA compared to postnatal age. 
Findings were similar for weight-for-age <i>z</i> scores.</p><p><strong>Conclusions: </strong>Population-based epidemiologic studies in LMICs in which GA is unused or unavailable may overestimate the prevalence of early childhood undernutrition and inflate the fraction of undernutrition attributable to preterm birth.</p>","PeriodicalId":39896,"journal":{"name":"Emerging Themes in Epidemiology","volume":"15 ","pages":"3"},"PeriodicalIF":2.3,"publicationDate":"2018-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5799899/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"35830088","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Contextual factors in maternal and newborn health evaluation: a protocol applied in Nigeria, India and Ethiopia.","authors":"Kate Sabot, Tanya Marchant, Neil Spicer, Della Berhanu, Meenakshi Gautham, Nasir Umar, Joanna Schellenberg","doi":"10.1186/s12982-018-0071-0","DOIUrl":"https://doi.org/10.1186/s12982-018-0071-0","url":null,"abstract":"<p><strong>Background: </strong>Understanding the context of a health programme is important in interpreting evaluation findings and in considering the external validity for other settings. Public health researchers can be imprecise and inconsistent in their usage of the word \"context\" and its application to their work. This paper presents an approach to defining context, to capturing relevant contextual information and to using such information to help interpret findings from the perspective of a research group evaluating the effect of diverse innovations on coverage of evidence-based, life-saving interventions for maternal and newborn health in Ethiopia, Nigeria, and India.</p><p><strong>Methods: </strong>We define \"context\" as the background environment or setting of any program, and \"contextual factors\" as those elements of context that could affect implementation of a programme. Through a structured, consultative process, contextual factors were identified while trying to strike a balance between comprehensiveness and feasibility. Thematic areas included demographics and socio-economics, epidemiological profile, health systems and service uptake, infrastructure, education, environment, politics, policy and governance. We outline an approach for capturing and using contextual factors while maximizing use of existing data. Methods include desk reviews, secondary data extraction and key informant interviews. Outputs include databases of contextual factors and summaries of existing maternal and newborn health policies and their implementation. 
Use of contextual data will be qualitative in nature and may assist in interpreting findings in both quantitative and qualitative aspects of programme evaluation.</p><p><strong>Discussion: </strong>Applying this approach was more resource intensive than expected, in part because routinely available information was not consistently available across settings and more primary data collection was required than anticipated. Data was used only minimally, partly due to a lack of evaluation results that needed further explanation, but also because contextual data was not available for the precise units of analysis or time periods of interest. We would advise others to consider integrating contextual factors within other data collection activities, and to conduct regular reviews of maternal and newborn health policies. This approach and the learnings from its application could help inform the development of guidelines for the collection and use of contextual factors in public health evaluation.</p>","PeriodicalId":39896,"journal":{"name":"Emerging Themes in Epidemiology","volume":"15 ","pages":"2"},"PeriodicalIF":2.3,"publicationDate":"2018-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1186/s12982-018-0071-0","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"35830087","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An introduction to instrumental variable assumptions, validation and estimation.","authors":"Mette Lise Lousdal","doi":"10.1186/s12982-018-0069-7","DOIUrl":"https://doi.org/10.1186/s12982-018-0069-7","url":null,"abstract":"<p>The instrumental variable method has been employed within economics to infer causality in the presence of unmeasured confounding. Emphasising the parallels to randomisation may increase understanding of the underlying assumptions within epidemiology. An instrument is a variable that predicts exposure, but conditional on exposure shows no independent association with the outcome. The random assignment in trials is an example of what would be expected to be an ideal instrument, but instruments can also be found in observational settings with a naturally varying phenomenon e.g. geographical variation, physical distance to facility or physician's preference. The fourth identifying assumption has received less attention, but is essential for the generalisability of estimated effects. The instrument identifies the group of <i>compliers</i> in which exposure is pseudo-randomly assigned leading to exchangeability with regard to unmeasured confounders. Underlying assumptions can only partially be tested empirically and require subject-matter knowledge. 
Future studies employing instruments should carefully seek to validate all four assumptions, possibly drawing on parallels to randomisation.</p>","PeriodicalId":39896,"journal":{"name":"Emerging Themes in Epidemiology","volume":"15 ","pages":"1"},"PeriodicalIF":2.3,"publicationDate":"2018-01-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1186/s12982-018-0069-7","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"35782943","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multiple imputation using linked proxy outcome data resulted in important bias reduction and efficiency gains: a simulation study.","authors":"R P Cornish, J Macleod, J R Carpenter, K Tilling","doi":"10.1186/s12982-017-0068-0","DOIUrl":"https://doi.org/10.1186/s12982-017-0068-0","url":null,"abstract":"<p><strong>Background: </strong>When an outcome variable is missing not at random (MNAR: probability of missingness depends on outcome values), estimates of the effect of an exposure on this outcome are often biased. We investigated the extent of this bias and examined whether the bias can be reduced through incorporating proxy outcomes obtained through linkage to administrative data as auxiliary variables in multiple imputation (MI).</p><p><strong>Methods: </strong>Using data from the Avon Longitudinal Study of Parents and Children (ALSPAC) we estimated the association between breastfeeding and IQ (continuous outcome), incorporating linked attainment data (proxies for IQ) as auxiliary variables in MI models. Simulation studies explored the impact of varying the proportion of missing data (from 20 to 80%), the correlation between the outcome and its proxy (0.1-0.9), the strength of the missing data mechanism, and having a proxy variable that was incomplete.</p><p><strong>Results: </strong>Incorporating a linked proxy for the missing outcome as an auxiliary variable reduced bias and increased efficiency in all scenarios, even when 80% of the outcome was missing. Using an incomplete proxy was similarly beneficial. High correlations (> 0.5) between the outcome and its proxy substantially reduced the missing information. Consistent with this, ALSPAC analysis showed inclusion of a proxy reduced bias and improved efficiency. 
Gains with additional proxies were modest.</p><p><strong>Conclusions: </strong>In longitudinal studies with loss to follow-up, incorporating proxies for this study outcome obtained via linkage to external sources of data as auxiliary variables in MI models can give practically important bias reduction and efficiency gains when the study outcome is MNAR.</p>","PeriodicalId":39896,"journal":{"name":"Emerging Themes in Epidemiology","volume":"14 ","pages":"14"},"PeriodicalIF":2.3,"publicationDate":"2017-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1186/s12982-017-0068-0","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"35682082","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Flexible semiparametric joint modeling: an application to estimate individual lung function decline and risk of pulmonary exacerbations in cystic fibrosis.","authors":"Dan Li, Ruth Keogh, John P Clancy, Rhonda D Szczesniak","doi":"10.1186/s12982-017-0067-1","DOIUrl":"https://doi.org/10.1186/s12982-017-0067-1","url":null,"abstract":"<p><strong>Background: </strong>Epidemiologic surveillance of lung function is key to clinical care of individuals with cystic fibrosis, but lung function decline is nonlinear and often impacted by acute respiratory events known as pulmonary exacerbations. Statistical models are needed to simultaneously estimate lung function decline while providing risk estimates for the onset of pulmonary exacerbations, in order to identify relevant predictors of declining lung function and understand how these associations could be used to predict the onset of pulmonary exacerbations.</p><p><strong>Methods: </strong>Using longitudinal lung function (FEV<sub>1</sub>) measurements and time-to-event data on pulmonary exacerbations from individuals in the United States Cystic Fibrosis Registry, we implemented a flexible semiparametric joint model consisting of a mixed-effects submodel with regression splines to fit repeated FEV<sub>1</sub> measurements and a time-to-event submodel for possibly censored data on pulmonary exacerbations. We contrasted this approach with methods currently used in epidemiological studies and highlight clinical implications.</p><p><strong>Results: </strong>The semiparametric joint model had the best fit of all models examined based on deviance information criterion. Higher starting FEV<sub>1</sub> implied more rapid lung function decline in both separate and joint models; however, individualized risk estimates for pulmonary exacerbation differed depending upon model type. 
Based on shared parameter estimates from the joint model, which accounts for the nonlinear FEV<sub>1</sub> trajectory, patients with more positive rates of change were less likely to experience a pulmonary exacerbation (HR per one standard deviation increase in FEV<sub>1</sub> rate of change = 0.566, 95% CI 0.516-0.619), and having higher absolute FEV<sub>1</sub> also corresponded to lower risk of having a pulmonary exacerbation (HR per one standard deviation increase in FEV<sub>1</sub> = 0.856, 95% CI 0.781-0.937). At the population level, both submodels indicated significant effects of birth cohort, socioeconomic status and respiratory infections on FEV<sub>1</sub> decline, as well as significant effects of gender, socioeconomic status and birth cohort on pulmonary exacerbation risk.</p><p><strong>Conclusions: </strong>Through a flexible joint-modeling approach, we provide a means to simultaneously estimate lung function trajectories and the risk of pulmonary exacerbations for individual patients; we demonstrate how this approach offers additional insights into the clinical course of cystic fibrosis that were not possible using conventional approaches.</p>","PeriodicalId":39896,"journal":{"name":"Emerging Themes in Epidemiology","volume":"14 ","pages":"13"},"PeriodicalIF":2.3,"publicationDate":"2017-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1186/s12982-017-0067-1","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"35219501","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Spatial analysis of cluster randomised trials: a systematic review of analysis methods.","authors":"Christopher Jarvis, Gian Luca Di Tanna, Daniel Lewis, Neal Alexander, W John Edmunds","doi":"10.1186/s12982-017-0066-2","DOIUrl":"https://doi.org/10.1186/s12982-017-0066-2","url":null,"abstract":"<p><strong>Background: </strong>Cluster randomised trials (CRTs) often use geographical areas as the unit of randomisation, however explicit consideration of the location and spatial distribution of observations is rare. In many trials, the location of participants will have little importance, however in some, especially against infectious diseases, spillover effects due to participants being located close together may affect trial results. This review aims to identify spatial analysis methods used in CRTs and improve understanding of the impact of spatial effects on trial results.</p><p><strong>Methods: </strong>A systematic review of CRTs containing spatial methods, defined as a method that accounts for the structure, location, or relative distances between observations. We searched three sources: Ovid/Medline, Pubmed, and Web of Science databases. Spatial methods were categorised and details of the impact of spatial effects on trial results recorded.</p><p><strong>Results: </strong>We identified ten papers which met the inclusion criteria, comprising thirteen trials. We found that existing approaches fell into two categories; spatial variables and spatial modelling. The spatial variable approach was most common and involved standard statistical analysis of distance measurements. Spatial modelling is a more sophisticated approach which incorporates the spatial structure of the data within a random effects model. 
Studies tended to demonstrate the importance of accounting for location and distribution of observations in estimating unbiased effects.</p><p><strong>Conclusions: </strong>There have been a few attempts to control and estimate spatial effects within the context of human CRTs, but our overall understanding is limited. Although spatial effects may bias trial results, their consideration was usually a supplementary, rather than primary analysis. Further work is required to evaluate and develop the spatial methodologies relevant to a range of CRTs.</p>","PeriodicalId":39896,"journal":{"name":"Emerging Themes in Epidemiology","volume":"14 ","pages":"12"},"PeriodicalIF":2.3,"publicationDate":"2017-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1186/s12982-017-0066-2","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"35447180","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Decision trees in epidemiological research.","authors":"Ashwini Venkatasubramaniam, Julian Wolfson, Nathan Mitchell, Timothy Barnes, Meghan JaKa, Simone French","doi":"10.1186/s12982-017-0064-4","DOIUrl":"https://doi.org/10.1186/s12982-017-0064-4","url":null,"abstract":"<p><strong>Background: </strong>In many studies, it is of interest to identify population subgroups that are relatively homogeneous with respect to an outcome. The nature of these subgroups can provide insight into effect mechanisms and suggest targets for tailored interventions. However, identifying relevant subgroups can be challenging with standard statistical methods.</p><p><strong>Main text: </strong>We review the literature on decision trees, a family of techniques for partitioning the population, on the basis of covariates, into distinct subgroups who share similar values of an outcome variable. We compare two decision tree methods, the popular Classification and Regression tree (CART) technique and the newer Conditional Inference tree (CTree) technique, assessing their performance in a simulation study and using data from the Box Lunch Study, a randomized controlled trial of a portion size intervention. Both CART and CTree identify homogeneous population subgroups and offer improved prediction accuracy relative to regression-based approaches when subgroups are truly present in the data. An important distinction between CART and CTree is that the latter uses a formal statistical hypothesis testing framework in building decision trees, which simplifies the process of identifying and interpreting the final tree model. We also introduce a novel way to visualize the subgroups defined by decision trees. 
Our novel graphical visualization provides a more scientifically meaningful characterization of the subgroups identified by decision trees.</p><p><strong>Conclusions: </strong>Decision trees are a useful tool for identifying homogeneous subgroups defined by combinations of individual characteristics. While all decision tree techniques generate subgroups, we advocate the use of the newer CTree technique due to its simplicity and ease of interpretation.</p>","PeriodicalId":39896,"journal":{"name":"Emerging Themes in Epidemiology","volume":"14 ","pages":"11"},"PeriodicalIF":2.3,"publicationDate":"2017-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1186/s12982-017-0064-4","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"35439732","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}