{"title":"A spectrum of explainable and interpretable machine learning approaches for genomic studies","authors":"A. M. Conard, Alan DenAdel, Lorin Crawford","doi":"10.1002/wics.1617","DOIUrl":"https://doi.org/10.1002/wics.1617","url":null,"abstract":"The advancement of high‐throughput genomic assays has led to enormous growth in the availability of large‐scale biological datasets. Over the last two decades, these increasingly complex data have required statistical approaches that are more sophisticated than traditional linear models. Machine learning methodologies such as neural networks have yielded state‐of‐the‐art performance for prediction‐based tasks in many biomedical applications. However, a notable downside of these machine learning models is that they typically do not reveal how or why accurate predictions are made. In many areas of biomedicine, this “black box” property can be less than desirable, particularly when there is a need to perform in silico hypothesis testing about a biological system, in addition to justifying model findings for downstream decision‐making, such as determining the best next experiment or treatment strategy. Explainable and interpretable machine learning approaches have emerged to overcome this issue. While explainable methods attempt to derive post hoc understanding of what a model has learned, interpretable models are designed to inherently provide an intelligible definition of their parameters and architecture. Here, we review the model transparency spectrum moving from black box and explainable, to interpretable machine learning methodology. Motivated by applications in genomics, we provide background on the advances across this spectrum, detailing specific approaches in both supervised and unsupervised learning. Importantly, we focus on the promise of incorporating existing biological knowledge when constructing interpretable machine learning methods for biomedical applications. We then close with considerations and opportunities for new development in this space.","PeriodicalId":47779,"journal":{"name":"Wiley Interdisciplinary Reviews-Computational Statistics","volume":null,"pages":null},"PeriodicalIF":3.2,"publicationDate":"2023-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49095612","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Functional neuroimaging in the era of Big Data and Open Science: A modern overview","authors":"N. Lazar","doi":"10.1002/wics.1609","DOIUrl":"https://doi.org/10.1002/wics.1609","url":null,"abstract":"In the past 30 years, the statistical analysis of functional neuroimaging data has made much progress, and spurred many new research directions. At the same time, problems with reproducibility and replicability have plagued the field, owing in part to small sample sizes, a plethora of choices at the data preprocessing stage, and overall lack of transparency in reporting. The latter two in particular pose barriers to statisticians who want to become involved in the area. Recent efforts by some in the neuroimaging community to address these problems represent a turning point. This article highlights the current landscape and provides an introduction to some of the relevant resources in “open neuroimaging.”","PeriodicalId":47779,"journal":{"name":"Wiley Interdisciplinary Reviews-Computational Statistics","volume":null,"pages":null},"PeriodicalIF":3.2,"publicationDate":"2023-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45039123","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Neuroimaging statistical approaches for determining neural correlates of Alzheimer's disease via positron emission tomography imaging","authors":"D. F. Drake, G. Derado, Lijun Zhang, F. D. Bowman","doi":"10.1002/wics.1606","DOIUrl":"https://doi.org/10.1002/wics.1606","url":null,"abstract":"Alzheimer's disease (AD) is a degenerative disorder involving significant memory loss and other cognitive deficits, manifesting as a progression from normal cognitive functioning to mild cognitive impairment to AD. The sooner an accurate diagnosis of probable AD is made, the easier it is to manage symptoms and plan for future therapy. Functional neuroimaging stands to be a useful tool in achieving early diagnosis. Among the many neuroimaging modalities, positron emission tomography (PET) provides direct regional assessment of, among others, brain metabolism, cerebral blood flow, and amyloid deposition, all quantities of interest in the characterization of AD. However, there are analytic challenges in identifying early indicators of AD from these high‐dimensional imaging data sets, and it is unclear whether early indicators of AD are more likely to emerge in localized patterns of brain activity or in patterns of correlation between distinct brain regions. Early PET‐based analyses of AD focused on alterations in metabolic activity at the voxel‐level or in anatomically defined regions of interest. Other approaches, including seed‐voxel and multivariate techniques, seek to characterize metabolic connectivity by identifying other regions in the brain with similar patterns of activity across subjects. We briefly review various neuroimaging statistical approaches applied to determine changes in metabolic activity or metabolic connectivity associated with AD. We then present an approach that provides a unified statistical framework for addressing both metabolic activity and connectivity. Specifically, we apply a Bayesian spatial hierarchical framework to longitudinal metabolic PET scans from the Alzheimer's Disease Neuroimaging Initiative.","PeriodicalId":47779,"journal":{"name":"Wiley Interdisciplinary Reviews-Computational Statistics","volume":null,"pages":null},"PeriodicalIF":3.2,"publicationDate":"2023-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42907370","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Information criteria for model selection","authors":"Jiawei Zhang, Yuhong Yang, Jie Ding","doi":"10.1002/wics.1607","DOIUrl":"https://doi.org/10.1002/wics.1607","url":null,"abstract":"The rapid development of modeling techniques has brought many opportunities for data‐driven discovery and prediction. However, this also leads to the challenge of selecting the most appropriate model for any particular data task. Information criteria, such as the Akaike information criterion (AIC) and Bayesian information criterion (BIC), have been developed as a general class of model selection methods with profound connections with foundational thoughts in statistics and information theory. Many perspectives and theoretical justifications have been developed to understand when and how to use information criteria, which often depend on particular data circumstances. This review article will revisit information criteria by summarizing their key concepts, evaluation metrics, fundamental properties, interconnections, recent advancements, and common misconceptions to enrich the understanding of model selection in general.","PeriodicalId":47779,"journal":{"name":"Wiley Interdisciplinary Reviews-Computational Statistics","volume":null,"pages":null},"PeriodicalIF":3.2,"publicationDate":"2023-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44224345","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Data Integration in Causal Inference.","authors":"Xu Shi, Ziyang Pan, Wang Miao","doi":"10.1002/wics.1581","DOIUrl":"10.1002/wics.1581","url":null,"abstract":"Integrating data from multiple heterogeneous sources has become increasingly popular to achieve a large sample size and diverse study population. This paper reviews developments in causal inference methods that combine multiple datasets collected by potentially different designs from potentially heterogeneous populations. We summarize recent advances on combining randomized clinical trials with external information from observational studies or historical controls, combining samples when no single sample has all relevant variables with application to two-sample Mendelian randomization, distributed data setting under privacy concerns for comparative effectiveness and safety research using real-world data, Bayesian causal inference, and causal discovery methods.","PeriodicalId":47779,"journal":{"name":"Wiley Interdisciplinary Reviews-Computational Statistics","volume":null,"pages":null},"PeriodicalIF":3.2,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9880960/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"9621926","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A review on authorship attribution in text mining","authors":"Wanwan Zheng, Mingzhe Jin","doi":"10.1002/wics.1584","DOIUrl":"https://doi.org/10.1002/wics.1584","url":null,"abstract":"The issue of authorship attribution has long been considered and continues to be a popular topic. Because of advances in digital computers, this field has experienced rapid developments in the last decade. In this article, a survey of recent advances in authorship attribution in text mining is presented. This survey focuses on authorship attribution methods that are statistically or computationally supported as opposed to traditional literary approaches. The main aspects covered include the changes in research topics over time, basic feature metrics, machine learning techniques, and the advantages and disadvantages of each approach. Moreover, the corpus size, number of candidates, data imbalance, and result description, all of which pose challenges in authorship attribution, are discussed to inform future work.","PeriodicalId":47779,"journal":{"name":"Wiley Interdisciplinary Reviews-Computational Statistics","volume":null,"pages":null},"PeriodicalIF":3.2,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"51219141","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A survey of smoothing techniques based on a backfitting algorithm in estimation of semiparametric additive models","authors":"S. E. Ahmed, D. Aydın, E. Yılmaz","doi":"10.1002/wics.1605","DOIUrl":"https://doi.org/10.1002/wics.1605","url":null,"abstract":"This paper aims to present an overview of semiparametric additive models. The estimation of the finite‐dimensional parameters of semiparametric regression models that involve additive nonparametric components is explained, including their historical background. In addition, three different smoothing techniques are considered in order to show the working procedures of the estimators and to explore their statistical properties: smoothing splines, kernel smoothing, and local linear regression. These methods are compared with respect to both their theoretical and practical behaviors. A simulation study and a real data example are carried out to reveal the performances of the three methods. Accordingly, the advantages and disadvantages of each method regarding semiparametric additive models are presented based on their comparative scores using determined evaluation metrics for loss of information.","PeriodicalId":47779,"journal":{"name":"Wiley Interdisciplinary Reviews-Computational Statistics","volume":null,"pages":null},"PeriodicalIF":3.2,"publicationDate":"2022-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49285273","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Diseases maps of spatial epidemiological data by R","authors":"T. Kubota","doi":"10.1002/wics.1604","DOIUrl":"https://doi.org/10.1002/wics.1604","url":null,"abstract":"Disease maps are essential when analyzing spatial epidemiological data, such as newly detected COVID‐19 positive cases or suicide deaths, because the method of analysis must be determined before spatial statistical analysis can be performed. Disease maps give an initial overview of the data and provide evidence of regional trends, which the analyst can check. Therefore, in this article, the author aimed to use R, a statistical data analysis tool, to draw spatial epidemiological data in the form of disease maps. This article presents three different methods and analyzes recent trends in COVID‐19 and suicide mortality. The author used monthly data from April, July, and October 2020. The results showed no significant trend in April, but some prefectures showed a negative correlation in July. On the other hand, some prefectures showed a positive correlation in October, confirming the influence of COVID‐19 on suicide by region.","PeriodicalId":47779,"journal":{"name":"Wiley Interdisciplinary Reviews-Computational Statistics","volume":null,"pages":null},"PeriodicalIF":3.2,"publicationDate":"2022-12-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46217630","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Error control in tree structured hypothesis testing","authors":"J. Miecznikowski, Jiefei Wang","doi":"10.1002/wics.1603","DOIUrl":"https://doi.org/10.1002/wics.1603","url":null,"abstract":"This manuscript reviews some recent and popular error control methods for tree structured hypothesis testing. We review a common setting/definition for hypotheses arranged in a tree structure and we discuss two common Type I errors present in multiple testing: family wise error rates (FWERs) and false discovery rate (FDR). We also contrast these methods with a recent development designed to control the false selection rate (FSR). We discuss the algorithms used to implement these error controls and the strategies used to navigate tree structures in light of these errors. We highlight the assumptions necessary in these strategies, summarize the available R software packages to implement these approaches, and show them at work on an example.","PeriodicalId":47779,"journal":{"name":"Wiley Interdisciplinary Reviews-Computational Statistics","volume":null,"pages":null},"PeriodicalIF":3.2,"publicationDate":"2022-11-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41824131","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}