{"title":"Casting multiple shadows: interactive data visualisation with tours and embeddings","authors":"Stuart Lee, U. Laa, D. Cook","doi":"10.52933/jdssv.v2i3.21","DOIUrl":"https://doi.org/10.52933/jdssv.v2i3.21","url":null,"abstract":"Non-linear dimensionality reduction (NLDR) methods such as t-distributed stochastic neighbour embedding (t-SNE) are ubiquitous in the natural sciences. However, the appropriate use of these methods is difficult because of their complex parameterisations; analysts must make trade-offs in order to identify structure in the visualisation of an NLDR technique. We present visual diagnostics for the pragmatic usage of NLDR methods by combining them with a technique called the tour. A tour is a sequence of interpolated linear projections of multivariate data onto a lower dimensional space. The sequence is displayed as a dynamic visualisation, allowing a user to see the shadows the high-dimensional data casts in a lower dimensional view. By linking the tour to an NLDR view, we can preserve global structure and, through user interactions like linked brushing, observe where the NLDR view may be misleading. We present several case studies, from both simulations and single-cell transcriptomics, that show our approach is useful for cluster orientation tasks. 
The implementation of our framework is available as the R package liminal at https://github.com/sa-lee/liminal.","PeriodicalId":93459,"journal":{"name":"Journal of data science, statistics, and visualisation","volume":"12 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2022-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81018292","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"INTEREST: INteractive Tool for Exploring REsults from Simulation sTudies.","authors":"Alessandro Gasparini, Tim P Morris, Michael J Crowther","doi":"10.52933/jdssv.v1i4.9","DOIUrl":"https://doi.org/10.52933/jdssv.v1i4.9","url":null,"abstract":"Simulation studies allow us to explore the properties of statistical methods. They provide a powerful tool with a multiplicity of aims; among others: evaluating and comparing new or existing statistical methods, assessing violations of modelling assumptions, helping with the understanding of statistical concepts, and supporting the design of clinical trials. The increased availability of powerful computational tools and usable software has contributed to the rise of simulation studies in the current literature. However, simulation studies involve increasingly complex designs, making it difficult to provide all relevant results clearly. Dissemination of results plays a focal role in simulation studies: it can drive applied analysts to use methods that have been shown to perform well in their settings, guide researchers to develop new methods in a promising direction, and provide insights into less established methods. It is crucial that we can digest relevant results of simulation studies. Therefore, we developed INTEREST: an INteractive Tool for Exploring REsults from Simulation sTudies. The tool has been developed using the Shiny framework in R and is available as a web app or as a standalone package. It requires uploading a tidy format dataset with the results of a simulation study in R, Stata, SAS, SPSS, or comma-separated format. A variety of performance measures are estimated automatically along with Monte Carlo standard errors; results and performance summaries are displayed both in tabular and graphical fashion, with a wide variety of available plots. Consequently, the reader can focus on simulation parameters and estimands of most interest. 
In conclusion, INTEREST can facilitate the investigation of results from simulation studies and supplement the reporting of results, allowing researchers to share detailed results from their simulations and readers to explore them freely.","PeriodicalId":93459,"journal":{"name":"Journal of data science, statistics, and visualisation","volume":"1 4","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-12-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7612246/pdf/EMS140699.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"39949693","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On Generalization and Computation of Tukey's Depth: Part II","authors":"Yiyuan She, Shao Tang, Jingze Liu","doi":"10.52933/jdssv.v2i2.61","DOIUrl":"https://doi.org/10.52933/jdssv.v2i2.61","url":null,"abstract":"This paper studies how to generalize Tukey's depth to problems defined in a restricted space that may be curved or have boundaries, and to problems with a nondifferentiable objective. First, using a manifold approach, we propose a broad class of Riemannian depth for smooth problems defined on a Riemannian manifold, and showcase its applications in spherical data analysis, principal component analysis, and multivariate orthogonal regression. Moreover, for nonsmooth problems, we introduce additional slack variables and inequality constraints to define a novel slacked data depth, which can perform center-outward rankings of estimators arising from sparse learning and reduced rank regression. Real data examples illustrate the usefulness of some proposed data depths.","PeriodicalId":93459,"journal":{"name":"Journal of data science, statistics, and visualisation","volume":"4 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86968094","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On Generalization and Computation of Tukey's Depth: Part I","authors":"Yiyuan She, S. Tang, Jingze Liu","doi":"10.52933/jdssv.v2i1.23","DOIUrl":"https://doi.org/10.52933/jdssv.v2i1.23","url":null,"abstract":"Tukey's depth offers a powerful tool for nonparametric inference and estimation, but also encounters serious computational and methodological difficulties in modern statistical data analysis. This paper studies how to generalize and compute Tukey-type depths in multi-dimensions. A general framework of influence-driven polished subspace depth, which emphasizes the importance of the underlying influence space and discrepancy measure, is introduced. The new matrix formulation enables us to utilize state-of-the-art optimization techniques to develop scalable algorithms with implementation ease and guaranteed fast convergence. In particular, half-space depth as well as regression depth can now be computed much faster than previously possible, with the support from extensive experiments. A companion paper is also offered to the reader in the same issue of this journal.","PeriodicalId":93459,"journal":{"name":"Journal of data science, statistics, and visualisation","volume":"80 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89122730","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Editorial Founding Issue","authors":"S. Aelst, P. Groenen","doi":"10.52933/jdssv.v1i1.52","DOIUrl":"https://doi.org/10.52933/jdssv.v1i1.52","url":null,"abstract":"The Journal of Data Science, Statistics, and Visualisation (JDSSV) is an electronic journal which welcomes contributions to data science, statistics, and visualisation, and in particular, those aspects which link and integrate these subject areas. Articles can cover topics such as machine learning and statistical learning, the visualisation and verbalisation of data, visual analytics, big data infrastructures and analytics, interactive learning, and advanced computing. Articles that discuss two or more research areas of the journal are favoured. Scientific contributions should be of a high standard. Articles should be oriented towards a wide scientific audience of statisticians, data scientists, computer scientists, data analysts, etc. The journal welcomes original contributions that are not being considered for publication elsewhere and contain a high level of novelty. Articles with a thorough but concise review of a certain topic with the potential to provide new insights are also welcome. Manuscripts submitted to the journal generally are accompanied by supplementary material containing software code, data, technical derivations or detailed explanations, additional examples, etc. 
All submitted material will be reviewed by the assigned associate editor and reviewers of the manuscript.","PeriodicalId":93459,"journal":{"name":"Journal of data science, statistics, and visualisation","volume":"19 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86046756","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Spatial SEIR Model for COVID-19 in South Africa","authors":"I. Fabris-Rotelli, Jenny P. Holloway, Zaid Kimmie, S. Archibald, P. Debba, Raeesa Manjoo-Docrat, A. Roux, Nontembeko Dudeni-Tlhone, Charl Janse van Rensburg, R. Thiede, N. Abdelatif, Sibusisiwe Makhanya, Arminn Potgieter","doi":"10.20944/PREPRINTS202106.0262.V1","DOIUrl":"https://doi.org/10.20944/PREPRINTS202106.0262.V1","url":null,"abstract":"The virus SARS-CoV-2 has resulted in numerous modelling approaches arising rapidly to understand the spread of the disease COVID-19 and to plan for future interventions. Herein, we present an SEIR model with a spatial spread component as well as four infectious compartments to account for the variety of symptom levels and transmission rate. The model takes into account the pattern of spatial vulnerability in South Africa through a vulnerability index that is based on socioeconomic and health susceptibility characteristics. Another spatially relevant factor in this context is level of mobility throughout. The thesis of this study is that without the contextual spatial spread modelling, the heterogeneity in COVID-19 prevalence in the South African setting would not be captured. The model is illustrated on South African COVID-19 case counts and hospitalisations.","PeriodicalId":93459,"journal":{"name":"Journal of data science, statistics, and visualisation","volume":"84 1","pages":"14-45"},"PeriodicalIF":0.0,"publicationDate":"2021-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85564927","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Review of Containerization for Interactive and Reproducible Analysis","authors":"Gregory J. Hunt, Johann A. Gagnon-Bartsch","doi":"10.52933/jdssv.v3i1.53","DOIUrl":"https://doi.org/10.52933/jdssv.v3i1.53","url":null,"abstract":"In recent decades the analysis of data has become increasingly computational. Correspondingly, this has changed how scientific and statistical work is shared. For example, it is now commonplace for underlying analysis code and data to be proffered alongside journal publications and conference talks. Unfortunately, sharing code faces several challenges. First, it is often difficult to take code from one computer and run it on another. Code configuration, version, and dependency issues often make this challenging. Secondly, even if the code runs, it is often hard to understand or interact with the analysis. This makes it difficult to assess the code and its findings, for example, in a peer review process. In this review we describe the combination of two computing technologies that help make analyses shareable, interactive, and completely reproducible. These technologies are (1) analysis containerization, which leverages virtualization to fully encapsulate analysis, data, code and dependencies into an interactive and shareable format, and (2) code notebooks, a literate programming format for interacting with analyses. The fusion of these two technologies offers significant advantages over using either individually. 
This review surveys how the combination enhances the accessibility and reproducibility of code, analyses, and ideas.","PeriodicalId":93459,"journal":{"name":"Journal of data science, statistics, and visualisation","volume":"16 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-03-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73237900","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Robust Model-Based Clustering","authors":"Juan D. González, R. Maronna, V. Yohai, R. Zamar","doi":"10.1201/b18358-20","DOIUrl":"https://doi.org/10.1201/b18358-20","url":null,"abstract":"We propose a class of Fisher-consistent robust estimators for mixture models. These estimators are then used to build a robust model-based clustering procedure. We study in detail the case of multivariate Gaussian mixtures and propose an algorithm, similar to the EM algorithm, to compute the proposed estimators and build the robust clusters. An extensive Monte Carlo simulation study shows that our proposal outperforms other robust and non robust, state of the art, model-based clustering procedures. We apply our proposal to a real data set and show that again it outperforms alternative procedures.","PeriodicalId":93459,"journal":{"name":"Journal of data science, statistics, and visualisation","volume":"42 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-02-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85674805","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Handling Cellwise Outliers by Sparse Regression and Robust Covariance","authors":"Jakob Raymaekers, P. Rousseeuw","doi":"10.52933/jdssv.v1i3.18","DOIUrl":"https://doi.org/10.52933/jdssv.v1i3.18","url":null,"abstract":"We propose a data-analytic method for detecting cellwise outliers. Given a robust covariance matrix, outlying cells (entries) in a row are found by the cellFlagger technique which combines lasso regression with a stepwise application of constructed cutoff values. The penalty term of the lasso has a physical interpretation as the total distance that suspicious cells need to move in order to bring their row into the fold. For estimating a cellwise robust covariance matrix we construct a detection-imputation method which alternates between flagging outlying cells and updating the covariance matrix as in the EM algorithm. The proposed methods are illustrated by simulations and on real data about volatile organic compounds in children.","PeriodicalId":93459,"journal":{"name":"Journal of data science, statistics, and visualisation","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2020-12-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82184537","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Compressed sensing with a jackknife and a bootstrap","authors":"Aaron Defazio, M. Tygert, Rachel A. Ward, Jure Zbontar","doi":"10.52933/jdssv.v2i4.43","DOIUrl":"https://doi.org/10.52933/jdssv.v2i4.43","url":null,"abstract":"Compressed sensing proposes to reconstruct more degrees of freedom in a signal than the number of values actually measured (based on a potentially unjustified regularizer or prior distribution). Compressed sensing therefore risks introducing errors -- inserting spurious artifacts or masking the abnormalities that medical imaging seeks to discover. Estimating errors using the standard statistical tools of a jackknife and a bootstrap yields \"error bars\" in the form of full images that are remarkably qualitatively representative of the actual errors (at least when evaluated and validated on data sets for which the ground truth and hence the actual error is available). These images show the structure of possible errors -- without recourse to measuring the entire ground truth directly -- and build confidence in regions of the images where the estimated errors are small. Further visualizations and summary statistics can aid in the interpretation of such error estimates. Visualizations include suitable colorizations of the reconstruction, as well as the obvious \"correction\" of the reconstruction by subtracting off the error estimates. The canonical summary statistic would be the root-mean-square of the error estimates. Unfortunately, colorizations appear likely to be too distracting for actual clinical practice in medical imaging, and the root-mean-square gets swamped by background noise in the error estimates. 
Fortunately, straightforward displays of the error estimates and of the \"corrected\" reconstruction are illuminating, and the root-mean-square improves greatly after mild blurring of the error estimates; the blurring is barely perceptible to the human eye yet smooths away background noise that would otherwise overwhelm the root-mean-square.","PeriodicalId":93459,"journal":{"name":"Journal of data science, statistics, and visualisation","volume":"59 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85227334","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}