{"title":"To democratize research with sensitive data, we should make synthetic data more accessible","authors":"Erik-Jan van Kesteren","doi":"arxiv-2404.17271","DOIUrl":"https://doi.org/arxiv-2404.17271","url":null,"abstract":"For over 30 years, synthetic data has been heralded as a promising solution to make sensitive datasets accessible. However, despite much research effort and several high-profile use-cases, the widespread adoption of synthetic data as a tool for open, accessible, reproducible research with sensitive data is still a distant dream. In this opinion, Erik-Jan van Kesteren, head of the ODISSEI Social Data Science team, argues that in order to progress towards widespread adoption of synthetic data as a privacy enhancing technology, the data science research community should shift focus away from developing better synthesis methods: instead, it should develop accessible tools, educate peers, and publish small-scale case studies.","PeriodicalId":501323,"journal":{"name":"arXiv - STAT - Other Statistics","volume":"7 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-04-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140810518","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Investigation into Distance Measures in Cluster Analysis","authors":"Zoe Shapcott","doi":"arxiv-2404.13664","DOIUrl":"https://doi.org/arxiv-2404.13664","url":null,"abstract":"This report provides an exploration of different distance measures that can be used with the $K$-means algorithm for cluster analysis. Specifically, we investigate the Mahalanobis distance, and critically assess any benefits it may have over the more traditional measures of the Euclidean, Manhattan and Maximum distances. We perform this by first defining the metrics, before considering their advantages and drawbacks as discussed in literature regarding this area. We apply these distances, first to some simulated data and then to subsets of the Dry Bean dataset [1], to explore if there is a better quality detectable for one metric over the others in these cases. One of the sections is devoted to analysing the information obtained from ChatGPT in response to prompts relating to this topic.","PeriodicalId":501323,"journal":{"name":"arXiv - STAT - Other Statistics","volume":"2 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140801506","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Seasonal and Periodic Patterns of PM2.5 in Manhattan using the Variable Bandpass Periodic Block Bootstrap","authors":"Yanan Sun, Edward Valachovic","doi":"arxiv-2404.08738","DOIUrl":"https://doi.org/arxiv-2404.08738","url":null,"abstract":"Air quality is a critical component of environmental health. Monitoring and analysis of particulate matter with a diameter of 2.5 micrometers or smaller (PM2.5) plays a pivotal role in understanding air quality changes. This study focuses on the application of a new bandpass bootstrap approach, termed the Variable Bandpass Periodic Block Bootstrap (VBPBB), for analyzing time series data which provides modeled predictions of daily mean PM2.5 concentrations over 16 years in Manhattan, New York, the United States. The VBPBB can be used to explore periodically correlated (PC) principal components for this daily mean PM2.5 dataset. This method uses bandpass filters to isolate distinct PC components from datasets, removing unwanted interference including noise, and bootstraps the PC components. This preserves the PC structure and permits a better understanding of the periodic characteristics of time series data. The results of the VBPBB are compared against outcomes from alternative block bootstrapping techniques. 
The findings of this research indicate potential trends of elevated PM2.5 levels, providing evidence of significant semi-annual and weekly patterns missed by other methods.","PeriodicalId":501323,"journal":{"name":"arXiv - STAT - Other Statistics","volume":"52 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-04-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140566016","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Non-Parametric Estimation of Multiple Periodic Components in Turkey's Electricity Consumption","authors":"Jie Yao, Edward Valachovic","doi":"arxiv-2404.03786","DOIUrl":"https://doi.org/arxiv-2404.03786","url":null,"abstract":"Electric generation and consumption are an essential component of contemporary living, influencing diverse facets of our daily routines, convenience, and economic progress. There is a high demand for characterizing the periodic pattern of electricity consumption. The Variable Bandpass Periodic Block Bootstrap (VBPBB) employs a bandpass filter aligned to retain the frequency of a periodically correlated (PC) component and eliminate interference from other components. This leads to a significant reduction in the size of bootstrapped confidence intervals. Furthermore, other PC bootstrap methods preserve one but not multiple periodically correlated components, so VBPBB performs better than these methods by providing a more precise estimation of the sampling distribution for the desired characteristics. The study of the periodic means of Turkey's electricity consumption using VBPBB is presented and compared with outcomes from alternative bootstrapping approaches. These findings offer significant evidence supporting the existence of daily, weekly, and annual PC patterns, along with information on their timing and confidence intervals for their effects. 
This information is valuable for enhancing predictions and preparations for future responses to electricity consumption.","PeriodicalId":501323,"journal":{"name":"arXiv - STAT - Other Statistics","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140565947","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient estimation for a smoothing thin plate spline in a two-dimensional space","authors":"Joaquin Cavieres, Michael Karkulik","doi":"arxiv-2404.01902","DOIUrl":"https://doi.org/arxiv-2404.01902","url":null,"abstract":"Using a deterministic framework allows us to estimate a function with the purpose of interpolating data in spatial statistics. Radial basis functions are commonly used for scattered data interpolation in a d-dimensional space; however, interpolation problems have to deal with dense matrices. For the case of smoothing thin plate splines, we propose an efficient way to address this problem by compressing the dense matrix by a hierarchical matrix ($\\mathcal{H}$-matrix) and using the conjugate gradient method to solve the linear system of equations. A simulation study was conducted to assess the effectiveness of the spatial interpolation method. The results indicated that employing an $\\mathcal{H}$-matrix along with the conjugate gradient method allows for efficient computations while maintaining a minimal error. We also provide a sensitivity analysis that covers a range of smoothing and compression parameter values, along with a Monte Carlo simulation aimed at quantifying uncertainty in the approximated function. Lastly, we present a comparative study between the proposed approach and thin plate regression using the \"mgcv\" package of the statistical software R. 
The comparison results demonstrate similar interpolation performance between the two methods.","PeriodicalId":501323,"journal":{"name":"arXiv - STAT - Other Statistics","volume":"59 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-04-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140565836","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Best Subset Solution Path for Linear Dimension Reduction Models using Continuous Optimization","authors":"Benoit Liquet, Sarat Moka, Samuel Muller","doi":"arxiv-2403.20007","DOIUrl":"https://doi.org/arxiv-2403.20007","url":null,"abstract":"The selection of best variables is a challenging problem in supervised and unsupervised learning, especially in high dimensional contexts where the number of variables is usually much larger than the number of observations. In this paper, we focus on two multivariate statistical methods: principal components analysis and partial least squares. Both approaches are popular linear dimension-reduction methods with numerous applications in several fields including in genomics, biology, environmental science, and engineering. In particular, these approaches build principal components, new variables that are combinations of all the original variables. A main drawback of principal components is the difficulty of interpreting them when the number of variables is large. To define principal components from the most relevant variables, we propose to cast the best subset solution path method into principal component analysis and partial least squares frameworks. We offer a new alternative by exploiting a continuous optimization algorithm for best subset solution path. Empirical studies show the efficacy of our approach for providing the best subset solution path. The usage of our algorithm is further illustrated through the analysis of two real datasets. 
The first dataset is analyzed using principal component analysis while the analysis of the second dataset is based on the partial least squares framework.","PeriodicalId":501323,"journal":{"name":"arXiv - STAT - Other Statistics","volume":"122 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-03-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140565845","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Why Name Popularity is a Good Test of Historicity","authors":"Luuk van de Weghe, Jason Wilson","doi":"arxiv-2403.14883","DOIUrl":"https://doi.org/arxiv-2403.14883","url":null,"abstract":"Are name statistics in the Gospels and Acts a good test of historicity? Kamil Gregor and Brian Blais, in a recent article in The Journal for the Study of the Historical Jesus, argue that the sample of name occurrences in the Gospels and Acts is too small to be determinative and that several statistical anomalies weigh against a positive verdict. Unfortunately, their conclusions result directly from improper testing and questionable data selection. Chi-squared goodness-of-fit testing establishes that name occurrences in the Gospels and Acts fit into their historical context at least as well as those in the works of Josephus. Additionally, they fit better than occurrences derived from ancient fictional sources and occurrences from modern, well-researched historical novels.","PeriodicalId":501323,"journal":{"name":"arXiv - STAT - Other Statistics","volume":"31 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-03-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140301921","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Risk Quadrangle and Robust Optimization Based on $\\varphi$-Divergence","authors":"Cheng Peng, Anton Malandii, Stan Uryasev","doi":"arxiv-2403.10987","DOIUrl":"https://doi.org/arxiv-2403.10987","url":null,"abstract":"This paper studies robust and distributionally robust optimization based on the extended $\\varphi$-divergence under the Fundamental Risk Quadrangle framework. We present the primal and dual representations of the quadrangle elements: risk, deviation, regret, error, and statistic. The framework provides an interpretation of portfolio optimization, classification and regression as robust optimization. We furnish illustrative examples demonstrating that many common problems are included in this framework. The $\\varphi$-divergence risk measure used in distributionally robust optimization is a special case. We conduct a case study to visualize the risk envelope.","PeriodicalId":501323,"journal":{"name":"arXiv - STAT - Other Statistics","volume":"70 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-03-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140170962","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Algorithmic syntactic causal identification","authors":"Dhurim Cakiqi, Max A. Little","doi":"arxiv-2403.09580","DOIUrl":"https://doi.org/arxiv-2403.09580","url":null,"abstract":"Causal identification in causal Bayes nets (CBNs) is an important tool in causal inference allowing the derivation of interventional distributions from observational distributions where this is possible in principle. However, most existing formulations of causal identification using techniques such as d-separation and do-calculus are expressed within the mathematical language of classical probability theory on CBNs. Yet there are many causal settings where probability theory and hence current causal identification techniques are inapplicable, such as relational databases, dataflow programs such as hardware description languages, distributed systems and most modern machine learning algorithms. We show that this restriction can be lifted by replacing the use of classical probability theory with the alternative axiomatic foundation of symmetric monoidal categories. In this alternative axiomatization, we show how an unambiguous and clean distinction can be drawn between the general syntax of causal models and any specific semantic implementation of that causal model. This allows a purely syntactic algorithmic description of general causal identification by a translation of recent formulations of the general ID algorithm through fixing. Our description is given entirely in terms of the non-parametric ADMG structure specifying a causal model and the algebraic signature of the corresponding monoidal category, to which a sequence of manipulations is then applied so as to arrive at a modified monoidal category in which the desired, purely syntactic interventional causal model, is obtained. 
We use this idea to derive purely syntactic analogues of classical back-door and front-door causal adjustment, and illustrate an application to a more complex causal model.","PeriodicalId":501323,"journal":{"name":"arXiv - STAT - Other Statistics","volume":"19 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-03-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140150232","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Guidelines for the Creation of Analysis Ready Data","authors":"Harriette Phillips, Aiden Price, Owen Forbes, Claire Boulange, Kerrie Mengersen, Marketa Reeves, Rebecca Glauert","doi":"arxiv-2403.08127","DOIUrl":"https://doi.org/arxiv-2403.08127","url":null,"abstract":"Globally, there is an increased need for guidelines to produce high-quality data outputs for analysis. No framework currently exists that provides guidelines for a comprehensive approach to producing analysis ready data (ARD). Through critically reviewing and summarising current literature, this paper proposes such guidelines for the creation of ARD. The guidelines proposed in this paper inform ten steps in the generation of ARD: ethics, project documentation, data governance, data management, data storage, data discovery and collection, data cleaning, quality assurance, metadata, and data dictionary. These steps are illustrated through a substantive case study which aimed to create ARD for a digital spatial platform: the Australian Child and Youth Wellbeing Atlas (ACYWA).","PeriodicalId":501323,"journal":{"name":"arXiv - STAT - Other Statistics","volume":"15 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140124322","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}