{"title":"A framework for understanding data science","authors":"Michael L Brodie","doi":"arxiv-2403.00776","DOIUrl":"https://doi.org/arxiv-2403.00776","url":null,"abstract":"The objective of this research is to provide a framework with which the data science community can understand, define, and develop data science as a field of inquiry. The framework is based on the classical reference framework (axiology, ontology, epistemology, methodology) used for 200 years to define knowledge discovery paradigms and disciplines in the humanities, sciences, algorithms, and now data science. I augmented it for automated problem-solving with (methods, technology, community). The resulting data science reference framework is used to define the data science knowledge discovery paradigm in terms of the philosophy of data science addressed in previous papers and the data science problem-solving paradigm, i.e., the data science method, and the data science problem-solving workflow, both addressed in this paper. The framework is a much called for unifying framework for data science as it contains the components required to define data science. For insights to better understand data science, this paper uses the framework to define the emerging, often enigmatic, data science problem-solving paradigm and workflow, and to compare them with their well-understood scientific counterparts, scientific problem-solving paradigm and workflow.","PeriodicalId":501323,"journal":{"name":"arXiv - STAT - Other Statistics","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-02-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140034111","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A scalable, synergy-first backbone decomposition of higher-order structures in complex systems","authors":"Thomas F. Varley","doi":"arxiv-2402.08135","DOIUrl":"https://doi.org/arxiv-2402.08135","url":null,"abstract":"Since its introduction in 2011, the partial information decomposition (PID) has triggered an explosion of interest in the field of multivariate information theory and the study of emergent, higher-order (\"synergistic\") interactions in complex systems. Despite its power, however, the PID has a number of limitations that restrict its general applicability: it scales poorly with system size and the standard approach to decomposition hinges on a definition of \"redundancy\", leaving synergy only vaguely defined as \"that information not redundant.\" Other heuristic measures, such as the O-information, have been introduced, although these measures typically only provided a summary statistic of redundancy/synergy dominance, rather than direct insight into the synergy itself. To address this issue, we present an alternative decomposition that is synergy-first, scales much more gracefully than the PID, and has a straightforward interpretation. Our approach defines synergy as that information in a set that would be lost following the minimally invasive perturbation on any single element. By generalizing this idea to sets of elements, we construct a totally ordered \"backbone\" of partial synergy atoms that sweeps systems scales. Our approach starts with entropy, but can be generalized to the Kullback-Leibler divergence, and by extension, to the total correlation and the single-target mutual information. Finally, we show that this approach can be used to decompose higher-order interactions beyond just information theory: we demonstrate this by showing how synergistic combinations of pairwise edges in a complex network supports signal communicability and global integration. We conclude by discussing how this perspective on synergistic structure (information-based or otherwise) can deepen our understanding of part-whole relationships in complex systems.","PeriodicalId":501323,"journal":{"name":"arXiv - STAT - Other Statistics","volume":"13 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-02-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139764541","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Using Mathlink Cubes to Introduce Data Wrangling with Examples in R","authors":"Lucy D'Agostino McGowan","doi":"arxiv-2402.07029","DOIUrl":"https://doi.org/arxiv-2402.07029","url":null,"abstract":"This paper explores an innovative approach to teaching data wrangling skills to students through hands-on activities before transitioning to coding. Data wrangling, a critical aspect of data analysis, involves cleaning, transforming, and restructuring data. We introduce the use of a physical tool, mathlink cubes, to facilitate a tangible understanding of data sets. This approach helps students grasp the concepts of data wrangling before implementing them in coding languages such as R. We detail a classroom activity that includes hands-on tasks paralleling common data wrangling processes such as filtering, selecting, and mutating, followed by their coding equivalents using R's `dplyr` package.","PeriodicalId":501323,"journal":{"name":"arXiv - STAT - Other Statistics","volume":"8 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-02-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139764450","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Quantitative Analysis of AI-Generated Texts in Academic Research: A Study of AI Presence in Arxiv Submissions using AI Detection Tool","authors":"Arslan Akram","doi":"arxiv-2403.13812","DOIUrl":"https://doi.org/arxiv-2403.13812","url":null,"abstract":"Many people are interested in ChatGPT since it has become a prominent AIGC model that provides high-quality responses in various contexts, such as software development and maintenance. Misuse of ChatGPT might cause significant issues, particularly in public safety and education, despite its immense potential. The majority of researchers choose to publish their work on Arxiv. The effectiveness and originality of future work depend on the ability to detect AI components in such contributions. To address this need, this study will analyze a method that can see purposely manufactured content that academic organizations use to post on Arxiv. For this study, a dataset was created using physics, mathematics, and computer science articles. Using the newly built dataset, the following step is to put originality.ai through its paces. The statistical analysis shows that Originality.ai is very accurate, with a rate of 98%.","PeriodicalId":501323,"journal":{"name":"arXiv - STAT - Other Statistics","volume":"87 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-02-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140205779","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Malaria incidence and prevalence: An ecological analysis through Six Sigma approach","authors":"Md. Al-Amin, Kesava Chandran Vijaya Bhaskar, Walaa Enab, Reza Kamali Miab, Jennifer Slavin, Nigar Sultana","doi":"arxiv-2402.02233","DOIUrl":"https://doi.org/arxiv-2402.02233","url":null,"abstract":"Malaria is the leading cause of death globally, especially in sub-Saharan African countries claiming over 400,000 deaths globally each year, underscoring the critical need for continued efforts to combat this preventable and treatable disease. The objective of this study is to provide statistical guidance on the optimal preventive and control measures against malaria. Data have been collected from reliable sources, such as World Health Organization, UNICEF, Our World in Data, and STATcompiler. Data were categorized according to the factors and sub-factors related to deaths caused by malaria. These factors and sub-factors were determined based on root cause analysis and data sources. Using JMP 16 Pro software, both linear and multiple linear regression were conducted to analyze the data. The analyses aimed to establish a linear relationship between the dependent variable (malaria deaths in the overall population) and independent variables, such as life expectancy, malaria prevalence in children, net usage, indoor residual spraying usage, literate population, and population with inadequate sanitation in each selected sample country. The statistical analysis revealed that using insecticide treated nets (ITNs) by children and individuals significantly decreased the death count, as 1,000 individuals sleeping under ITNs could reduce the death count by eight. Based on the statistical analysis, this study suggests more rigorous research on the usage of ITNs.","PeriodicalId":501323,"journal":{"name":"arXiv - STAT - Other Statistics","volume":"27 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-02-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139767019","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Gerontologic Biostatistics 2.0: Developments over 10+ years in the age of data science","authors":"Chixiang Chen, Michelle Shardell, Jaime Lynn Speiser, Karen Bandeen-Roche, Heather Allore, Thomas G Travison, Michael Griswold, Terrence E. Murphy","doi":"arxiv-2402.01112","DOIUrl":"https://doi.org/arxiv-2402.01112","url":null,"abstract":"Background: Introduced in 2010, the sub-discipline of gerontologic biostatistics (GBS) was conceptualized to address the specific challenges in analyzing data from research studies involving older adults. However, the evolving technological landscape has catalyzed data science and statistical advancements since the original GBS publication, greatly expanding the scope of gerontologic research. There is a need to describe how these advancements enhance the analysis of multi-modal data and complex phenotypes that are hallmarks of gerontologic research. Methods: This paper introduces GBS 2.0, an updated and expanded set of analytical methods reflective of the practice of gerontologic biostatistics in contemporary and future research. Results: GBS 2.0 topics and relevant software resources include cutting-edge methods in experimental design; analytical techniques that include adaptations of machine learning, quantifying deep phenotypic measurements, high-dimensional -omics analysis; the integration of information from multiple studies, and strategies to foster reproducibility, replicability, and open science. Discussion: The methodological topics presented here seek to update and expand GBS. By facilitating the synthesis of biostatistics and data science in gerontology, we aim to foster the next generation of gerontologic researchers.","PeriodicalId":501323,"journal":{"name":"arXiv - STAT - Other Statistics","volume":"236 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-02-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139690246","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A review of regularised estimation methods and cross-validation in spatiotemporal statistics","authors":"Philipp Otto, Alessandro Fassò, Paolo Maranzano","doi":"arxiv-2402.00183","DOIUrl":"https://doi.org/arxiv-2402.00183","url":null,"abstract":"This review article focuses on regularised estimation procedures applicable to geostatistical and spatial econometric models. These methods are particularly relevant in the case of big geospatial data for dimensionality reduction or model selection. To structure the review, we initially consider the most general case of multivariate spatiotemporal processes (i.e., $g > 1$ dimensions of the spatial domain, a one-dimensional temporal domain, and $q \\geq 1$ random variables). Then, the idea of regularised/penalised estimation procedures and different choices of shrinkage targets are discussed. Finally, guided by the elements of a mixed-effects model, which allows for a variety of spatiotemporal models, we show different regularisation procedures and how they can be used for the analysis of geo-referenced data, e.g. for selection of relevant regressors, dimensionality reduction of the covariance matrices, detection of conditionally independent locations, or the estimation of a full spatial interaction matrix.","PeriodicalId":501323,"journal":{"name":"arXiv - STAT - Other Statistics","volume":"2 5 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-01-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139668263","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Human-Centric and Integrative Lighting Asset Management in Public Libraries: Qualitative Insights and Challenges from a Swedish Field Study","authors":"Jing Lin, Per Olof Hedekvist, Nina Mylly, Math Bollen, Jingchun Shen, Jiawei Xiong, Christofer Silfvenius","doi":"arxiv-2401.11000","DOIUrl":"https://doi.org/arxiv-2401.11000","url":null,"abstract":"Traditional lighting source reliability evaluations, often covering just half of a lamp's volume, can misrepresent real-world performance. To overcome these limitations, adopting advanced asset management strategies for a more holistic evaluation is crucial. This paper investigates human-centric and integrative lighting asset management in Swedish public libraries. Through field observations, interviews, and gap analysis, the study highlights a disparity between current lighting conditions and stakeholder expectations, with issues like eye strain suggesting significant improvement potential. We propose a shift towards more dynamic lighting asset management and reliability evaluations, emphasizing continuous enhancement and comprehensive training in human-centric and integrative lighting principles.","PeriodicalId":501323,"journal":{"name":"arXiv - STAT - Other Statistics","volume":"117 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139556295","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Radius selection using kernel density estimation for the computation of nonlinear measures","authors":"Johan Medrano, Abderrahmane Kheddar, Annick Lesne, Sofiane Ramdani","doi":"arxiv-2401.03891","DOIUrl":"https://doi.org/arxiv-2401.03891","url":null,"abstract":"When nonlinear measures are estimated from sampled temporal signals with finite-length, a radius parameter must be carefully selected to avoid a poor estimation. These measures are generally derived from the correlation integral which quantifies the probability of finding neighbors, i.e. pair of points spaced by less than the radius parameter. While each nonlinear measure comes with several specific empirical rules to select a radius value, we provide a systematic selection method. We show that the optimal radius for nonlinear measures can be approximated by the optimal bandwidth of a Kernel Density Estimator (KDE) related to the correlation sum. The KDE framework provides non-parametric tools to approximate a density function from finite samples (e.g. histograms) and optimal methods to select a smoothing parameter, the bandwidth (e.g. bin width in histograms). We use results from KDE to derive a closed-form expression for the optimal radius. The latter is used to compute the correlation dimension and to construct recurrence plots yielding an estimate of Kolmogorov-Sinai entropy. We assess our method through numerical experiments on signals generated by nonlinear systems and experimental electroencephalographic time series.","PeriodicalId":501323,"journal":{"name":"arXiv - STAT - Other Statistics","volume":"254 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-01-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139412931","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Quotient geometry of bounded or fixed rank correlation matrices","authors":"Hengchao Chen","doi":"arxiv-2401.03126","DOIUrl":"https://doi.org/arxiv-2401.03126","url":null,"abstract":"This paper studies the quotient geometry of bounded or fixed-rank correlation matrices. The set of bounded-rank correlation matrices is in bijection with a quotient set of a spherical product manifold by an orthogonal group. We show that it admits an orbit space structure and its stratification is determined by the rank of the matrices. Also, the principal stratum has a compatible Riemannian quotient manifold structure. We develop efficient Riemannian optimization algorithms for computing the distance and the weighted Frechet mean in the orbit space. We prove that any minimizing geodesic in the orbit space has constant rank on the interior of the segment. Moreover, we examine geometric properties of the quotient manifold, including horizontal and vertical spaces, Riemannian metric, injectivity radius, exponential and logarithmic map, gradient and Hessian.","PeriodicalId":501323,"journal":{"name":"arXiv - STAT - Other Statistics","volume":"16 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-01-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139412935","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}