T. Harkonen, Emma Hannula, M. Moores, E. Vartiainen, L. Roininen
{"title":"A log-Gaussian Cox process with sequential Monte Carlo for line narrowing in spectroscopy","authors":"T. Harkonen, Emma Hannula, M. Moores, E. Vartiainen, L. Roininen","doi":"10.3934/fods.2023008","DOIUrl":"https://doi.org/10.3934/fods.2023008","url":null,"abstract":"We propose a statistical model for narrowing line shapes in spectroscopy that are well approximated as linear combinations of Lorentzian or Voigt functions. We introduce a log-Gaussian Cox process to represent the peak locations thereby providing uncertainty quantification for the line narrowing. Bayesian formulation of the method allows for robust and explicit inclusion of prior information as probability distributions for parameters of the model. Estimation of the signal and its parameters is performed using a sequential Monte Carlo algorithm followed by an optimization step to determine the peak locations. Our method is validated using a simulation study and applied to a mineralogical Raman spectrum.","PeriodicalId":73054,"journal":{"name":"Foundations of data science (Springfield, Mo.)","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2022-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45413111","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Data based quantification of synchronization","authors":"","doi":"10.3934/fods.2022020","DOIUrl":"https://doi.org/10.3934/fods.2022020","url":null,"abstract":"","PeriodicalId":73054,"journal":{"name":"Foundations of data science (Springfield, Mo.)","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"70248163","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Addressing confirmation bias in middle school data science education","authors":"S. Hedges, Kim Given","doi":"10.3934/fods.2021035","DOIUrl":"https://doi.org/10.3934/fods.2021035","url":null,"abstract":"More research is needed involving middle school students' engagement in the statistical problem-solving process, particularly the beginning process steps: formulate a question and make a plan to collect data/consider the data. Further, the increased availability of large-scale electronically accessible data sets is an untapped area of study. This interpretive study examined middle school students' understanding of statistical concepts involved in making a plan to collect data to answer a statistical question within a social issue context using data available on the internet. Student artifacts, researcher notes, and audio and video recordings from nine groups of 20 seventh-grade students in two gifted education pull-out classes at a suburban middle school were used to answer the study research questions. Data were analyzed using a priori codes from previously developed frameworks and by using an inductive approach to find themes. Three themes that emerged from the data related to confirmation bias. Some middle school students held preconceptions about the social issues they chose to study that biased their statistical questions. This in turn influenced the sources of data students used to answer their questions. Confirmation bias is a serious issue that is exacerbated by the endless sources of data available electronically. We argue that this type of bias should be addressed early in students' educational experiences. Based on the findings from this study, we offer recommendations for future research and implications for statistics and data science education.","PeriodicalId":73054,"journal":{"name":"Foundations of data science (Springfield, Mo.)","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"70248512","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
H. Abdallah, Adam J. Regalski, Mohammad Behzad Kang, Maria Berishaj, N. Nnadi, Asadur Chowdury, V. Diwadkar, A. Salch
{"title":"Statistical inference for persistent homology applied to simulated fMRI time series data","authors":"H. Abdallah, Adam J. Regalski, Mohammad Behzad Kang, Maria Berishaj, N. Nnadi, Asadur Chowdury, V. Diwadkar, A. Salch","doi":"10.3934/fods.2022014","DOIUrl":"https://doi.org/10.3934/fods.2022014","url":null,"abstract":"Time-series data are amongst the most widely used in biomedical sciences, including domains such as functional Magnetic Resonance Imaging (fMRI). Structure within time series data can be captured by the tools of topological data analysis (TDA). Persistent homology is the most commonly used data-analytic tool in TDA, and can effectively summarize complex high-dimensional data into an interpretable 2-dimensional representation called a persistence diagram. Existing methods for statistical inference for persistent homology of data depend on an independence assumption being satisfied. While persistent homology can be computed for each time index in a time-series, time-series data often fail to satisfy the independence assumption. This paper develops a statistical test that obviates the independence assumption by implementing a multi-level block sampled Monte Carlo test with sets of persistence diagrams. Its efficacy for detecting task-dependent topological organization is then demonstrated on simulated fMRI data. This new statistical test is therefore suitable for analyzing persistent homology of fMRI data, and of non-independent data in general.","PeriodicalId":73054,"journal":{"name":"Foundations of data science (Springfield, Mo.)","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"70248128","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Teaching data science to students in biology using R, RStudio and Learnr: Analysis of three years data","authors":"G. Engels, P. Grosjean, Frédérique Artus","doi":"10.3934/fods.2022022","DOIUrl":"https://doi.org/10.3934/fods.2022022","url":null,"abstract":"We examine the impact of implementing active pedagogical methodologies in three successive data science courses for a biology curriculum at the University of Mons, Belgium. Blended learning and flipped classroom approaches were adopted, with an emphasis on project-based biological data analysis. Four successive types of exercises of increasing difficulty were proposed to the students. Tutorials written with the R package learnr were identified as a critical step to transition between theory and the application of the concepts. The cognitive workload needed to complete the learnr tutorials was measured for the three courses and it was lower only for the last course, suggesting students needed a long time to get used to their software environment (R, RStudio and git). Data relative to students' activity, collected primarily from the ongoing assessment, were also used to establish student profiles according to their learning strategies. Several suboptimal strategies were observed and discussed. Finally, the timing of students' contributions and the intensity of teacher-learner interactions related to these contributions were analyzed before, during and after the mandatory distance learning due to the COVID-19 lockdown. A lag phase was visible at the beginning of the first lockdown, but the students' work was not markedly affected during the second lockdown period, which lasted much longer.","PeriodicalId":73054,"journal":{"name":"Foundations of data science (Springfield, Mo.)","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"70248176","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Applying topological data analysis to local search problems","authors":"Erik Carlsson, J. Carlsson, Shannon Sweitzer","doi":"10.3934/fods.2022006","DOIUrl":"https://doi.org/10.3934/fods.2022006","url":null,"abstract":"We present an application of topological data analysis (TDA) to discrete optimization problems, which we show can improve the performance of the 2-opt local search method for the traveling salesman problem by simply applying the standard Vietoris-Rips construction to a data set of trials. We then construct a simplicial complex which is specialized for this sort of simulated data set, determined by a stochastic matrix with a steady state vector $(P,\\pi)$. When $P$ is induced from a random walk on a finite metric space, this complex exhibits similarities with standard constructions such as Vietoris-Rips on the data set, but is not sensitive to outliers, as sparsity is a natural feature of the construction. We interpret the persistent homology groups in several examples coming from random walks and discrete optimization, and illustrate how higher-dimensional Betti numbers can be used to classify connected components, i.e., zero-dimensional features in higher dimensions.","PeriodicalId":73054,"journal":{"name":"Foundations of data science (Springfield, Mo.)","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"70248570","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multimodal correlations-based data clustering","authors":"Jia Chen, I. Schizas","doi":"10.3934/fods.2022011","DOIUrl":"https://doi.org/10.3934/fods.2022011","url":null,"abstract":"This work proposes a novel technique for clustering multimodal data according to their information content. Statistical correlations present in data that contain similar information are exploited to perform the clustering task. Specifically, multiset canonical correlation analysis is equipped with norm-one regularization mechanisms to identify clusters within different types of data that share the same information content. A pertinent minimization formulation is put forth, while block coordinate descent is employed to derive a batch clustering algorithm which achieves better clustering performance than existing alternatives. Relying on subgradient descent, an online clustering approach is derived which substantially lowers computational complexity compared to the batch approach, without significantly compromising the clustering performance. It is established that, as the number of data increases, the novel regularized multiset framework is able to correctly cluster the multimodal data entries. Further, it is proved that the online clustering scheme converges with probability one to a stationary point of the ensemble regularized multiset correlations cost having the potential to recover the correct clusters. Extensive numerical tests demonstrate that the novel clustering scheme outperforms existing alternatives, while the online scheme achieves substantial computational savings.","PeriodicalId":73054,"journal":{"name":"Foundations of data science (Springfield, Mo.)","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"70248118","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Homotopy continuation for the spectra of persistent Laplacians","authors":"Xiaoqi Wei, Guo-Wei Wei","doi":"10.3934/fods.2021017","DOIUrl":"https://doi.org/10.3934/fods.2021017","url":null,"abstract":"The p-persistent q-combinatorial Laplacian defined for a pair of simplicial complexes is a generalization of the q-combinatorial Laplacian. Given a filtration, the spectra of persistent combinatorial Laplacians not only recover the persistent Betti numbers of persistent homology but also provide extra multiscale geometrical information of the data. Paired with machine learning algorithms, the persistent Laplacian has many potential applications in data science. Seeking different ways to find the spectrum of an operator is an active research topic, and it becomes particularly interesting when ideas originate from multiple fields. In this work, we explore an alternative approach for the spectrum of persistent Laplacians. As the eigenvalues of a persistent Laplacian matrix are the roots of its characteristic polynomial, one may attempt to find the roots of the characteristic polynomial by homotopy continuation, thereby resolving the spectrum of the corresponding persistent Laplacian. We consider a set of simple polytopes and small molecules to prove the principle that algebraic topology, combinatorial graph, and algebraic geometry can be integrated to understand the shape of data.","PeriodicalId":73054,"journal":{"name":"Foundations of data science (Springfield, Mo.)","volume":"3 4","pages":"677-700"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9273002/pdf/nihms-1768199.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"40610845","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Analysis of the feedback particle filter with diffusion map based approximation of the gain","authors":"S. Pathiraja, W. Stannat","doi":"10.3934/fods.2021023","DOIUrl":"https://doi.org/10.3934/fods.2021023","url":null,"abstract":"Control-type particle filters have been receiving increasing attention over the last decade as a means of obtaining sample based approximations to the sequential Bayesian filtering problem in the nonlinear setting. Here we analyse one such type, namely the feedback particle filter and a recently proposed approximation of the associated gain function based on diffusion maps. The key purpose is to provide analytic insights on the form of the approximate gain, which are of interest in their own right. These are then used to establish a roadmap to obtaining well-posedness and convergence of the finite $N$ system to its mean field limit. A number of possible future research directions are also discussed.","PeriodicalId":73054,"journal":{"name":"Foundations of data science (Springfield, Mo.)","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42664532","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fast computation of persistent homology representatives with involuted persistent homology","authors":"Matija Čufar, Žiga Virk","doi":"10.3934/fods.2023006","DOIUrl":"https://doi.org/10.3934/fods.2023006","url":null,"abstract":"Persistent homology is typically computed through persistent cohomology. While this generally improves the running time significantly, it does not facilitate extraction of homology representatives. The mentioned representatives are geometric manifestations of the corresponding holes and often carry desirable information. We propose a new method of extraction of persistent homology representatives using cohomology. In a nutshell, we first compute persistent cohomology and use the obtained information to significantly improve the running time of the direct persistent homology computations. This algorithm applied to Rips filtrations generally computes persistent homology representatives much faster than the standard methods.","PeriodicalId":73054,"journal":{"name":"Foundations of data science (Springfield, Mo.)","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-05-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48820334","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}