Foundations of data science (Springfield, Mo.): latest articles, page 2

GEOMETRIC STRUCTURE GUIDED MODEL AND ALGORITHMS FOR COMPLETE DECONVOLUTION OF GENE EXPRESSION DATA.

Foundations of data science (Springfield, Mo.) Pub Date : 2022-09-01 DOI: 10.3934/fods.2022013

Duan Chen, Shaoyu Li, Xue Wang

Abstract: Complete deconvolution analysis for bulk RNA-seq data is important and helpful to distinguish whether the differences of disease-associated GEPs (gene expression profiles) in tissues of patients and normal controls are due to changes in cellular composition of tissue samples, or due to GEP changes in specific cells. One of the major techniques to perform complete deconvolution is nonnegative matrix factorization (NMF), which also has a wide range of applications in the machine learning community. However, NMF is well known to be a strongly ill-posed problem, so a direct application of NMF to RNA-seq data suffers from severe difficulties in the interpretability of solutions. In this paper, we develop an NMF-based mathematical model and corresponding computational algorithms to improve the solution identifiability of deconvoluting bulk RNA-seq data. In our approach, we combine the biological concept of marker genes with the solvability conditions of the NMF theories, and develop a geometric-structure-guided optimization model. In this strategy, the geometric structure of bulk tissue data is first explored by the spectral clustering technique. Then, the identified information of marker genes is integrated as solvability constraints, while the overall correlation graph is used as manifold regularization. Both synthetic and biological data are used to validate the proposed model and algorithms, showing significantly improved solution interpretability and accuracy.

Pages: 441-466. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10798655/pdf/
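To make the ill-posedness mentioned in this abstract concrete, here is a minimal sketch of plain NMF with the standard Lee-Seung multiplicative updates, applied to synthetic "bulk" data. It is not the authors' geometric-structure-guided model: the function name `nmf_multiplicative`, the matrix sizes, and the synthetic data are all assumptions chosen for illustration. The last lines show that rescaling the factors leaves the product unchanged, which is one simple source of non-identifiability.

```python
import numpy as np

def nmf_multiplicative(X, rank, n_iter=1000, eps=1e-10, seed=0):
    """Plain NMF (hypothetical helper): X ~= W @ H with W, H >= 0."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = X.shape
    W = rng.random((n_rows, rank)) + eps
    H = rng.random((rank, n_cols)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep every entry nonnegative.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Synthetic bulk data (assumed sizes): 50 genes, 3 cell types, 20 samples.
rng = np.random.default_rng(1)
W_true = rng.random((50, 3))                    # cell-type expression profiles
H_true = rng.dirichlet(np.ones(3), size=20).T   # proportions, columns sum to 1
X = W_true @ H_true

W, H = nmf_multiplicative(X, rank=3)
rel_err = np.linalg.norm(X - W @ H) / np.linalg.norm(X)

# Non-identifiability: rescaling the factors gives a different (W, H)
# pair with exactly the same product.
D = np.diag([2.0, 0.5, 1.0])
W_alt, H_alt = W @ D, np.linalg.inv(D) @ H
same_product = np.allclose(W @ H, W_alt @ H_alt)
```

A small `rel_err` only says the product fits the data; it says nothing about whether `W` and `H` individually match the true profiles and proportions, which is why additional constraints such as the marker-gene conditions described in the abstract are needed.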

Citations: 0

ASPECTS OF TOPOLOGICAL APPROACHES FOR DATA SCIENCE.

Foundations of data science (Springfield, Mo.) Pub Date : 2022-06-01 DOI: 10.3934/fods.2022002

Jelena Grbić, Jie Wu, Kelin Xia, Guo-Wei Wei

Citations: 0

A log-Gaussian Cox process with sequential Monte Carlo for line narrowing in spectroscopy

Foundations of data science (Springfield, Mo.) Pub Date : 2022-02-26 DOI: 10.3934/fods.2023008

T. Harkonen, Emma Hannula, M. Moores, E. Vartiainen, L. Roininen

Citations: 0

Data based quantification of synchronization

Foundations of data science (Springfield, Mo.) Pub Date : 2022-01-01 DOI: 10.3934/fods.2022020

Citations: 1

Addressing confirmation bias in middle school data science education

Foundations of data science (Springfield, Mo.) Pub Date : 2022-01-01 DOI: 10.3934/fods.2021035

S. Hedges, Kim Given

Abstract: More research is needed involving middle school students' engagement in the statistical problem-solving process, particularly the beginning process steps: formulate a question and make a plan to collect data/consider the data. Further, the increased availability of large-scale electronically accessible data sets is an untapped area of study. This interpretive study examined middle school students' understanding of statistical concepts involved in making a plan to collect data to answer a statistical question within a social issue context using data available on the internet. Student artifacts, researcher notes, and audio and video recordings from nine groups of 20 seventh-grade students in two gifted education pull-out classes at a suburban middle school were used to answer the study research questions. Data were analyzed using a priori codes from previously developed frameworks and by using an inductive approach to find themes. Three themes that emerged from data related to confirmation bias. Some middle school students held preconceptions about the social issues they chose to study that biased their statistical questions. This in turn influenced the sources of data students used to answer their questions. Confirmation bias is a serious issue that is exacerbated by the endless sources of electronically available data. We argue that this type of bias should be addressed early in students' educational experiences. Based on the findings from this study, we offer recommendations for future research and implications for statistics and data science education.

Citations: 1

Statistical inference for persistent homology applied to simulated fMRI time series data

Foundations of data science (Springfield, Mo.) Pub Date : 2022-01-01 DOI: 10.3934/fods.2022014

H. Abdallah, Adam J. Regalski, Mohammad Behzad Kang, Maria Berishaj, N. Nnadi, Asadur Chowdury, V. Diwadkar, A. Salch

Abstract: Time-series data are amongst the most widely used in biomedical sciences, including domains such as functional Magnetic Resonance Imaging (fMRI). Structure within time series data can be captured by the tools of topological data analysis (TDA). Persistent homology is the most commonly used data-analytic tool in TDA, and can effectively summarize complex high-dimensional data into an interpretable 2-dimensional representation called a persistence diagram. Existing methods for statistical inference for persistent homology of data depend on an independence assumption being satisfied. While persistent homology can be computed for each time index in a time-series, time-series data often fail to satisfy the independence assumption. This paper develops a statistical test that obviates the independence assumption by implementing a multi-level block sampled Monte Carlo test with sets of persistence diagrams. Its efficacy for detecting task-dependent topological organization is then demonstrated on simulated fMRI data. This new statistical test is therefore suitable for analyzing persistent homology of fMRI data, and of non-independent data in general.

Citations: 3

Teaching data science to students in biology using R, RStudio and Learnr: Analysis of three years data

Foundations of data science (Springfield, Mo.) Pub Date : 2022-01-01 DOI: 10.3934/fods.2022022

G. Engels, P. Grosjean, Frédérique Artus

Abstract: We examine the impact of implementing active pedagogical methodologies in three successive data science courses for a biology curriculum at the University of Mons, Belgium. Blended learning and flipped classroom approaches were adopted, with an emphasis on project-based biological data analysis. Four successive types of exercises of increasing difficulty were proposed to the students. Tutorials written with the R package learnr were identified as a critical step to transition between theory and the application of the concepts. The cognitive workload needed to complete the learnr tutorials was measured for the three courses and it was only lower for the last course, suggesting students needed a long time to get used to their software environment (R, RStudio and git). Data relative to students' activity, collected primarily from the ongoing assessment, were also used to establish student profiles according to their learning strategies. Several suboptimal strategies were observed and discussed. Finally, the timing of students' contributions, and the intensity of teacher-learner interactions related to these contributions, were analyzed before, during and after the mandatory distance learning due to the COVID-19 lockdown. A lag phase was visible at the beginning of the first lockdown, but the students' work was not markedly affected during the second lockdown period, which lasted much longer.

Citations: 0

Applying topological data analysis to local search problems

Foundations of data science (Springfield, Mo.) Pub Date : 2022-01-01 DOI: 10.3934/fods.2022006

Erik Carlsson, J. Carlsson, Shannon Sweitzer

Abstract: We present an application of topological data analysis (TDA) to discrete optimization problems, which we show can improve the performance of the 2-opt local search method for the traveling salesman problem by simply applying standard Vietoris-Rips construction to a data set of trials. We then construct a simplicial complex which is specialized for this sort of simulated data set, determined by a stochastic matrix with a steady state vector $(P,\pi)$. When $P$ is induced from a random walk on a finite metric space, this complex exhibits similarities with standard constructions such as Vietoris-Rips on the data set, but is not sensitive to outliers, as sparsity is a natural feature of the construction. We interpret the persistent homology groups in several examples coming from random walks and discrete optimization, and illustrate how higher dimensional Betti numbers can be used to classify connected components, i.e. zero dimensional features in higher dimensions.
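For readers unfamiliar with the 2-opt local search that this abstract builds on, the following is a minimal sketch of the baseline method only. It does not include the TDA step described by the authors; the function names, the random city layout, and the tolerance are assumptions made for illustration.

```python
import math
import random

def tour_length(tour, pts):
    """Total length of the closed tour visiting pts in the given order."""
    n = len(tour)
    return sum(math.dist(pts[tour[i]], pts[tour[(i + 1) % n]]) for i in range(n))

def two_opt(tour, pts):
    """Apply 2-opt moves (reverse a segment) until none shortens the tour."""
    tour = list(tour)
    n = len(tour)
    improved = True
    while improved:
        improved = False
        for i in range(n - 1):
            for j in range(i + 2, n):
                if i == 0 and j == n - 1:
                    continue  # the two edges share a city, nothing to swap
                a, b = pts[tour[i]], pts[tour[i + 1]]
                c, d = pts[tour[j]], pts[tour[(j + 1) % n]]
                # Replace edges (a,b) and (c,d) by (a,c) and (b,d).
                delta = (math.dist(a, c) + math.dist(b, d)
                         - math.dist(a, b) - math.dist(c, d))
                if delta < -1e-12:
                    tour[i + 1:j + 1] = tour[i + 1:j + 1][::-1]
                    improved = True
    return tour

# Unit square visited in a self-crossing order; 2-opt removes the crossing.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
uncrossed = two_opt([0, 2, 1, 3], square)

# One random trial (assumed setup): 12 cities, identity tour as the start.
rng = random.Random(0)
pts = [(rng.random(), rng.random()) for _ in range(12)]
initial = list(range(12))
best = two_opt(initial, pts)
```

Each run ends in a local optimum that depends on the starting tour. A collection of such runs from different starts is the kind of "data set of trials" the abstract refers to.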

Citations: 2

Multimodal correlations-based data clustering

Foundations of data science (Springfield, Mo.) Pub Date : 2022-01-01 DOI: 10.3934/fods.2022011

Jia Chen, I. Schizas

Abstract: This work proposes a novel technique for clustering multimodal data according to their information content. Statistical correlations present in data that contain similar information are exploited to perform the clustering task. Specifically, multiset canonical correlation analysis is equipped with norm-one regularization mechanisms to identify clusters within different types of data that share the same information content. A pertinent minimization formulation is put forth, while block coordinate descent is employed to derive a batch clustering algorithm which achieves better clustering performance than existing alternatives. Relying on subgradient descent, an online clustering approach is derived which substantially lowers computational complexity compared to the batch approach, without significantly compromising the clustering performance. It is established that as the number of data increases, the novel regularized multiset framework is able to correctly cluster the multimodal data entries. Further, it is proved that the online clustering scheme converges with probability one to a stationary point of the ensemble regularized multiset correlations cost having the potential to recover the correct clusters. Extensive numerical tests demonstrate that the novel clustering scheme outperforms existing alternatives, while the online scheme achieves substantial computational savings.

Citations: 0

HOMOTOPY CONTINUATION FOR THE SPECTRA OF PERSISTENT LAPLACIANS.

Foundations of data science (Springfield, Mo.) Pub Date : 2021-12-01 DOI: 10.3934/fods.2021017

Xiaoqi Wei, Guo-Wei Wei

Abstract: The p-persistent q-combinatorial Laplacian defined for a pair of simplicial complexes is a generalization of the q-combinatorial Laplacian. Given a filtration, the spectra of persistent combinatorial Laplacians not only recover the persistent Betti numbers of persistent homology but also provide extra multiscale geometrical information of the data. Paired with machine learning algorithms, the persistent Laplacian has many potential applications in data science. Seeking different ways to find the spectrum of an operator is an active research topic, and it becomes especially interesting when ideas originate from multiple fields. In this work, we explore an alternative approach for the spectrum of persistent Laplacians. As the eigenvalues of a persistent Laplacian matrix are the roots of its characteristic polynomial, one may attempt to find the roots of the characteristic polynomial by homotopy continuation, and thus resolve the spectrum of the corresponding persistent Laplacian. We consider a set of simple polytopes and small molecules to prove the principle that algebraic topology, combinatorial graph, and algebraic geometry can be integrated to understand the shape of data.

Volume 3, Issue 4, pages 677-700. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9273002/pdf/nihms-1768199.pdf
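The idea in this abstract, finding eigenvalues as roots of the characteristic polynomial by homotopy continuation, can be sketched on a toy case. The example below is an illustration under stated assumptions, not the authors' implementation: it uses the ordinary graph Laplacian of a path on three vertices instead of a persistent Laplacian, a fixed-step tracker with a Newton corrector, and a hypothetical function name `track_roots` with an arbitrarily chosen complex constant `gamma`.

```python
import numpy as np

def track_roots(p, steps=2000, newton_iters=5, gamma=0.6 + 0.8j):
    """Follow the roots of x^n - 1 to the roots of p along
    H(x, t) = (1 - t) * gamma * (x^n - 1) + t * p(x), for t from 0 to 1."""
    p = np.asarray(p, dtype=complex)
    n = len(p) - 1
    start = np.zeros(n + 1, dtype=complex)
    start[0], start[-1] = 1.0, -1.0             # coefficients of x^n - 1
    x = np.exp(2j * np.pi * np.arange(n) / n)   # known roots at t = 0
    for k in range(1, steps + 1):
        t = k / steps
        h = (1 - t) * gamma * start + t * p     # coefficients of H(., t)
        dh = np.polyder(h)
        for _ in range(newton_iters):           # Newton corrector at fixed t
            x = x - np.polyval(h, x) / np.polyval(dh, x)
    return x

# Graph Laplacian of the path 0 - 1 - 2 (assumed toy input).
L = np.array([[1.0, -1.0, 0.0],
              [-1.0, 2.0, -1.0],
              [0.0, -1.0, 1.0]])
char_poly = np.poly(L)                          # characteristic polynomial
roots = np.sort(track_roots(char_poly).real)    # spectrum via continuation
eigs = np.linalg.eigvalsh(L)                    # reference spectrum
```

Here the tracked roots should agree with the eigenvalues 0, 1 and 3 returned by `np.linalg.eigvalsh`; the single zero eigenvalue reflects the one connected component of the path graph. A fixed step size is adequate for this well-separated toy spectrum, whereas repeated eigenvalues would call for a more careful tracker.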

Citations: 2