{"title":"The Concept of Statistical Evidence","authors":"Michael Evans","doi":"10.11159/icsta21.002","DOIUrl":"https://doi.org/10.11159/icsta21.002","url":null,"abstract":"The concept of statistical evidence has proven to be somewhat elusive in the development of the discipline of Statistics. Still there is a conviction that appropriately collected data contains evidence concerning the answers to questions of scientific interest. We discuss some of the attempts at making the concept of evidence precise and, in particular, present an approach based upon measuring how beliefs change from a priori to a posteriori. Of necessity this is Bayesian in nature as a proper prior is required that reflects beliefs about where the truth lies before the data is observed. Bayesian inference is often criticized for its subjective nature. It is possible, however, to deal with this subjectivity in a scientifically sound manner. In part, this is done by assessing and controlling the bias the prior and model induce into inferences and this depends intrinsically on being clear about statistical evidence. In addition, the model and the prior are falsifiable through model checking and checking for prior-data conflict. Both the assessment of bias and the falsification steps are essentially frequentist in nature so this provides a degree of unity between sometimes conflicting philosophies. This approach to statistical reasoning can be seen as dealing with the inevitable subjectivity required in the choice of ingredients to an analysis so that a statistical analysis can approach the goal of objectivity that is central to scientific work.","PeriodicalId":403959,"journal":{"name":"Proceedings of the 3rd International Conference on Statistics: Theory and Applications","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134238665","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Robustness of Gaussian Mixture Reduction for Split-and-Conquer Learning of Finite Gaussian Mixtures","authors":"Qiong Zhang, Jiahua Chen","doi":"10.11159/icsta21.135","DOIUrl":"https://doi.org/10.11159/icsta21.135","url":null,"abstract":"In the era of big data, there is an increasing demand for split-and-conquer learning of finite mixture models. Recent work [1] proposes several split-and-conquer approaches for learning finite Gaussian mixtures and they are found to be both statistically and computationally efficient when the order of the mixture is correctly specified. Due to the nature of mixture models, correctly specifying the order of mixture on local machines can be an unrealistic assumption. In this paper, we evaluate the performance of several split-andconquer learning approaches, both when the order is correct and when it is over-specified on the local machines, based on simulations. We find that there is a trade-off between robustness and computational efficiency: the computationally intensive approach is robust against over-specification, while the two computationally friendly approaches have compromised statistical performance when the order is over-specified. The results suggest that the information in the data about the true distribution is not lost in the split step of the learning, and aggregation strategies must be developed in a computationally and statistically efficient way.","PeriodicalId":403959,"journal":{"name":"Proceedings of the 3rd International Conference on Statistics: Theory and Applications","volume":"65 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133420761","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Identification of Underlying Partial Differential Equations from Noisy Data with Splines","authors":"X. Huo","doi":"10.11159/icsta21.005","DOIUrl":"https://doi.org/10.11159/icsta21.005","url":null,"abstract":"","PeriodicalId":403959,"journal":{"name":"Proceedings of the 3rd International Conference on Statistics: Theory and Applications","volume":"82 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131696588","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Prediction Model for the Result of Percutaneous Coronary Intervention in Coronary Chronic Total Occlusions","authors":"Maria Ganopoulou, G. Sianos, I. Kangelidis, L. Angelis","doi":"10.11159/icsta21.129","DOIUrl":"https://doi.org/10.11159/icsta21.129","url":null,"abstract":"Coronary chronic total occlusions (CTOs) are very common in patients undergoing coronary angiography. There has been an increasing acceptance of the percutaneous coronary interventions (PCI) in CTOs. The success rate of PCI has been boosted over the last few years by, among else, operator experience and advances in technology, even achieving levels of approximately 90%. This study proposes a prediction model for the classification of the cases in successful and unsuccessful operations and addresses the problem of class imbalance in the response variable (operation result). It is based on the EuroCTO Registry, which is the largest database available worldwide consisting of 29,995 cases for the period 2008-2018. Binary logistic regression analysis and down-sampling were applied within a customized step-algorithm and standard statistical accuracy measures were employed for the assessment of the prediction model, such as sensitivity, specificity and the value of the area under the ROC (AUROC) curve. The analysis revealed new predictive factors, validating at the same time the impact of well-known predictors. A brief comparison has been performed with other models from the literature, which showed that the proposed model performs similarly or better than its contemporary competitors.","PeriodicalId":403959,"journal":{"name":"Proceedings of the 3rd International Conference on Statistics: Theory and Applications","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129432646","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Impact of Entity Resolution on Observed Social Network Structure","authors":"Abby M. Smith","doi":"10.11159/icsta21.136","DOIUrl":"https://doi.org/10.11159/icsta21.136","url":null,"abstract":"Extended Abstract Deduplication, also referred to as \"entity resolution\", is a common and crucial pre-processing step in the construction of social networks [1]. Citation network studies have indicated that false “splitting” and “lumping” of nodes can have dramatic downstream network impacts, and choices in deduplication methods are important for network analysis [2] [3]. Traditional deduplication methods compare the attributes (such as name and age) of potential matching pairs to estimate a match probability for a pair. Fellegi and Sunter (1969) [4] introduced an optimal decision threshold where above a certain matching score, pairs are declared a match, and below that threshold, pairs are considered a non-match. Recently research has used clustering techniques for entity resolution, where each cluster represents a unique underlying entity. Collective clustering techniques, pioneered by Bhattacharya and Getoor (2007) [5], relax unrealistic assumptions made by earlier probabilistic entity resolution techniques and allow matching decisions to be made dependent on each other. In social network datasets, we can also use relational information (e.g., a person’s network ties) in deduplication as further evidence for matching status of pair. Entity resolution is inherently an imperfect process and is an outcome of existing measurement error, particularly when there is a lack of a manually-reviewed, \"ground-truth\" dataset to rely on for parameter tuning in a chosen technique [6]. I focus on two tuning parameters: the match decision threshold (t) in Felligi-Sunter, and the alpha trade-off parameter between attributional and relational similarity in Bhattacarya-Getoor. My work is focused on methods for evaluating entity resolution in a network setting, measuring the sensitivity of entity resolution results to choices in tuning parameters (alpha and t), and the downstream impacts these parameter choices can have on network metrics and topologies such as degree, closeness, and connectivity. I apply the evaluation methods to two real-world ego-centric network studies, (i) Care2Hope, a respondentdriven sample of rural people who use drugs (PWUD) in Appalachian Kentucky [1], and (ii) RADAR, a longitudinal network study of young men in Chicago who have sex with men. I consider evaluation scenarios in both the presence [7] and absence [8] of “ground truth” data . I discuss implications these findings could have for drug use and HIV policy, and make reporting recommendations for network analysts.","PeriodicalId":403959,"journal":{"name":"Proceedings of the 3rd International Conference on Statistics: Theory and Applications","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130193482","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Statistical Challenges for Studying Replication","authors":"J. Schauer","doi":"10.11159/icsta21.006","DOIUrl":"https://doi.org/10.11159/icsta21.006","url":null,"abstract":"Recent empirical research has questioned the replicability of scientific findings in various fields, including medicine, economics, and psychology. This research has also revealed that there is no clear-cut definition or standard analysis methods for replication. As a result, there has been substantial ambiguity over the proper way to design and analyze replication studies. This talk describes statistical considerations for studying replication, and examines their implications. It identifies some surprising statistical strengths and limitations of previous research, including the use of statistical methods with surprisingly high error rates. It then argues that such issues can be avoided in future efforts by taking into account key statistical considerations in the planning and analysis of replication studies.","PeriodicalId":403959,"journal":{"name":"Proceedings of the 3rd International Conference on Statistics: Theory and Applications","volume":"108 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123615267","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Stochastic Version of EM Algorithm for Nonlinear Random ChangePoint Models","authors":"Hongbin Zhang, Binod Manandhar","doi":"10.11159/icsta21.119","DOIUrl":"https://doi.org/10.11159/icsta21.119","url":null,"abstract":"Random effect change-point models are commonly used to infer individual-specific time of event that induces trend change of longitudinal data. Linear models are often employed before and after the change point. However, in applications such as HIV studies, a mechanistic nonlinear model can be derived for the process based on the underlying data-generation mechanisms and such nonlinear model may provide better ``predictions\". In this article, we propose a random change-point model in which we model the longitudinal data by segmented nonlinear mixed effect models. Inference wise, we propose a maximum likelihood solution where we use the Stochastic Expectation-Maximization (StEM) algorithm coupled with independent multivariate rejection sampling through Gibbs’s sampler. We evaluate the method with simulations to gain insights.","PeriodicalId":403959,"journal":{"name":"Proceedings of the 3rd International Conference on Statistics: Theory and Applications","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115779844","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Sufficient Dimension Reduction with Deep Neural Networks for Phenotype Prediction","authors":"Siqi Liang, Wei-Heng Huang, F. Liang","doi":"10.11159/icsta21.134","DOIUrl":"https://doi.org/10.11159/icsta21.134","url":null,"abstract":"Phenotype prediction with genome-wide SNPs or biomarkers is a difficult problem in biomedical research due to many issues, such as nonlinearity of the underlying genetic mapping, high-dimensionality of SNP data, and insufficiency of training samples. To tackle this difficulty, we propose a split-and-merge deep neural network (SM-DNN) method, which employs the split-and-merge technique on deep neural networks to obtain nonlinear sufficient dimension reduction of the input data and then learn a deep neural network on the dimension reduced data. We show that the DNN-based dimension reduction is sufficient, which retains all information on response contained in the explanatory data. Our numerical experiments indicate that the SM-DNN method can lead to significant improvement in phenotype prediction for a variety of real data examples. In particular, with only rare variants, we achieved a remarkable prediction accuracy of over 74% for the Early-Onset Myocardial Infarction (EOMI) exome sequence data.","PeriodicalId":403959,"journal":{"name":"Proceedings of the 3rd International Conference on Statistics: Theory and Applications","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123928236","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On the Analytic Power of Divide & Recombine (D&R)","authors":"W. Cleveland","doi":"10.11159/icsta21.003","DOIUrl":"https://doi.org/10.11159/icsta21.003","url":null,"abstract":"In D&R (aka Split & Conquer), the data are divided into subsets. The division serves as a base for analysis of big data and for data visualization. Different analytic processes are applied to the subsets that constitute a recombination of the information in the data. For big data there are three scenarios. (1) The division is based on the subject matter, e.g., financial data for 100 banks; the division is by bank, and the 100 outputs of analytic methods are further analyzed. (2) An analytic method is applied to each subset, and the outputs are recombined with a recombination method applied to get one result for all of the data. This can provide, for all if the data, estimates of parameters or more complex information such as a likelihood function. D&R research consists of finding division and recombination methods that maximize statistical accuracy. Parallel distributed environments like Hadoop and Spark provide high computational performance for (1) and (2). (3) Similarly, an analytic method is applied to all subsets, but an iterative MM algorithm is used for optimization, e.g., maximum likelihood, that among other nice properties can avoid very large matrix inversion, turn a non-differentiable problem into a smooth problem, etc. For visualization, subsets are created by conditioning on one more variables of the analysis to create subsets of the other variables in the analysis. The subsets are displayed using the Trellis Display framework of multi-panel display. This provides a very powerful mechanism for exploratory study of multi-dimensional datasets, modeling the data, and understanding the results of analysis.","PeriodicalId":403959,"journal":{"name":"Proceedings of the 3rd International Conference on Statistics: Theory and Applications","volume":"109 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116683058","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Statistical Analysis of Measurements in Exact and Inexact Sciences: An Open Problem","authors":"L. Q. Amaral","doi":"10.11159/icsta21.126","DOIUrl":"https://doi.org/10.11159/icsta21.126","url":null,"abstract":"Differences between statistical analysis of measurements in exact and inexact sciences are the focus of this work. The early and independent beginning of Probability and Statistics had a theoretical synthesis, with an initial development based in Physics and Astronomy. This lead to Error Theory, used in Statistics of Measurements in Exact sciences, with defined criteria of validity. This direction of Mathematical Physics resulted in the progresses and achievements in Classical Physics, and also on established ways of treating measurements of physical properties. It is discussed that Exact Sciences treat only Inanimate Matter, and things that can be defined and measured, in terms of only seven fundamental physical quantities, with the definition of the International System of Units (SI). On the other hand a direction of Mathematical Statistics emerged later on, based on “Sampling”, to study properties of a population, with criteria of significance, within validity intervals, which depend on the size and characteristics of the studied sample, and on the inferences to be made in the research. These are two very different approaches, but both use probability density functions related to hypothesis about data. The modern inferential sampling statistics can be applied to all practical problems, in particular in Biology and Humanities, where there are “models”, but not Theories as in Physics. The word “theory” is many times used in a mistaken way. Life and Human Sciences use this modern type of Statistics. This paper discusses a particular case, in which the same ensemble of experimental results in samples of biological origin (hairs from hominoids) can be analyzed with the two different statistical approaches, in a proposal for Human Evolution, and the conditions for inference of accurate conclusions are discussed. A philosophical discussion between subjective and objective criteria of the researcher is made, and also of the concept of knowledge.","PeriodicalId":403959,"journal":{"name":"Proceedings of the 3rd International Conference on Statistics: Theory and Applications","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134472512","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}