{"title":"Stable Variable Selection for High-Dimensional Genomic Data with Strong Correlations","authors":"Reetika Sarkar, Sithija Manage, Xiaoli Gao","doi":"10.1007/s40745-023-00481-5","DOIUrl":"10.1007/s40745-023-00481-5","url":null,"abstract":"<div><p>High-dimensional genomic data studies are often found to exhibit strong correlations, which results in instability and inconsistency in the estimates obtained using commonly used regularization approaches including the Lasso and MCP, etc. In this paper, we perform comparative study of regularization approaches for variable selection under different correlation structures and propose a two-stage procedure named rPGBS to address the issue of stable variable selection in various strong correlation settings. This approach involves repeatedly running a two-stage hierarchical approach consisting of a random pseudo-group clustering and bi-level variable selection. Extensive simulation studies and high-dimensional genomic data analysis on real datasets have demonstrated the advantage of the proposed rPGBS method over some of the most used regularization methods. In particular, rPGBS results in more stable selection of variables across a variety of correlation settings, as compared to some recent methods addressing variable selection with strong correlations: Precision Lasso (Wang et al. in Bioinformatics 35:1181–1187, 2019) and Whitening Lasso (Zhu et al. in Bioinformatics 37:2238–2244, 2021). Moreover, rPGBS has been shown to be computationally efficient across various settings.\u0000</p></div>","PeriodicalId":36280,"journal":{"name":"Annals of Data Science","volume":"11 4","pages":"1139 - 1164"},"PeriodicalIF":0.0,"publicationDate":"2023-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135049935","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: On Progressively Censored Generalized X-Exponential Distribution: (Non) Bayesian Estimation with an Application to Bladder Cancer Data
Authors: Kousik Maiti, Suchandan Kayal, Aditi Kar Gangopadhyay
DOI: 10.1007/s40745-023-00477-1
Annals of Data Science 11(5): 1761–1798. Published 2023-06-15.
Abstract: This article addresses estimation of the parameters and reliability characteristics of a generalized X-Exponential distribution based on progressive type-II censored samples. The maximum likelihood estimates (MLEs) are obtained, and their existence and uniqueness are studied. Bayes estimates are derived under squared error and entropy loss functions and computed via a Markov chain Monte Carlo method. Bootstrap-t and bootstrap-p methods are used to construct interval estimates. Further, a simulation study compares the performance of the proposed estimates. Finally, a real-life dataset is analysed for illustrative purposes.
Title: A Survey on Differential Privacy for Medical Data Analysis
Authors: WeiKang Liu, Yanchun Zhang, Hong Yang, Qinxue Meng
DOI: 10.1007/s40745-023-00475-3
Annals of Data Science 11(2): 733–747. Published 2023-06-10.
Abstract: Machine learning methods promote the sustainable development of wise information technology of medicine (WITMED), and the variety of available medical data brings high value and convenience to medical analysis. However, applications of medical data are also confronted with a risk of privacy leakage that is hard to avoid, especially when conducting correlation analysis or sharing data among multiple institutions. Data security and privacy preservation have therefore come to play an essential role in secure and private medical data analysis, and many differential privacy strategies are applied to medical data publishing and mining. In this paper, we survey research on the application of differential privacy to medical data analysis, discussing the necessity of privacy preservation in medicine, the advantages of differential privacy, and its application to typical medical data such as genomic data and wearable device data. Furthermore, we discuss the challenges and potential future research directions for differential privacy in medical applications.
{"title":"Naïve Bayes Classifier Model for Detecting Spam Mails","authors":"Shrawan Kumar, Kavita Gupta, Manya Gupta","doi":"10.1007/s40745-023-00479-z","DOIUrl":"10.1007/s40745-023-00479-z","url":null,"abstract":"<div><p>In this paper, the machine learning algorithm Naive Bayes Classifier is applied to the Kaggle spam mails dataset to classify the emails in our inbox as spam or ham. The dataset is made up of two main attributes: type and text. The target variable \"Type\" has two factors: ham and spam. The text variable contains the text messages that will be classified as spam or ham. The results are obtained by employing two different Laplace values. It is up to the decision maker to select error tolerance in ham and spam messages derived from two different Laplace values. Computing software R is used for data analysis.</p></div>","PeriodicalId":36280,"journal":{"name":"Annals of Data Science","volume":"11 6","pages":"1887 - 1897"},"PeriodicalIF":0.0,"publicationDate":"2023-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43486989","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Artificial Intelligence Algorithms for Collaborative Book Recommender Systems
Authors: Clemens Tegetmeier, Arne Johannssen, Nataliya Chukhrova
DOI: 10.1007/s40745-023-00474-4
Annals of Data Science 11(5): 1705–1739. Published 2023-06-08. Open access PDF: https://link.springer.com/content/pdf/10.1007/s40745-023-00474-4.pdf
Abstract: Book recommender systems provide personalized book recommendations to users based on their previous searches or purchases. As online book trading has become increasingly important in recent years, artificial intelligence (AI) algorithms are needed to recommend suitable books to users and encourage them to make purchasing decisions in both the short and the long run. In this paper, we consider AI algorithms for so-called collaborative book recommender systems, in particular the matrix factorization algorithm trained with stochastic gradient descent and the book-based k-nearest-neighbor algorithm. We perform a comprehensive case study based on the Book-Crossing benchmark dataset and implement several variants of both AI algorithms to predict unknown book ratings and to recommend books to individual users based on the highest predicted ratings. The study evaluates the quality of the implemented methods using selected evaluation metrics for AI algorithms.
{"title":"On Poisson Moment Exponential Distribution with Associated Regression and INAR(1) Process","authors":"R. Maya, Jie Huang, M. R. Irshad, Fukang Zhu","doi":"10.1007/s40745-023-00476-2","DOIUrl":"10.1007/s40745-023-00476-2","url":null,"abstract":"<div><p>Numerous studies have emphasised the significance of count data modeling and its applications to phenomena that occur in the real world. From this perspective, this article examines the traits and applications of the Poisson-moment exponential (PME) distribution in the contexts of time series analysis and regression analysis for real-world phenomena. The PME distribution is a novel one-parameter discrete distribution that can be used as a powerful alternative for the existing distributions for modeling over-dispersed count datasets. The advantages of the PME distribution, including the simplicity of the probability mass function and the explicit expressions of the functions of all the statistical properties, drove us to develop the inferential aspects and learn more about its practical applications. The unknown parameter is estimated using both maximum likelihood and moment estimation methods. Also, we present a parametric regression model based on the PME distribution for the count datasets. To strengthen the utility of the suggested distribution, we propose a new first-order integer-valued autoregressive (INAR(1)) process with PME innovations based on binomial thinning for modeling integer-valued time series with over-dispersion. Application to four real datasets confirms the empirical significance of the proposed model.</p></div>","PeriodicalId":36280,"journal":{"name":"Annals of Data Science","volume":"11 5","pages":"1741 - 1759"},"PeriodicalIF":0.0,"publicationDate":"2023-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43264212","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A New Compound Distribution and Its Applications in Over-dispersed Count Data","authors":"Peer Bilal Ahmad, Mohammad Kafeel Wani","doi":"10.1007/s40745-023-00478-0","DOIUrl":"10.1007/s40745-023-00478-0","url":null,"abstract":"<div><p>Every time variance exceeds mean, over-dispersed models are typically employed. This is the reason that over-dispersed models are such an important aspect of statistical modeling. In this work, the parameter of Poisson distribution is assumed to follow a new lifespan distribution called as Chris-Jerry distribution. The resulting compound distribution is an over-dispersed model known as the Poisson-Chris-Jerry distribution. As a result of deriving a general expression for the <i>r th</i> factorial moment, we acquired the moments about origin and the central moments. In addition to this, moment’s related measurements, generating functions, over-dispersion property, reliability characteristics, recurrence relation for probability, and other statistical qualities, have also been described. For the goal of estimating parameter of the suggested model, the maximum likelihood estimation and method of moment estimation have been addressed. The usefulness of maximum likelihood estimates has also been taken into consideration through a simulation study. We employed four real life data sets, examined the goodness-of-fit test, and considered additional standards such as the Akaike’s information criterion and Bayesian information criterion. The outcomes are compared with several potential models.</p></div>","PeriodicalId":36280,"journal":{"name":"Annals of Data Science","volume":"11 5","pages":"1799 - 1820"},"PeriodicalIF":0.0,"publicationDate":"2023-06-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46822534","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Utilization of Priori Information in the Estimation of Population Mean for Time-Based Surveys","authors":"Sanjay Kumar, Priyanka Chhaparwal","doi":"10.1007/s40745-023-00472-6","DOIUrl":"10.1007/s40745-023-00472-6","url":null,"abstract":"<div><p>Use of a priori information is very common at an estimation stage to form an estimator of a population parameter. Estimation problems can lead to more accurate and efficient estimates using prior information. In this study, we utilized the information from the past surveys along with the information available from the current surveys in the form of a hybrid exponentially weighted moving average to suggest the estimator of the population mean using a known coefficient of variation of the study variable for time-based surveys. We derived the expression of the mean square error of the suggested estimator and established the mathematical conditions to prove the efficiency of the suggested estimator. The results showed that the utilization of information from past surveys and current surveys excels the estimator's efficiency. A simulation study and a real-life example are provided to support using the suggested estimator.</p></div>","PeriodicalId":36280,"journal":{"name":"Annals of Data Science","volume":"11 5","pages":"1675 - 1685"},"PeriodicalIF":0.0,"publicationDate":"2023-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45425769","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: LADDERS: Log Based Anomaly Detection and Diagnosis for Enterprise Systems
Authors: Sakib A. Mondal, Prashanth Rv, Sagar Rao, Arun Menon
DOI: 10.1007/s40745-023-00471-7
Annals of Data Science 11(4): 1165–1183. Published 2023-06-04.
Abstract: Enterprise software can fail not only because of malfunctioning application servers, but also because of performance degradation or unavailability of other servers or middle layers. Consequently, valuable time and resources are wasted in trying to identify the root cause of software failures. To address this, we have developed a framework called LADDERS. In LADDERS, anomalous incidents are detected from log events generated by various systems and KPIs (key performance indicators) through an ensemble of supervised and unsupervised models. Without transaction identifiers, it is not possible to relate events from different systems, so LADDERS implements Recursive Parallel Causal Discovery (RPCD) to establish causal relationships among log events. The framework builds coresets using BICO to manage high volumes of log data during training and inference. A single anomaly can trigger further anomalies throughout the systems, and LADDERS applies RPCD again to discover causal relationships among these anomalous events. Probable root causes are then revealed from the causal graph and the anomaly ratings of events using a k-shortest path algorithm. We evaluated LADDERS on live logs from an enterprise system; the results demonstrate its effectiveness and efficiency for anomaly detection.
{"title":"Jump-Drop Adjusted Prediction of Cumulative Infected Cases Using the Modified SIS Model","authors":"Rashi Mohta, Sravya Prathapani, Palash Ghosh","doi":"10.1007/s40745-023-00467-3","DOIUrl":"10.1007/s40745-023-00467-3","url":null,"abstract":"<div><p>Accurate prediction of cumulative COVID-19 infected cases is essential for effectively managing the limited healthcare resources in India. Historically, epidemiological models have helped in controlling such epidemics. Models require accurate historical data to predict future outcomes. In our data, there were days exhibiting erratic, apparently anomalous jumps and drops in the number of daily reported COVID-19 infected cases that did not conform with the overall trend. Including those observations in the training data would most likely worsen model predictive accuracy. However, with existing epidemiological models it is not straightforward to determine, for a specific day, whether or not an outcome should be considered anomalous. In this work, we propose an algorithm to automatically identify anomalous ‘jump’ and ‘drop’ days, and then based upon the overall trend, the number of daily infected cases for those days is adjusted and the training data is amended using the adjusted observations. We applied the algorithm in conjunction with a recently proposed, modified Susceptible-Infected-Susceptible (SIS) model to demonstrate that prediction accuracy is improved after adjusting training data counts for apparent erratic anomalous jumps and drops.</p></div>","PeriodicalId":36280,"journal":{"name":"Annals of Data Science","volume":"11 3","pages":"959 - 978"},"PeriodicalIF":0.0,"publicationDate":"2023-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135086225","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}