{"title":"Predicting the Functional Changes in Protein Mutations Through the Application of BiLSTM and the Self-Attention Mechanism","authors":"Zixuan Fan, Yan Xu","doi":"10.1007/s40745-024-00530-7","DOIUrl":"10.1007/s40745-024-00530-7","url":null,"abstract":"<div><p>In the field of bioinformatics, changes in protein functionality are mainly influenced by protein mutations. Accurately predicting these functional changes can enhance our understanding of evolutionary mechanisms, promote developments in protein engineering-related fields, and accelerate progress in medical research. In this study, we introduced two different models: one based on bidirectional long short-term memory (BiLSTM), and the other based on self-attention. These models were integrated using a weighted fusion method to predict protein functional changes associated with mutation sites. The findings indicate that the model's predictive precision matches that of the current model, along with its capacity for generalization. Furthermore, the ensemble model surpasses the performance of the single models, highlighting the value of utilizing their synergistic capabilities. This finding may improve the accuracy of predicting protein functional changes associated with mutations and has potential applications in protein engineering and drug research. We evaluated the efficacy of our models under different scenarios by comparing the predicted results of protein functional changes across various numbers of mutation sites. 
As the number of mutation sites increases, the prediction accuracy decreases significantly, highlighting the inherent limitations of these models in handling cases involving more mutation sites.</p></div>","PeriodicalId":36280,"journal":{"name":"Annals of Data Science","volume":"11 3","pages":"1077 - 1094"},"PeriodicalIF":0.0,"publicationDate":"2024-04-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140656386","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
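The weighted-fusion step described in the abstract above can be sketched as follows. The component networks are not reproduced: `p_bilstm` and `p_attn` stand in for the class probabilities the BiLSTM and self-attention models would emit, and the weight `w` is a hypothetical tuning parameter, not a value taken from the paper.

```python
def fuse_predictions(p_bilstm, p_attn, w=0.6):
    """Weighted fusion of two models' class probabilities.

    p_bilstm, p_attn: probability lists over the same classes from the
    two component models. `w` weights the BiLSTM branch and (1 - w) the
    self-attention branch; 0.6 is a hypothetical default, not the
    paper's value.
    """
    return [w * a + (1 - w) * b for a, b in zip(p_bilstm, p_attn)]

# Toy example: two branches scoring "function-changing" vs. "neutral".
fused = fuse_predictions([0.8, 0.2], [0.6, 0.4], w=0.5)
# fused is approximately [0.7, 0.3]
```

In practice `w` would be selected on a validation set; the fused scores are then thresholded or argmax-ed to produce the final label.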
{"title":"Research on Intelligent Courses in English Education based on Neural Networks","authors":"Huimin Yao, Haiyan Wang","doi":"10.1007/s40745-024-00528-1","DOIUrl":"10.1007/s40745-024-00528-1","url":null,"abstract":"<div><p>Accurately predicting students’ performance plays a crucial role in achieving the intellectualization of courses. This paper studied intelligent courses in English education based on neural networks and designed a firefly algorithm-back propagation neural network (FA-BPNN) method. The correlation between various features and final grades was calculated using the students’ online learning data. Features with higher correlation were selected as the input for the FA-BPNN algorithm to estimate the final score that students achieved in the “College English” course. It was found that the training time of the FA-BPNN algorithm was 3.42 s, the root-mean-square error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE) values of the FA-BPNN algorithm were 0.986, 0.622, and 0.205, respectively. They were lower than those of the BPNN, genetic algorithm (GA)-BPNN, and particle swarm optimization (PSO)-BPNN algorithms, as well as the adaptive neuro-fuzzy inference system approach. The results indicated the efficacy of the FA for optimizing the parameters of the BPNN algorithm. The comparison between the predicted results and actual values suggested that the average error of the FA-BPNN algorithm was only 0.5, which was the smallest. 
The experimental results demonstrate the reliability of the FA-BPNN algorithm for performance prediction and its practical application feasibility.</p></div>","PeriodicalId":36280,"journal":{"name":"Annals of Data Science","volume":"11 3","pages":"1095 - 1107"},"PeriodicalIF":0.0,"publicationDate":"2024-04-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140653938","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
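The RMSE, MAE, and MAPE figures reported above can be computed as follows; this is a generic sketch with illustrative data, not the study's dataset.

```python
import math

def rmse(y_true, y_pred):
    """Root-mean-square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mape(y_true, y_pred):
    """Mean absolute percentage error, as a fraction (0.205 = 20.5%)."""
    return sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

# Illustrative final scores vs. predictions (not the paper's data).
y_true = [80.0, 75.0, 90.0]
y_pred = [78.0, 77.0, 91.0]
errors = (rmse(y_true, y_pred), mae(y_true, y_pred), mape(y_true, y_pred))
```

Note that MAPE is undefined when a true value is zero, which is not an issue for course scores on a positive scale.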
{"title":"Half Logistic Generalized Rayleigh Distribution for Modeling Hydrological Data","authors":"Adebisi A. Ogunde, Subhankar Dutta, Ehab M. Almetawally","doi":"10.1007/s40745-024-00527-2","DOIUrl":"10.1007/s40745-024-00527-2","url":null,"abstract":"<div><p>This article introduced a three-parameter extension of the Generalized Rayleigh distribution called half-logistic Generalized Rayleigh distribution, which has submodels the Generalized Rayleigh and Rayleigh distribution. The proposed model is quite flexible and adaptable to model any kind of life-time data. Its probability density function may sometimes be unimodal and its corresponding hazard rate may be of monotone or non-monotone shape. Standard statistical properties such as it ordinary and incomplete moments, quantile function, moment generating function, reliability function, stochastic ordering, order statistics, Renyi, and <span>({varvec{delta}})</span>-entropy are obtained. The maximum likelihood method is used to obtain the estimates of the model parameters. Two practical examples of hydrological data sets are presented.</p></div>","PeriodicalId":36280,"journal":{"name":"Annals of Data Science","volume":"12 2","pages":"667 - 694"},"PeriodicalIF":0.0,"publicationDate":"2024-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140686249","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"One-Inflated Zero-Truncated Poisson Distribution: Statistical Properties and Real Life Applications","authors":"Mohammad Kafeel Wani, Peer Bilal Ahmad","doi":"10.1007/s40745-024-00526-3","DOIUrl":"10.1007/s40745-024-00526-3","url":null,"abstract":"<div><p>Agriculture, engineering, public health, sociology, psychology, and epidemiology are just few of the numerous disciplines that find analysis and modeling of zero-truncated count data to be of paramount importance. Very recently, researchers have been paying careful attention to the one-inflation implications of these zero-truncated count statistics. In this regard, we have studied the one-inflated variant of the zero-truncated Poisson distribution. There are few models within the proposed distribution, which itself is a representation of a two-part process. We have calculated crucial statistical characteristics of the suggested model which are not confined to generating functions, moments and associated measures. The parametric estimation has been carried out using the maximum likelihood estimation. Two different simulation studies have been carried out, one to test the performance of maximum likelihood estimates and the other for testing the compatibility of our devised model when data has been simulated from different competing models with considerably higher mass at point one. For the purpose of testing the compatibility of our proposed model, we have used three real life data sets and considered theoretical as well as graphical performance measures. The fitting results have been compared with some other models of interest. Moreover, we have used three different test statistics viz. 
the likelihood ratio test, Wald’s test, and Rao’s efficient score test, for testing the significance of the one-inflation parameter.</p></div>","PeriodicalId":36280,"journal":{"name":"Annals of Data Science","volume":"12 2","pages":"639 - 666"},"PeriodicalIF":0.0,"publicationDate":"2024-04-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140693209","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
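Assuming the usual parameterization of this family (a mixing weight pi placed on the point one, with a zero-truncated Poisson otherwise), the distribution discussed above can be sketched as follows; the paper's exact notation may differ.

```python
import math

def ztp_pmf(x, lam):
    """Zero-truncated Poisson pmf for x = 1, 2, ..."""
    return lam ** x / (math.factorial(x) * (math.exp(lam) - 1.0))

def oiztp_pmf(x, lam, pi):
    """One-inflated zero-truncated Poisson: a two-part process that
    places extra mass pi at x = 1 and weight (1 - pi) on the ZTP."""
    base = (1.0 - pi) * ztp_pmf(x, lam)
    return base + pi if x == 1 else base

# The pmf sums to 1 over x = 1, 2, ... (the tail beyond 60 is negligible).
total = sum(oiztp_pmf(x, 2.0, 0.3) for x in range(1, 60))
```

Setting pi = 0 recovers the plain zero-truncated Poisson, which is why a significance test on the one-inflation parameter (as in the abstract) distinguishes the two models.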
{"title":"An Improved Boosting Bald Eagle Search Algorithm with Improved African Vultures Optimization Algorithm for Data Clustering","authors":"Farhad Soleimanian Gharehchopogh","doi":"10.1007/s40745-024-00525-4","DOIUrl":"10.1007/s40745-024-00525-4","url":null,"abstract":"<div><p>Data clustering is one of the main issues in the optimization problem. It is the process of clustering a group of items into several groups. Items within each group have the greatest similarity and the least similarity to things in other groups. It is employed in various domains and applications, including biology, business, and consumer analysis, document clustering, web, banking, and image processing, to name a few. In this paper, two new methods are proposed using hybridization of the Bald Eagle Search (BES) Algorithm with the African Vultures Optimization Algorithm (AVOA) (BESAVOA) and BESAVOA with Opposition Based Learning (BESAVOA-OBL) for data clustering. AVOA is used to find the centers of the clusters and improve the centrality of the groups obtained by the BES algorithm. Primary vectors are created based on the population of eagles, and then each vector is used BESAVOA to search the centers of the clusters. The proposed methods (BESAVOA and BESAVOA-OBL) are evaluated on 16 UCI datasets, based on the number of generations, number of iterations, execution time, and convergence. The results show that the BESAVOA-OBL fits better than the other algorithms. 
Compared with the other algorithms, BESAVOA-OBL is more effective by a margin of 12.42 percent.</p></div>","PeriodicalId":36280,"journal":{"name":"Annals of Data Science","volume":"12 2","pages":"605 - 637"},"PeriodicalIF":0.0,"publicationDate":"2024-04-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140692580","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
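Center-based metaheuristic clustering of the kind described above scores each candidate vector of cluster centers by a within-cluster distance objective. A minimal sketch of that fitness function follows; the BES/AVOA update rules themselves are not reproduced.

```python
def clustering_cost(points, centers):
    """Sum of squared Euclidean distances from each point to its
    nearest center: the fitness by which a candidate solution
    (a vector of cluster centers) is evaluated."""
    total = 0.0
    for p in points:
        total += min(sum((pi - ci) ** 2 for pi, ci in zip(p, c)) for c in centers)
    return total

# Two natural clusters near (0, 0) and (5, 5): centers placed on the
# clusters score lower (better) than centers placed between them.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
good = clustering_cost(points, [(0.05, 0.0), (5.0, 5.0)])
bad = clustering_cost(points, [(2.0, 2.0), (3.0, 3.0)])
```

A hybrid such as BESAVOA would repeatedly perturb candidate center vectors and keep those that reduce this cost.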
{"title":"Optimal Strategy for Elevated Estimation of Population Mean in Stratified Random Sampling under Linear Cost Function","authors":"Subhash Kumar Yadav, Mukesh Kumar Verma, Rahul Varshney","doi":"10.1007/s40745-024-00520-9","DOIUrl":"10.1007/s40745-024-00520-9","url":null,"abstract":"<div><p>In this paper, we propose the exponential ratio-type estimator for the elevated estimation of population mean, implying one auxiliary variable in stratified random sampling using the conventional ratio and, Bahl and Tuteja exponential ratio-type estimators. The bias and the Mean Squared Error (MSE) of the proposed estimator are derived up to a first-order approximation and compared with existing estimators. Theoretically, we also compare MSE of the proposed estimator using the linear cost function with the competing estimators. The optimal values of the characterizing scalars are obtained and for these optimal values of characterizing scalars, the minimum MSE is obtained. We find theoretically that the proposed estimator is more efficient than other estimators under restricted conditions by formulating the proposed problem as an optimization problem under linear cost function. The numerical illustration is also included to verify theoretical findings for their practical utility. 
The estimator with the least MSE is recommended for practical use across the various areas of application of stratified random sampling.</p></div>","PeriodicalId":36280,"journal":{"name":"Annals of Data Science","volume":"12 2","pages":"517 - 538"},"PeriodicalIF":0.0,"publicationDate":"2024-03-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://link.springer.com/content/pdf/10.1007/s40745-024-00520-9.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140364077","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
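The two classical building blocks named above, the conventional ratio estimator and the Bahl and Tuteja exponential ratio-type estimator, can be sketched for a single stratum as follows. The paper's proposed combined estimator and its characterizing scalars are not reproduced; this only illustrates the standard forms the proposal builds on.

```python
import math

def ratio_estimator(y_bar, x_bar, X_bar):
    """Classical ratio estimator of the population mean of y,
    using sample means y_bar, x_bar and known population mean X_bar."""
    return y_bar * X_bar / x_bar

def bahl_tuteja_estimator(y_bar, x_bar, X_bar):
    """Bahl and Tuteja exponential ratio-type estimator:
    y_bar * exp((X_bar - x_bar) / (X_bar + x_bar))."""
    return y_bar * math.exp((X_bar - x_bar) / (X_bar + x_bar))

# Illustrative values: the sample underestimates x (x_bar < X_bar),
# so both estimators adjust y_bar upward.
est_r = ratio_estimator(50.0, 20.0, 22.0)
est_bt = bahl_tuteja_estimator(50.0, 20.0, 22.0)
```

In the stratified setting, such estimators are computed per stratum and combined with stratum weights; the linear cost function then constrains how the sample sizes are allocated across strata.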
{"title":"Optimal Key Generation for Privacy Preservation in Big Data Applications Based on the Marine Predator Whale Optimization Algorithm","authors":"Poonam Samir Jadhav, Gautam M. Borkar","doi":"10.1007/s40745-024-00521-8","DOIUrl":"10.1007/s40745-024-00521-8","url":null,"abstract":"<div><p>In the era of big data, preserving data privacy has become paramount due to the sheer volume and sensitivity of the information being processed. This research is dedicated to safeguarding data privacy through a novel data sanitization approach centered on optimal key generation. Due to the size and complexity of the big data applications, managing big data with reduced risk and high privacyposes challenges. Many standard privacy-preserving mechanisms are introduced to maintain the volume and velocity of big data since it consists of massive and complex data. To solve this issue, this research developed a data sanitization technique for optimal key generation to preserve the privacy of the sensitive data. The sensitive data is initially identified by the quasi-identifiers and the identified sensitive data is preserved by generating an optimal key using the proposed marine predator whale optimization (MPWO) algorithm. The proposed algorithm is developed by the hybridization of the characteristics of foraging behaviors of the marine predators and the whales are hybridized to determine the optimal key. The optimal key generated using the MPWO algorithm effectively preserves the privacy of the data. The efficiency of the research is proved by measuring the metrics equivalent class size metric values of 0.03, 185.07, and 0.04 for attribute disclosure attack, identity disclosure attack, and identity disclosure attack. 
Similarly, the discernibility metric values are measured as 0.08, 123.38, and 0.09 for the attribute disclosure attack, identity disclosure attack, and identity disclosure attack, and the normalized certainty penalty is measured as 0.002, 61.69, and 0.001 for the attribute disclosure attack, identity disclosure attack, and identity disclosure attack.</p></div>","PeriodicalId":36280,"journal":{"name":"Annals of Data Science","volume":"12 2","pages":"539 - 569"},"PeriodicalIF":0.0,"publicationDate":"2024-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140225219","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Semiparametric Regression Analysis of Panel Count Data with Multiple Modes of Recurrence","authors":"Mathew P. M. Ashlin, P. G. Sankaran, E. P. Sreedevi","doi":"10.1007/s40745-024-00522-7","DOIUrl":"10.1007/s40745-024-00522-7","url":null,"abstract":"<div><p>Panel count data refers to the information collected in studies focusing on recurrent events, where subjects are observed only at specific time points. If these study subjects are exposed to recurrent events of several types, we obtain panel count data with multiple modes of recurrence. In this article, we present a novel method based on generalized estimating equations for the regression analysis of panel count data exposed to multiple modes of recurrence. A cause specific proportional mean model is developed to analyze the effect of covariates on the underlying counting process due to multiple modes of recurrence. We conduct a detailed investigation on the joint estimation of baseline cumulative mean functions and regression parameters. Simulation studies are carried out to evaluate the finite sample performance of the proposed estimators. The procedures are applied to two real data sets, to demonstrate the practical utility.\u0000</p></div>","PeriodicalId":36280,"journal":{"name":"Annals of Data Science","volume":"12 2","pages":"571 - 590"},"PeriodicalIF":0.0,"publicationDate":"2024-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140228641","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Applying BERT-Based NLP for Automated Resume Screening and Candidate Ranking","authors":"Asmita Deshmukh, Anjali Raut","doi":"10.1007/s40745-024-00524-5","DOIUrl":"10.1007/s40745-024-00524-5","url":null,"abstract":"<div><p>In this research, we introduce an innovative automated resume screening approach that leverages advanced Natural Language Processing (NLP) technology, specifically the Bidirectional Encoder Representations from Transformers (BERT) language model by Google. Our methodology involved collecting 200 resumes from participants with their consent and obtaining ten job descriptions from glassdoor.com for testing. We extracted keywords from the resumes, identified skill sets, and ranked them to focus on crucial attributes. After removing stop words and punctuation, we selected top keywords for analysis. To ensure data precision, we employed stemming and lemmatization to correct tense and meaning. Using the preinstalled BERT model and tokenizer, we generated feature vectors for job descriptions and resume keywords. Our key findings include the calculation of the highest similarity index for each resume, which enabled us to shortlist the most relevant candidates. Notably, the similarity index could reach up to 0.3, and the resume screening speed could reach 1 resume per second. The application of BERT-based NLP techniques significantly improved screening efficiency and accuracy, streamlining talent acquisition and providing valuable insights to HR personnel for informed decision-making. 
This study underscores the transformative potential of BERT in revolutionizing recruitment through scalable and powerful automated resume screening, demonstrating its efficacy in enhancing the precision and speed of candidate selection.</p></div>","PeriodicalId":36280,"journal":{"name":"Annals of Data Science","volume":"12 2","pages":"591 - 603"},"PeriodicalIF":0.0,"publicationDate":"2024-03-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143809281","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
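The ranking step described above reduces to cosine similarity between a job-description vector and each resume vector. A BERT-free sketch follows: `embed` here is a hypothetical bag-of-words stand-in for the BERT feature vectors the pipeline actually uses, so only the similarity-and-rank logic is illustrated.

```python
import math
from collections import Counter

def embed(text):
    """Hypothetical stand-in for a BERT feature vector: a simple
    bag-of-words count vector. A real pipeline would use the
    pretrained model's output embeddings instead."""
    return Counter(text.lower().split())

def cosine_similarity(u, v):
    """Cosine similarity between two sparse count vectors."""
    common = set(u) & set(v)
    dot = sum(u[w] * v[w] for w in common)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

job = embed("python machine learning engineer")
resumes = {
    "a": embed("java backend developer"),
    "b": embed("python machine learning researcher"),
}
# Rank candidates by similarity to the job description, highest first.
ranked = sorted(resumes, key=lambda k: cosine_similarity(job, resumes[k]), reverse=True)
```

Swapping `embed` for a real sentence-embedding model changes the vectors but not the ranking machinery.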
{"title":"Bayesian Inference for the Entropy of the Rayleigh Model Based on Ordered Ranked Set Sampling","authors":"Mohammed S. Kotb, Haidy A. Newer, Marwa M. Mohie El-Din","doi":"10.1007/s40745-024-00514-7","DOIUrl":"10.1007/s40745-024-00514-7","url":null,"abstract":"<div><p>Recently, ranked set samples schemes have become quite popular in reliability analysis and life-testing problems. Based on ordered ranked set sample, the Bayesian estimators and credible intervals for the entropy of the Rayleigh model are studied and compared with the corresponding estimators based on simple random sampling. These Bayes estimators for entropy are developed and computed with various loss functions, such as square error, linear-exponential, Al-Bayyati, and general entropy loss functions. A comparison study for various estimates of entropy based on mean squared error is done. A real-life data set and simulation are applied to illustrate our procedures.</p></div>","PeriodicalId":36280,"journal":{"name":"Annals of Data Science","volume":"11 4","pages":"1435 - 1458"},"PeriodicalIF":0.0,"publicationDate":"2024-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140427345","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}