Annals of Data Science: Latest Articles

Predicting the Functional Changes in Protein Mutations Through the Application of BiLSTM and the Self-Attention Mechanism
Annals of Data Science Pub Date: 2024-04-25 DOI: 10.1007/s40745-024-00530-7
Zixuan Fan, Yan Xu
Volume 11, Issue 3, Pages 1077-1094 (Journal Article)
Abstract: In the field of bioinformatics, changes in protein functionality are mainly influenced by protein mutations. Accurately predicting these functional changes can enhance our understanding of evolutionary mechanisms, promote developments in protein engineering-related fields, and accelerate progress in medical research. In this study, we introduced two different models: one based on bidirectional long short-term memory (BiLSTM), and the other based on self-attention. These models were integrated using a weighted fusion method to predict protein functional changes associated with mutation sites. The findings indicate that the model matches the current model in both predictive precision and capacity for generalization. Furthermore, the ensemble model surpasses the performance of the single models, highlighting the value of utilizing their synergistic capabilities. This finding may improve the accuracy of predicting protein functional changes associated with mutations and has potential applications in protein engineering and drug research. We evaluated the efficacy of our models under different scenarios by comparing the predicted results of protein functional changes across various numbers of mutation sites. As the number of mutation sites increases, the prediction accuracy decreases significantly, highlighting the inherent limitations of these models in handling cases involving more mutation sites.
Citations: 0
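The abstract above mentions a weighted fusion of two models but does not say how the weights are chosen. As a minimal sketch, assuming a simple convex combination of per-sample scores, the idea might look like the following; the function name `weighted_fusion` and the parameter `weight` are hypothetical and not taken from the paper.

```python
def weighted_fusion(p_bilstm, p_attention, weight):
    """Blend two models' predicted scores with a convex weight in [0, 1].

    Assumption: both inputs are equal-length lists of per-sample scores.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    if len(p_bilstm) != len(p_attention):
        raise ValueError("prediction lists must have equal length")
    # weight applies to the first model, (1 - weight) to the second
    return [weight * a + (1.0 - weight) * b
            for a, b in zip(p_bilstm, p_attention)]


# Equal weighting of two hypothetical score lists
fused = weighted_fusion([0.2, 0.8], [0.4, 0.6], weight=0.5)
```

In practice the weight would typically be tuned on a validation set; that step is omitted here.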
Research on Intelligent Courses in English Education based on Neural Networks
Annals of Data Science Pub Date: 2024-04-25 DOI: 10.1007/s40745-024-00528-1
Huimin Yao, Haiyan Wang
Volume 11, Issue 3, Pages 1095-1107 (Journal Article)
Abstract: Accurately predicting students' performance plays a crucial role in achieving the intellectualization of courses. This paper studied intelligent courses in English education based on neural networks and designed a firefly algorithm-back propagation neural network (FA-BPNN) method. The correlation between various features and final grades was calculated using the students' online learning data. Features with higher correlation were selected as the input for the FA-BPNN algorithm to estimate the final score that students achieved in the "College English" course. It was found that the training time of the FA-BPNN algorithm was 3.42 s, and that its root-mean-square error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE) values were 0.986, 0.622, and 0.205, respectively. These were lower than those of the BPNN, genetic algorithm (GA)-BPNN, and particle swarm optimization (PSO)-BPNN algorithms, as well as the adaptive neuro-fuzzy inference system approach. The results indicated the efficacy of the FA for optimizing the parameters of the BPNN algorithm. The comparison between the predicted results and actual values suggested that the average error of the FA-BPNN algorithm was only 0.5, the smallest of the methods compared. The experimental results demonstrate the reliability of the FA-BPNN algorithm for performance prediction and its practical application feasibility.
Citations: 0
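The three error measures named in the abstract above have standard textbook definitions, sketched below. This is only an illustration of those definitions, not the authors' evaluation code; the function name `regression_errors` is hypothetical, and MAPE is returned as a fraction rather than a percentage, which is an assumption about scaling.

```python
import math


def regression_errors(actual, predicted):
    """Return (RMSE, MAE, MAPE) for two equal-length lists of numbers.

    Assumption: no actual value is zero, since MAPE divides by it.
    """
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    rmse = math.sqrt(sum(e * e for e in errors) / n)   # root-mean-square error
    mae = sum(abs(e) for e in errors) / n              # mean absolute error
    mape = sum(abs(e / a) for e, a in zip(errors, actual)) / n  # as a fraction
    return rmse, mae, mape


# Two made-up final grades and their predictions
rmse, mae, mape = regression_errors([100.0, 80.0], [90.0, 88.0])
```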
Half Logistic Generalized Rayleigh Distribution for Modeling Hydrological Data
Annals of Data Science Pub Date: 2024-04-18 DOI: 10.1007/s40745-024-00527-2
Adebisi A. Ogunde, Subhankar Dutta, Ehab M. Almetawally
Volume 12, Issue 2, Pages 667-694 (Journal Article)
Abstract: This article introduced a three-parameter extension of the Generalized Rayleigh distribution called the half-logistic Generalized Rayleigh distribution, which has the Generalized Rayleigh and Rayleigh distributions as submodels. The proposed model is quite flexible and adaptable to model any kind of lifetime data. Its probability density function may sometimes be unimodal, and its corresponding hazard rate may be of monotone or non-monotone shape. Standard statistical properties such as its ordinary and incomplete moments, quantile function, moment generating function, reliability function, stochastic ordering, order statistics, and Rényi and \(\boldsymbol{\delta}\)-entropy are obtained. The maximum likelihood method is used to obtain the estimates of the model parameters. Two practical examples of hydrological data sets are presented.
Citations: 0
One-Inflated Zero-Truncated Poisson Distribution: Statistical Properties and Real Life Applications
Annals of Data Science Pub Date: 2024-04-17 DOI: 10.1007/s40745-024-00526-3
Mohammad Kafeel Wani, Peer Bilal Ahmad
Volume 12, Issue 2, Pages 639-666 (Journal Article)
Abstract: Agriculture, engineering, public health, sociology, psychology, and epidemiology are just a few of the numerous disciplines that find analysis and modeling of zero-truncated count data to be of paramount importance. Very recently, researchers have been paying careful attention to the one-inflation implications of these zero-truncated count statistics. In this regard, we have studied the one-inflated variant of the zero-truncated Poisson distribution. There are few models within the proposed distribution, which itself is a representation of a two-part process. We have calculated crucial statistical characteristics of the suggested model, which are not confined to generating functions, moments, and associated measures. The parametric estimation has been carried out using maximum likelihood estimation. Two different simulation studies have been carried out, one to test the performance of the maximum likelihood estimates and the other to test the compatibility of our devised model when data has been simulated from different competing models with considerably higher mass at point one. For the purpose of testing the compatibility of our proposed model, we have used three real life data sets and considered theoretical as well as graphical performance measures. The fitting results have been compared with some other models of interest. Moreover, we have used three different test statistics, viz. the likelihood ratio test, Wald's test, and Rao's efficient score test, for the purpose of testing the significance of the one-inflation parameter.
Citations: 0
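The abstract above does not state the probability mass function. A minimal sketch, assuming the commonly used two-part mixture form (an extra point mass `omega` at one, mixed with a zero-truncated Poisson with rate `lam`), could be written as follows; the paper's own parameterization may differ, and the names `oiztp_pmf`, `lam`, and `omega` are hypothetical.

```python
import math


def oiztp_pmf(x, lam, omega):
    """P(X = x) for a one-inflated zero-truncated Poisson (assumed mixture form).

    With probability omega the value is exactly 1; otherwise it is drawn from
    a Poisson(lam) distribution conditioned on being at least 1.
    """
    if x < 1:
        return 0.0
    # Zero-truncated Poisson mass: lam^x e^{-lam} / (x! (1 - e^{-lam}))
    ztp = lam ** x * math.exp(-lam) / (math.factorial(x) * (1.0 - math.exp(-lam)))
    inflation = omega if x == 1 else 0.0
    return inflation + (1.0 - omega) * ztp


# The masses over x = 1, 2, ... should add up to (numerically) one
total = sum(oiztp_pmf(x, lam=2.0, omega=0.3) for x in range(1, 60))
```

Setting `omega` to zero recovers the plain zero-truncated Poisson, which is why a test of the one-inflation parameter amounts to testing `omega = 0` in this form.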
An Improved Boosting Bald Eagle Search Algorithm with Improved African Vultures Optimization Algorithm for Data Clustering
Annals of Data Science Pub Date: 2024-04-17 DOI: 10.1007/s40745-024-00525-4
Farhad Soleimanian Gharehchopogh
Volume 12, Issue 2, Pages 605-637 (Journal Article)
Abstract: Data clustering is one of the main issues in the optimization problem. It is the process of clustering a group of items into several groups. Items within each group have the greatest similarity to one another and the least similarity to items in other groups. It is employed in various domains and applications, including biology, business and consumer analysis, document clustering, the web, banking, and image processing, to name a few. In this paper, two new methods are proposed for data clustering: a hybridization of the Bald Eagle Search (BES) Algorithm with the African Vultures Optimization Algorithm (AVOA) (BESAVOA), and BESAVOA with Opposition Based Learning (BESAVOA-OBL). AVOA is used to find the centers of the clusters and improve the centrality of the groups obtained by the BES algorithm. Primary vectors are created based on the population of eagles, and each vector is then used by BESAVOA to search for the centers of the clusters. The proposed methods (BESAVOA and BESAVOA-OBL) are evaluated on 16 UCI datasets, based on the number of generations, number of iterations, execution time, and convergence. The results show that BESAVOA-OBL fits better than the other algorithms and is more effective than them by 12.42 percent.
Citations: 0
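Opposition Based Learning, mentioned in the abstract above, is usually described as evaluating the mirror image of a candidate within its search bounds. The sketch below shows that basic step under this standard definition; how the paper integrates it into BESAVOA is not stated in the abstract, and the function name `opposite_solution` is hypothetical.

```python
def opposite_solution(position, lower, upper):
    """Return the opposite point of a candidate within box bounds.

    For each dimension the opposite of x in [lo, hi] is lo + hi - x.
    """
    return [lo + hi - x for x, lo, hi in zip(position, lower, upper)]


# A two-dimensional candidate inside the box [0, 10] x [-5, 5]
opposite = opposite_solution([1.0, -2.0], lower=[0.0, -5.0], upper=[10.0, 5.0])
```

A typical use is to evaluate both the candidate and its opposite and keep whichever has the better fitness.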
Optimal Strategy for Elevated Estimation of Population Mean in Stratified Random Sampling under Linear Cost Function
Annals of Data Science Pub Date: 2024-03-30 DOI: 10.1007/s40745-024-00520-9
Subhash Kumar Yadav, Mukesh Kumar Verma, Rahul Varshney
Volume 12, Issue 2, Pages 517-538 (Journal Article)
Open-access PDF: https://link.springer.com/content/pdf/10.1007/s40745-024-00520-9.pdf
Abstract: In this paper, we propose an exponential ratio-type estimator for the elevated estimation of the population mean, using one auxiliary variable in stratified random sampling and building on the conventional ratio estimator and the Bahl and Tuteja exponential ratio-type estimator. The bias and the Mean Squared Error (MSE) of the proposed estimator are derived up to a first-order approximation and compared with existing estimators. Theoretically, we also compare the MSE of the proposed estimator under the linear cost function with that of the competing estimators. The optimal values of the characterizing scalars are obtained, and the minimum MSE is obtained for these optimal values. By formulating the problem as an optimization problem under a linear cost function, we find theoretically that the proposed estimator is more efficient than other estimators under restricted conditions. A numerical illustration is also included to verify the theoretical findings for their practical utility. The estimator with the least MSE is recommended for practical utility in different areas of application of stratified random sampling.
Citations: 0
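To illustrate what a linear cost function means in stratified sampling, the sketch below uses the classical textbook allocation for the ordinary stratified mean, where total cost is an overhead plus a per-unit cost in each stratum and stratum sample sizes are proportional to W_h S_h / sqrt(c_h). This is not the estimator proposed in the paper; the function and parameter names are hypothetical.

```python
import math


def cost_optimal_allocation(weights, std_devs, unit_costs, budget, overhead=0.0):
    """Classical variance-minimizing allocation under a linear cost function.

    Cost model (assumed): budget = overhead + sum(c_h * n_h).
    Returns real-valued stratum sample sizes n_h; rounding is left to the caller.
    """
    strata = list(zip(weights, std_devs, unit_costs))
    # n_h is proportional to W_h * S_h / sqrt(c_h)
    shares = [w * s / math.sqrt(c) for w, s, c in strata]
    # Scaling constant chosen so the allocation spends exactly the budget
    scale = sum(w * s * math.sqrt(c) for w, s, c in strata)
    return [(budget - overhead) * share / scale for share in shares]


# Two equally sized strata; the second is more variable but costlier to sample
sizes = cost_optimal_allocation([0.5, 0.5], [10.0, 20.0], [1.0, 4.0], budget=100.0)
spent = sum(c * n for c, n in zip([1.0, 4.0], sizes))
```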
Optimal Key Generation for Privacy Preservation in Big Data Applications Based on the Marine Predator Whale Optimization Algorithm
Annals of Data Science Pub Date: 2024-03-20 DOI: 10.1007/s40745-024-00521-8
Poonam Samir Jadhav, Gautam M. Borkar
Volume 12, Issue 2, Pages 539-569 (Journal Article)
Abstract: In the era of big data, preserving data privacy has become paramount due to the sheer volume and sensitivity of the information being processed. This research is dedicated to safeguarding data privacy through a novel data sanitization approach centered on optimal key generation. Due to the size and complexity of big data applications, managing big data with reduced risk and high privacy poses challenges. Many standard privacy-preserving mechanisms are introduced to maintain the volume and velocity of big data since it consists of massive and complex data. To solve this issue, this research developed a data sanitization technique for optimal key generation to preserve the privacy of the sensitive data. The sensitive data is initially identified by the quasi-identifiers, and the identified sensitive data is preserved by generating an optimal key using the proposed marine predator whale optimization (MPWO) algorithm. The proposed algorithm hybridizes the foraging behaviors of marine predators and whales to determine the optimal key. The optimal key generated using the MPWO algorithm effectively preserves the privacy of the data. The efficiency of the research is shown by equivalent class size metric values of 0.03, 185.07, and 0.04 for attribute disclosure attack, identity disclosure attack, and identity disclosure attack. Similarly, the discernibility metric is measured as 0.08, 123.38, and 0.09 with attribute disclosure attack, identity disclosure attack, and identity disclosure attack, and the normalized certainty penalty is measured as 0.002, 61.69, and 0.001 with attribute disclosure attack, identity disclosure attack, and identity disclosure attack.
Citations: 0
Semiparametric Regression Analysis of Panel Count Data with Multiple Modes of Recurrence
Annals of Data Science Pub Date: 2024-03-19 DOI: 10.1007/s40745-024-00522-7
Mathew P. M. Ashlin, P. G. Sankaran, E. P. Sreedevi
Volume 12, Issue 2, Pages 571-590 (Journal Article)
Abstract: Panel count data refers to the information collected in studies focusing on recurrent events, where subjects are observed only at specific time points. If these study subjects are exposed to recurrent events of several types, we obtain panel count data with multiple modes of recurrence. In this article, we present a novel method based on generalized estimating equations for the regression analysis of panel count data exposed to multiple modes of recurrence. A cause-specific proportional mean model is developed to analyze the effect of covariates on the underlying counting process due to multiple modes of recurrence. We conduct a detailed investigation of the joint estimation of baseline cumulative mean functions and regression parameters. Simulation studies are carried out to evaluate the finite sample performance of the proposed estimators. The procedures are applied to two real data sets to demonstrate their practical utility.
Citations: 0
Applying BERT-Based NLP for Automated Resume Screening and Candidate Ranking
Annals of Data Science Pub Date: 2024-03-08 DOI: 10.1007/s40745-024-00524-5
Asmita Deshmukh, Anjali Raut
Volume 12, Issue 2, Pages 591-603 (Journal Article)
Abstract: In this research, we introduce an innovative automated resume screening approach that leverages advanced Natural Language Processing (NLP) technology, specifically the Bidirectional Encoder Representations from Transformers (BERT) language model by Google. Our methodology involved collecting 200 resumes from participants with their consent and obtaining ten job descriptions from glassdoor.com for testing. We extracted keywords from the resumes, identified skill sets, and ranked them to focus on crucial attributes. After removing stop words and punctuation, we selected top keywords for analysis. To ensure data precision, we employed stemming and lemmatization to correct tense and meaning. Using the preinstalled BERT model and tokenizer, we generated feature vectors for job descriptions and resume keywords. Our key findings include the calculation of the highest similarity index for each resume, which enabled us to shortlist the most relevant candidates. Notably, the similarity index could reach up to 0.3, and the resume screening speed could reach 1 resume per second. The application of BERT-based NLP techniques significantly improved screening efficiency and accuracy, streamlining talent acquisition and providing valuable insights to HR personnel for informed decision-making. This study underscores the transformative potential of BERT in revolutionizing recruitment through scalable and powerful automated resume screening, demonstrating its efficacy in enhancing the precision and speed of candidate selection.
Citations: 0
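The abstract above refers to a "similarity index" between feature vectors without defining it. A minimal sketch, assuming cosine similarity and using plain Python lists in place of real BERT embeddings, might look like the following; the function names `cosine_similarity` and `best_match` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def best_match(resume_vec, job_vecs):
    """Return (index, score) of the job vector most similar to the resume vector."""
    scores = [cosine_similarity(resume_vec, job) for job in job_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Toy two-dimensional vectors standing in for embeddings
index, score = best_match([1.0, 0.0], [[0.0, 1.0], [1.0, 1.0]])
```

Ranking candidates for one job would then amount to sorting resumes by their score against that job's vector.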
Bayesian Inference for the Entropy of the Rayleigh Model Based on Ordered Ranked Set Sampling
Annals of Data Science Pub Date: 2024-02-27 DOI: 10.1007/s40745-024-00514-7
Mohammed S. Kotb, Haidy A. Newer, Marwa M. Mohie El-Din
Volume 11, Issue 4, Pages 1435-1458 (Journal Article)
Abstract: Recently, ranked set sampling schemes have become quite popular in reliability analysis and life-testing problems. Based on an ordered ranked set sample, the Bayesian estimators and credible intervals for the entropy of the Rayleigh model are studied and compared with the corresponding estimators based on simple random sampling. These Bayes estimators for entropy are developed and computed with various loss functions, such as squared error, linear-exponential, Al-Bayyati, and general entropy loss functions. A comparison study of the various estimates of entropy based on mean squared error is done. A real-life data set and simulation are applied to illustrate our procedures.
Citations: 0
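For orientation, the quantity being estimated in the abstract above has a closed form. The sketch below assumes the common scale parameterization f(x) = (x / sigma^2) exp(-x^2 / (2 sigma^2)), for which the differential entropy is 1 + ln(sigma / sqrt(2)) + gamma / 2, with gamma the Euler-Mascheroni constant. The paper may use a different parameterization, and the name `rayleigh_entropy` is hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def rayleigh_entropy(sigma):
    """Differential entropy (in nats) of a Rayleigh distribution with scale sigma."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return 1.0 + math.log(sigma / math.sqrt(2.0)) + EULER_GAMMA / 2.0


# Entropy at unit scale
h_unit = rayleigh_entropy(1.0)
```

Because the entropy is a monotone function of the scale, an estimate of the scale parameter translates directly into an estimate of the entropy under this form.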