A. Choudhary, Preeti Jha, Aruna Tiwari, Neha Bharill, M. Ratnaparkhe
{"title":"Scalable Fuzzy Clustering-based Regression to Predict the Isoelectric Points of the Plant Protein Sequences using Apache Spark","authors":"A. Choudhary, Preeti Jha, Aruna Tiwari, Neha Bharill, M. Ratnaparkhe","doi":"10.1109/FUZZ45933.2021.9494447","DOIUrl":null,"url":null,"abstract":"Learning in non-stationary environments require modern tools and algorithms to quickly adapt to the new pattern because concept drift can change the underlying distribution. So, the existing assumption that the data is independent and identically distributed may be invalid in data stream scenarios. Given the massive volume of high-speed data streams and the concept drift, traditional machine learning algorithms must be self-adapting. One of the difficulties in handling regression tasks is the complexities of equations for the regression models when combined with drift handling techniques. The high dimensional protein data is a major challenge for bioinformatics researchers to analyse the dynamics of the sequences. This paper proposes a Scalable Fuzzy Clustering induced Regression (SFC-R) algorithm to predict the isoelectric point of the plant protein sequences using Apache Spark clusters. The SFC-R algorithm uses the input features extracted from the plant protein sequences and validates performance in terms of mean squared error (MAE) and root-mean-square error (RMSE). Experiments on plant protein datasets are carried out to validate the high accuracy and robustness of our approach.","PeriodicalId":151289,"journal":{"name":"2021 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE)","volume":"73 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/FUZZ45933.2021.9494447","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
Learning in non-stationary environments require modern tools and algorithms to quickly adapt to the new pattern because concept drift can change the underlying distribution. So, the existing assumption that the data is independent and identically distributed may be invalid in data stream scenarios. Given the massive volume of high-speed data streams and the concept drift, traditional machine learning algorithms must be self-adapting. One of the difficulties in handling regression tasks is the complexities of equations for the regression models when combined with drift handling techniques. The high dimensional protein data is a major challenge for bioinformatics researchers to analyse the dynamics of the sequences. This paper proposes a Scalable Fuzzy Clustering induced Regression (SFC-R) algorithm to predict the isoelectric point of the plant protein sequences using Apache Spark clusters. The SFC-R algorithm uses the input features extracted from the plant protein sequences and validates performance in terms of mean squared error (MAE) and root-mean-square error (RMSE). Experiments on plant protein datasets are carried out to validate the high accuracy and robustness of our approach.