A. Choudhary, Preeti Jha, Aruna Tiwari, Neha Bharill, M. Ratnaparkhe
2021 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), published 2021-07-11. DOI: 10.1109/FUZZ45933.2021.9494447 (https://doi.org/10.1109/FUZZ45933.2021.9494447)
Scalable Fuzzy Clustering-based Regression to Predict the Isoelectric Points of the Plant Protein Sequences using Apache Spark
Learning in non-stationary environments requires tools and algorithms that adapt quickly to new patterns, because concept drift can change the underlying data distribution. The usual assumption that data are independent and identically distributed may therefore not hold in data stream scenarios. Given the massive volume of high-speed data streams and the presence of concept drift, traditional machine learning algorithms must become self-adapting. One difficulty in regression tasks is that the model equations become complex when combined with drift-handling techniques. High-dimensional protein data pose a further challenge for bioinformatics researchers analysing the dynamics of sequences. This paper proposes a Scalable Fuzzy Clustering induced Regression (SFC-R) algorithm that predicts the isoelectric point of plant protein sequences on Apache Spark clusters. SFC-R takes as input features extracted from the plant protein sequences, and its performance is evaluated in terms of mean absolute error (MAE) and root-mean-square error (RMSE). Experiments on plant protein datasets are carried out to validate the high accuracy and robustness of our approach.
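The two error measures named in the abstract can be illustrated with a minimal sketch. It assumes MAE stands for mean absolute error, the usual expansion of that abbreviation, and it is not taken from the paper's implementation. The function names and the isoelectric point values are hypothetical, chosen only to show the arithmetic.

```python
import math


def mean_absolute_error(y_true, y_pred):
    """Average of |observed - predicted| over all samples."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_square_error(y_true, y_pred):
    """Square root of the average squared difference."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_sq = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mean_sq)


# Hypothetical observed and predicted isoelectric points (illustrative only).
observed_pi = [5.0, 7.0, 9.0, 6.0]
predicted_pi = [5.5, 6.5, 9.0, 7.0]

mae = mean_absolute_error(observed_pi, predicted_pi)
rmse = root_mean_square_error(observed_pi, predicted_pi)
print(f"MAE = {mae:.4f}, RMSE = {rmse:.4f}")
```

RMSE squares each error before averaging, so it penalizes large individual errors more heavily than MAE and is never smaller than MAE on the same data. Reporting both gives a rough sense of how unevenly the errors are spread.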