Support Vector Machine based Breast Cancer Classification using Next Generation Sequences

Babymol Kurian, V. Jyothi
Proceedings of the First International Conference on Advanced Scientific Innovation in Science, Engineering and Technology, ICASISET 2020, 16-17 May 2020, Chennai, India
DOI: 10.4108/EAI.16-5-2020.2303953 (https://doi.org/10.4108/EAI.16-5-2020.2303953)

Abstract

Next-generation sequencing (NGS) is essential to developing better approaches for predicting and curing diseases with a high success rate within a reasonable time. Modern techniques such as machine learning support medical research, from disease prediction to cure, with high speed and accuracy. In this paper, a supervised learning model, the Support Vector Machine (SVM), is applied to next-generation sequences to predict breast cancer. Ten basic features of each DNA sequence form the feature vector: the average counts of the individual nucleobases A, G, C and T, the AT content and GC content, the AT/GC composition, G-quadruplex occurrence, the ORF (Open Reading Frame) count, and the MR (Mutation Rate). The feature vectors, together with their class values, make up the dataset for supervised learning. Class values are '0' for normal sequences, '1' for BRCA1 cancer sequences and '2' for BRCA2 cancer sequences. Four datasets are prepared, with 50, 100, 150 and 200 sequences per class (normal, BRCA1 and BRCA2). As the dataset size increased, the outliers, distribution and scatter of the data were also analysed. Each dataset is split into training and testing sets in an 80:20 ratio, and an SVM model implemented in Python performs the supervised classification.
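The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and the abstract does not give exact feature definitions, so the following are assumptions: nucleobase "average counts" are taken as per-base fractions of sequence length, G-quadruplex occurrence is counted with a commonly used motif pattern (four runs of at least three G separated by loops of 1 to 7 bases), and an ORF is an `ATG` followed in-frame by a stop codon on the forward strand. The mutation rate is omitted because it needs a reference sequence that the abstract does not specify. The names `sequence_features`, `count_orfs` and `train_svm` are hypothetical, and the RBF kernel, feature scaling and stratified split are illustrative choices rather than details from the paper.

```python
import re

# Assumed G-quadruplex motif: four G-runs (>= 3 G) separated by 1-7 base loops.
G4_PATTERN = re.compile(
    r"G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}"
)
STOP_CODONS = {"TAA", "TAG", "TGA"}


def count_orfs(seq):
    """Count forward-strand ORFs: an ATG followed in-frame by a stop codon."""
    count = 0
    for frame in range(3):
        in_orf = False
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if not in_orf and codon == "ATG":
                in_orf = True
            elif in_orf and codon in STOP_CODONS:
                count += 1
                in_orf = False
    return count


def sequence_features(seq):
    """Return 9 of the 10 features named in the abstract (mutation rate omitted)."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("empty sequence")
    frac = {base: seq.count(base) / n for base in "AGCT"}
    at = frac["A"] + frac["T"]
    gc = frac["G"] + frac["C"]
    at_gc_ratio = at / gc if gc > 0 else 0.0  # guard against GC-free input
    g4_count = len(G4_PATTERN.findall(seq))   # non-overlapping motif matches
    return [
        frac["A"], frac["G"], frac["C"], frac["T"],
        at, gc, at_gc_ratio,
        float(g4_count), float(count_orfs(seq)),
    ]


def train_svm(sequences, labels):
    """Fit an SVM on an 80:20 split; requires scikit-learn to be installed."""
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    X = [sequence_features(s) for s in sequences]
    X_train, X_test, y_train, y_test = train_test_split(
        X, labels, test_size=0.2, stratify=labels, random_state=0
    )
    # Kernel and scaling are assumptions; the abstract does not state them.
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    model.fit(X_train, y_train)
    return model, model.score(X_test, y_test)
```

Calling `train_svm(sequences, labels)` with a list of DNA strings and their class values (0, 1 or 2) would return the fitted model and its accuracy on the held-out 20%. Only `train_svm` depends on scikit-learn; the feature extraction uses the standard library alone.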