Research on early risk predictive model and discriminative feature selection of cancer based on real-world routine physical examination data

2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) Pub Date : 2016-12-01 DOI:10.1109/BIBM.2016.7822746

Guixia Kang, Zhuang Ni

{"title":"Research on early risk predictive model and discriminative feature selection of cancer based on real-world routine physical examination data","authors":"Guixia Kang, Zhuang Ni","doi":"10.1109/BIBM.2016.7822746","DOIUrl":null,"url":null,"abstract":"most cancers at early stages show no obvious symptoms and curative treatment is not an option any more when cancer is diagnosed. Therefore, making accurate predictions for the risk of early cancer has become urgently necessary in the field of medicine. In this paper, our purpose is to fully utilize real-world routine physical examination data to analyze the most discriminative features of cancer based on ReliefF algorithm and generate early risk predictive model of cancer taking advantage of three machine learning (ML) algorithms. We use physical examination data with a return visit followed 1 month later derived from CiMing Health Checkup Center. The ReliefF algorithm selects the top 30 features written as Sub(30) based on weight value from our data collections consisting of 34 features and 2300 candidates. The 4-layer (2 hidden layers) deep neutral network (DNN) based on B-P algorithm, the support machine vector with the linear kernel and decision tree CART are proposed for predicting the risk of cancer by 5-fold cross validation. We implement these criteria such as predictive accuracy, AUC-ROC, sensitivity and specificity to identify the discriminative ability of three proposed method for cancer. The results show that compared with the other two methods, SVM obtains higher AUC and specificity of 0.926 and 95.27%, respectively. The superior predictive accuracy (86%) is achieved by DNN. Moreover, the fuzzy interval of threshold in DNN is proposed and the sensitivity, specificity and accuracy of DNN is 90.20%, 94.22% and 93.22%, respectively, using the revised threshold interval. The research indicates that the application of ML methods together with risk feature selection based on real-world routine physical examination data is meaningful and promising in the area of cancer prediction.","PeriodicalId":345384,"journal":{"name":"2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/BIBM.2016.7822746","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 5

Abstract

most cancers at early stages show no obvious symptoms and curative treatment is not an option any more when cancer is diagnosed. Therefore, making accurate predictions for the risk of early cancer has become urgently necessary in the field of medicine. In this paper, our purpose is to fully utilize real-world routine physical examination data to analyze the most discriminative features of cancer based on ReliefF algorithm and generate early risk predictive model of cancer taking advantage of three machine learning (ML) algorithms. We use physical examination data with a return visit followed 1 month later derived from CiMing Health Checkup Center. The ReliefF algorithm selects the top 30 features written as Sub(30) based on weight value from our data collections consisting of 34 features and 2300 candidates. The 4-layer (2 hidden layers) deep neutral network (DNN) based on B-P algorithm, the support machine vector with the linear kernel and decision tree CART are proposed for predicting the risk of cancer by 5-fold cross validation. We implement these criteria such as predictive accuracy, AUC-ROC, sensitivity and specificity to identify the discriminative ability of three proposed method for cancer. The results show that compared with the other two methods, SVM obtains higher AUC and specificity of 0.926 and 95.27%, respectively. The superior predictive accuracy (86%) is achieved by DNN. Moreover, the fuzzy interval of threshold in DNN is proposed and the sensitivity, specificity and accuracy of DNN is 90.20%, 94.22% and 93.22%, respectively, using the revised threshold interval. The research indicates that the application of ML methods together with risk feature selection based on real-world routine physical examination data is meaningful and promising in the area of cancer prediction.

查看原文本刊更多论文

基于真实体检数据的癌症早期风险预测模型及判别特征选择研究

大多数癌症在早期阶段没有明显的症状，当癌症被诊断出来时，治愈性治疗不再是一种选择。因此，对早期癌症的风险进行准确的预测已成为医学领域的迫切需要。在本文中，我们的目的是充分利用真实世界的常规体检数据，基于ReliefF算法分析癌症最具判别性的特征，并利用三种机器学习(ML)算法生成癌症早期风险预测模型。我们使用慈明健康体检中心1个月后复诊的体检数据。ReliefF算法根据权重值从我们的数据集合(包含34个特征和2300个候选特征)中选择前30个写为Sub(30)的特征。提出了基于B-P算法的4层(2层隐藏层)深度神经网络(DNN)、线性核支持机向量和决策树CART，通过5次交叉验证预测癌症风险。我们运用预测准确度、AUC-ROC、敏感性和特异性等标准来鉴定三种方法对癌症的鉴别能力。结果表明，与其他两种方法相比，SVM的AUC和特异度分别为0.926和95.27%。DNN的预测准确率高达86%。提出了深度神经网络阈值的模糊区间，采用修正后的阈值区间，深度神经网络的灵敏度、特异性和准确性分别为90.20%、94.22%和93.22%。研究表明，将机器学习方法与基于真实常规体检数据的风险特征选择相结合，在癌症预测领域具有重要意义和前景。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)

自引率

0.00%

发文量