Improved Bayesian Based Method for Classifying Disease Documents

2016 World Symposium on Computer Applications & Research (WSCAR). Pub Date: 2016-03-01. DOI: 10.1109/WSCAR.2016.26

H. Al-Mubaid, Mohamed Shenify


Citations: 4

Abstract



Naïve Bayes has proved to be a competitive learning and classification approach in many fields and is still being actively researched. We propose a Bayesian-based classification method for biomedical disease-related documents. The proposed method relies on the difference in class distribution between the presence and the absence of attributes. Specifically, in a simple inductive learning setting, the difference in class probability between the presence and the absence of feature f_j can be a good metric for the contribution of f_j to predicting the class. The proposed method works well with biomedical text abstracts because the attribute values (feature counts) of word features are not high. We found that heavily technical medical terms tend to occur with fairly low frequencies in these abstracts but contribute significantly to determining the class and subject of the document. The technique is therefore suitable for biomedical text mining: it promotes terms with low per-document frequency, and such terms play an important role in predicting the class of biomedical texts. The method was evaluated on seven datasets and compared, using accuracy and AUC, against the Bayesian method as the baseline; the results are encouraging, and the proposed method outperformed the baseline significantly. We also investigated the effect of terms with low average frequency and their contribution to classification accuracy.
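The presence-versus-absence metric mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract gives neither the exact formula nor how the metric enters the classifier, so the function name `presence_absence_gap`, the unsmoothed frequency estimates, and the toy corpus are all assumptions made for illustration.

```python
def presence_absence_gap(docs, labels, feature, target_class):
    """Estimate P(class | feature present) - P(class | feature absent).

    docs   : list of token lists, one per document
    labels : list of class labels, same length as docs
    Returns None when the feature occurs in every document or in none,
    because one of the two conditional probabilities is then undefined.
    (Hypothetical helper; plain relative frequencies, no smoothing.)
    """
    present = [lab for doc, lab in zip(docs, labels) if feature in doc]
    absent = [lab for doc, lab in zip(docs, labels) if feature not in doc]
    if not present or not absent:
        return None
    p_present = present.count(target_class) / len(present)
    p_absent = absent.count(target_class) / len(absent)
    return p_present - p_absent


# Toy corpus invented for illustration; it is not data from the paper.
docs = [
    ["patient", "insulin", "glucose"],
    ["patient", "glucose", "study"],
    ["patient", "tumor", "biopsy"],
    ["study", "tumor", "patient"],
]
labels = ["diabetes", "diabetes", "cancer", "cancer"]

gap_glucose = presence_absence_gap(docs, labels, "glucose", "diabetes")
gap_study = presence_absence_gap(docs, labels, "study", "diabetes")
gap_patient = presence_absence_gap(docs, labels, "patient", "diabetes")

for term, gap in [("glucose", gap_glucose), ("study", gap_study), ("patient", gap_patient)]:
    print(term, gap)
```

In this toy corpus, "glucose" appears only in the diabetes documents, so its gap is large, while "study" is spread evenly across both classes and its gap is zero. The metric depends only on whether a term is present, not on how often it is repeated, which is consistent with the abstract's point that terms with low per-document frequency can still be strong class indicators.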
