Else-Tree Classifier for Minimizing Misclassification of Biological Data

Truong X. Tran, M. Pusey, R. S. Aygün
Published in: 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2018-12-01
DOI: 10.1109/BIBM.2018.8621322 (https://doi.org/10.1109/BIBM.2018.8621322)
Citations: 7

Abstract

Misclassification has a high cost in biological research studies such as protein crystallization. For drug development, the 3D structure of a protein is obtained by first crystallizing the protein, so missing a crystalline condition may hinder the development of a drug. It is therefore important to develop classification algorithms that avoid or minimize misclassifications. Traditional decision tree classifiers rely on an impurity measure to identify the most informative attribute, which is selected at the early levels of the tree, and the class label at a leaf node is chosen by the majority of the class labels at that node. We introduce a novel decision tree classifier, else-tree, that analyzes the pure regions or ranges of an attribute per class. After the longest or most populated contiguous range is identified for each class, the remaining ranges are fed into the else branch of the decision tree. Only conflicting or doubtful samples are passed to the lower levels of the tree. The else-tree does not necessarily assign a class to samples that are difficult to classify. We used our protein crystallization trials data and three other publicly available datasets to evaluate else-tree. The experiments show that the else-tree may reduce the misclassification to 0% by labeling difficult samples as undecided when the training set is a good representation of the dataset.
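The range-based idea in the abstract can be illustrated with a minimal, single-attribute sketch. This is not the authors' implementation: the function names (`pure_runs`, `best_range_per_class`, `classify`), the choice of "most populated" as the selection criterion, and the handling of tied values are all assumptions made for illustration. The abstract describes passing the remaining samples to lower levels of the tree; this sketch stops at one level and simply returns an undecided label for anything that would fall into the else branch.

```python
from itertools import groupby

UNDECIDED = "undecided"  # label for samples that fall into the else branch


def pure_runs(values, labels):
    """Return (label, low, high, count) for each maximal contiguous pure run.

    Assumption: samples sharing the same attribute value but carrying
    different labels make that value impure, and it breaks a run.
    Labels must not be None, which is used internally to mark impurity.
    """
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    groups = []
    for value, grp in groupby(pairs, key=lambda p: p[0]):
        labs = [lab for _, lab in grp]
        pure_label = labs[0] if len(set(labs)) == 1 else None
        groups.append((value, pure_label, len(labs)))

    runs = []
    for lab, run in groupby(groups, key=lambda g: g[1]):
        run = list(run)
        if lab is not None:  # skip stretches of impure values
            runs.append((lab, run[0][0], run[-1][0], sum(g[2] for g in run)))
    return runs


def best_range_per_class(values, labels):
    """Keep the most populated pure run for each class (illustrative choice)."""
    best = {}
    for lab, low, high, count in pure_runs(values, labels):
        if lab not in best or count > best[lab][2]:
            best[lab] = (low, high, count)
    return best


def classify(x, ranges):
    """Return the class whose selected range contains x, else UNDECIDED."""
    for lab, (low, high, _) in ranges.items():
        if low <= x <= high:
            return lab
    return UNDECIDED


# Toy data: one attribute, two classes, with a conflicting middle region.
values = [1, 2, 3, 4, 5, 6, 7, 8]
labels = ["a", "a", "a", "b", "a", "b", "b", "b"]
ranges = best_range_per_class(values, labels)
```

On this toy data the selected ranges are 1 to 3 for class "a" and 6 to 8 for class "b", so `classify(2, ranges)` yields "a", `classify(7, ranges)` yields "b", and values in the conflicting region such as 4 or 5 are reported as undecided rather than forced into a class. A fuller version would recurse on the undecided samples using the remaining attributes.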