Else-Tree Classifier for Minimizing Misclassification of Biological Data

2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Pub Date: 2018-12-01, DOI: 10.1109/BIBM.2018.8621322

Truong X. Tran, M. Pusey, R. S. Aygün


Citations: 7

Abstract

Misclassification has a high cost in biological research studies such as protein crystallization. For drug development, the 3D structure of a protein is obtained by first crystallizing the protein. Hence, missing a crystalline condition may hinder the development of a drug. It is therefore important to develop classification algorithms that avoid or minimize misclassifications. Traditional decision tree classifiers rely on an impurity measure to identify the most informative attribute, which is selected at the early levels of the tree, and they assign each leaf node the majority class label of the samples that reach it. We introduce a novel decision tree classifier, the else-tree, which analyzes the pure regions, or ranges, of an attribute per class. After the longest or most populated contiguous range is identified for each class, the remaining ranges are fed into the else branch of the decision tree. Only conflicting or doubtful samples are passed to the lower levels of the tree. The else-tree does not necessarily assign a class to samples that are difficult to classify. We evaluated the else-tree on our protein crystallization trials data and three other publicly available datasets. The experiments show that the else-tree may reduce misclassification to 0% by labeling difficult samples as undecided, provided that the training set is a good representation of the dataset.
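The range-selection idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: it handles a single numeric attribute at a single tree level, it uses the "most populated" criterion rather than the "longest" one, and it returns an undecided label for everything in the else branch instead of building lower levels. The names `pure_ranges`, `predict`, and `UNDECIDED` are hypothetical and chosen only for illustration.

```python
from itertools import groupby

UNDECIDED = "undecided"  # hypothetical label for samples left unclassified


def pure_ranges(values, labels):
    """Pick the most populated pure contiguous range per class (one attribute).

    Assumption: a distinct attribute value is "pure" when every training
    sample with that value has the same class label.
    """
    # Collect the labels observed at each distinct attribute value.
    by_value = {}
    for v, y in zip(values, labels):
        by_value.setdefault(v, []).append(y)

    # Tag each distinct value (ascending) with its class, or None on conflict.
    tagged = []
    for v in sorted(by_value):
        ys = by_value[v]
        cls = ys[0] if len(set(ys)) == 1 else None
        tagged.append((v, cls, len(ys)))

    # Consecutive values with the same class form one pure contiguous range.
    best = {}
    for cls, run in groupby(tagged, key=lambda t: t[1]):
        run = list(run)
        if cls is None:
            continue  # conflicting values belong to the else branch
        count = sum(n for _, _, n in run)
        lo, hi = run[0][0], run[-1][0]
        if cls not in best or count > best[cls][2]:
            best[cls] = (lo, hi, count)
    return best


def predict(ranges, x):
    """Return the class whose selected range contains x, else UNDECIDED."""
    for cls, (lo, hi, _) in ranges.items():
        if lo <= x <= hi:
            return cls
    # A full else-tree would pass x to a lower level; this sketch stops here.
    return UNDECIDED


# Toy data (made up for illustration, not from the paper).
values = [1, 2, 3, 4, 5, 6, 7, 8]
labels = ["a", "a", "a", "b", "a", "b", "b", "b"]
ranges = pure_ranges(values, labels)
print(ranges)
print([predict(ranges, x) for x in (2, 4, 5, 7)])
```

In this toy data the isolated samples at 4 and 5 fall outside the selected ranges, so they are reported as undecided rather than being assigned a majority label. That is the trade-off the abstract describes: fewer misclassifications at the cost of leaving some samples unlabeled.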

