Effect of Various Data Preprocessing in Sequence Embedding-Based Machine Learning for Human-Virus PPI Classification

2021 4th International Conference of Computer and Informatics Engineering (IC2IE). Pub Date: 2021-09-14. DOI: 10.1109/ic2ie53219.2021.9649426

Fatma Indriani, Kunti Rabiatul Mahmudah, K. Satou


Abstract

Identifying human-virus protein-protein interactions (PPIs) is an important task that is increasingly studied with computational methods. Previous research has shown that using the doc2vec encoding scheme for features, combined with a Random Forest classifier, gives promising performance. However, human-virus PPI data are usually imbalanced, and additional preprocessing steps have not been investigated for this task. In this work, we investigate various preprocessing methods and modifications to improve classification performance. The results show that a modification to the feature formulation method, combined with random oversampling, improves the classification AUC from 0.9414 to 0.9448.
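The abstract does not describe how random oversampling was implemented, so the following is only a minimal sketch of the general technique: rows from the minority class are duplicated at random until every class is as large as the largest one. The function name `random_oversample`, the toy feature vectors, and the label convention (1 for an interacting pair, 0 for a non-interacting pair) are assumptions made for illustration and are not taken from the paper.

```python
import random
from collections import Counter


def random_oversample(features, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until every class
    has as many rows as the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())

    # Start from a copy of the original data, then append duplicates.
    out_x, out_y = list(features), list(labels)
    for cls, n in counts.items():
        # Indices of the original rows that belong to this class.
        idx = [i for i, y in enumerate(labels) if y == cls]
        for _ in range(target - n):
            j = rng.choice(idx)
            out_x.append(features[j])
            out_y.append(cls)
    return out_x, out_y


# Toy data (made up): 1 = interacting pair, 0 = non-interacting pair.
X = [[0.1, 0.2], [0.3, 0.1], [0.5, 0.9], [0.7, 0.4], [0.2, 0.8]]
y = [1, 0, 0, 0, 0]

X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))
```

In a typical workflow, oversampling of this kind is applied to the training split only, so that duplicated rows cannot appear in the data used to compute a metric such as AUC.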
