Wanying Dou, Yihang Liu, Zehai Liu, D. Yerezhepov, U. Kozhamkulov, A. Akilzhanova, Omar Dib, Chee-Kai Chan
An AutoML Approach for Predicting Risk of Progression to Active Tuberculosis based on Its Association with Host Genetic Variations
Proceedings of the 2021 10th International Conference on Bioinformatics and Biomedical Science, published 2021-10-29
DOI: 10.1145/3498731.3498743 (https://doi.org/10.1145/3498731.3498743)
Citations: 1
Abstract
Tuberculosis (TB) is a worldwide health challenge. Mycobacterium tuberculosis (M.tb) can evade the host immune system, which can lead to tuberculosis infection. Household contacts (HHCs) of TB cases have a higher risk of infection. Novel predictive techniques are needed to identify groups at high risk of TB. Susceptibility to tuberculosis is associated with host genetic variations. This work uses the TPOT AutoML tool to mathematically map genetic variations to TB infection status. Machine learning was employed to predict the risk of progression to active tuberculosis based on associated host genetic variation. Of the three configurations used ("TPOT Default", "TPOT sparse", and "TPOT N"), "TPOT Default" and "TPOT sparse" produced the same best performance, both reaching a training CV score of 0.816 and a testing accuracy of 0.625. The gene variants identified using this approach contributed differently to TB infection, as reflected in the feature importance of the classifier. The feature importance of the random forest classifier pipeline in "TPOT sparse" was adopted. The top ten contributing genes were also submitted to Enrichr for gene pathway enrichment analysis. The identified enriched pathways have been shown to be key to TB infection.
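The abstract reports two different numbers (a training cross-validation score and a held-out testing accuracy) and then ranks genes by their contribution to the classifier. The sketch below illustrates how those three quantities relate. It is not the authors' TPOT pipeline: it assumes synthetic genotype data coded 0/1/2, and it swaps in a simple nearest-centroid classifier and a centroid-difference ranking in place of the random forest and its feature importance. All function names, SNP identifiers, and sample sizes are hypothetical.

```python
import random

def make_genotypes(n_samples, n_snps, causal, seed=0):
    """Synthetic data (assumption): each SNP is coded 0/1/2 as a minor-allele count."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_samples):
        label = i % 2  # 1 = active TB, 0 = unaffected contact (balanced by construction)
        row = [rng.choice([0, 1, 2]) for _ in range(n_snps)]
        for j in causal:  # make the chosen SNPs fully informative
            row[j] = 2 if label else 0
        X.append(row)
        y.append(label)
    return X, y

def centroids(X, y):
    """Fit step: per-class mean genotype vector."""
    model = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        model[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return model

def predict(model, x):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    return min(model, key=lambda c: sum((a - b) ** 2 for a, b in zip(x, model[c])))

def accuracy(model, X, y):
    return sum(predict(model, x) == t for x, t in zip(X, y)) / len(y)

def cv_score(X, y, k=5):
    """Mean accuracy over k folds; fold f holds out samples whose index % k == f."""
    scores = []
    for f in range(k):
        train = [i for i in range(len(y)) if i % k != f]
        held = [i for i in range(len(y)) if i % k == f]
        model = centroids([X[i] for i in train], [y[i] for i in train])
        scores.append(accuracy(model, [X[i] for i in held], [y[i] for i in held]))
    return sum(scores) / k

def rank_features(model, names):
    """Crude importance proxy: absolute difference between the two class centroids."""
    diff = [abs(a - b) for a, b in zip(model[1], model[0])]
    return sorted(zip(names, diff), key=lambda p: p[1], reverse=True)

snp_names = [f"snp_{j}" for j in range(10)]  # hypothetical identifiers
X, y = make_genotypes(60, 10, causal=(3, 7))

# Hold out the last 12 samples; cross-validation only ever sees the first 48.
X_train, y_train, X_test, y_test = X[:48], y[:48], X[48:], y[48:]

train_cv = cv_score(X_train, y_train, k=5)   # analogue of the "training CV score"
model = centroids(X_train, y_train)          # refit on the full training split
test_acc = accuracy(model, X_test, y_test)   # analogue of the "testing accuracy"
top = rank_features(model, snp_names)[:2]    # the paper keeps the top ten genes

print(f"training CV score: {train_cv:.3f}")
print(f"testing accuracy:  {test_acc:.3f}")
print("top-ranked SNPs:", [name for name, _ in top])
```

The cross-validation score is computed only on the training split, while the testing accuracy uses samples the model never saw during fitting. A gap between the two, such as the 0.816 versus 0.625 reported above, is what this separation is designed to reveal.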