A Machine Learning Approach to Identify C Type Lectin Domain (CTLD) Containing Proteins

IF 1.4; Zone 4 (Biology); Q4 (Biochemistry & Molecular Biology)

The Protein Journal Pub Date : 2024-07-28 DOI:10.1007/s10930-024-10224-x

Lovepreet Singh, Sukhwinder Singh, Desh Deepak Singh

The Protein Journal, volume 43, issue 4, pages 718-725. https://link.springer.com/article/10.1007/s10930-024-10224-x

Citations: 0

Abstract

Lectins are sugar-interacting proteins that bind specific glycans reversibly and are present in all forms of life. They have diverse biological functions, such as cell signaling and molecular recognition. C-type lectins (CTL) are a group of proteins from the lectin family that have been studied extensively in animals and are reported to be involved in immune functions, carcinogenesis, and cell signaling. The carbohydrate recognition domain (CRD) in CTL has a highly variable protein sequence, and proteins carrying this domain are also referred to as C-type lectin domain containing proteins (CTLD). Because of this low sequence homology, homology-based programs are limited in their ability to identify CTLD among the hypothetical proteins of sequenced genomes. Machine learning (ML) tools use characteristic features to identify homologous sequences, and ML has been used here to develop a tool for identifying CTLD. Initially, 500 sequences of well-annotated CTLD and 500 non-CTLD sequences were used to develop the machine learning model. The LinearSVC classifier from the scikit-learn library for Python was used, and characteristic features of CTLD sequences, such as dipeptide and tripeptide composition, were used as training attributes in various classifiers. Precision, recall, and Matthews correlation coefficient (MCC) values of 0.92, 0.91, and 0.82, respectively, were obtained on an external test set. After fine-tuning parameters such as kernel, C value, gamma, and degree, and increasing the number of non-CTLD sequences, precision, recall, and MCC improved to 0.99, 0.99, and 0.96, respectively. New CTLD have also been identified in the hypothetical segment of the human genome using the trained model. The tool is available on our local server for interested users.
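The two technical ingredients named in the abstract, k-mer (dipeptide or tripeptide) composition features and the precision/recall/MCC evaluation, can be sketched in plain Python. This is a minimal illustration, not the authors' code: the function names `kmer_composition` and `precision_recall_mcc` are hypothetical, frequencies are normalised by the number of valid windows (the abstract does not state the normalisation used), and MCC is computed as the Matthews correlation coefficient, which is the usual reading of the abbreviation for binary classifiers and is an assumption about the metric reported.

```python
from itertools import product
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def kmer_composition(sequence, k=2):
    """Relative frequency of every possible k-mer (400 values for k=2, 8000 for k=3).

    Hypothetical helper: the normalisation is an assumption, not taken from the paper.
    """
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = sequence.upper()
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows containing non-standard residues
            counts[window] += 1
            total += 1
    return [counts[m] / total if total else 0.0 for m in kmers]


def precision_recall_mcc(y_true, y_pred):
    """Precision, recall and Matthews correlation coefficient for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return precision, recall, mcc


# Toy input only: "ACAC" contains the dipeptides AC, CA, AC.
features = kmer_composition("ACAC", k=2)

# Toy labels only (1 = CTLD, 0 = non-CTLD); these are not data from the study.
precision, recall, mcc = precision_recall_mcc([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1])
```

Feature vectors of this shape (one row per sequence) are the kind of input a linear support vector classifier expects. Concatenating the `k=2` and `k=3` vectors would give a combined dipeptide and tripeptide representation, though whether the features were combined in that way is not stated in the abstract.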




Source journal

The Protein Journal (Biology: Biochemistry & Molecular Biology)

CiteScore

5.20

Self-citation rate

0.00%

Articles published

Review time

12 months

Journal description: The Protein Journal (formerly the Journal of Protein Chemistry) publishes original research work on all aspects of proteins and peptides. These include studies concerned with covalent or three-dimensional structure determination (X-ray, NMR, cryoEM, EPR/ESR, optical methods, etc.), computational aspects of protein structure and function, protein folding and misfolding, assembly, genetics, evolution, proteomics, molecular biology, protein engineering, protein nanotechnology, protein purification and analysis, and peptide synthesis, as well as the elucidation and interpretation of the molecular bases of biological activities of proteins and peptides. We accept original research papers, reviews, mini-reviews, hypotheses, opinion papers, and letters to the editor.