A generalized protein identification method for novel and diverse sequencing technologies

Impact factor: 4.0 · JCR: Q1 (Genetics & Heredity)

NAR Genomics and Bioinformatics · Pub Date: 2024-09-18 · eCollection Date: 2024-09-01 · DOI: 10.1093/nargab/lqae126

Bikash Kumar Bhandari, Nick Goldman

Volume 6, Issue 3, article lqae126. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11409062/pdf/

Citations: 0

Abstract

Protein sequencing is a rapidly evolving field with much progress towards the realization of a new generation of protein sequencers. The early devices, however, may not be able to reliably discriminate all 20 amino acids, resulting in a partial, noisy and possibly error-prone signature of a protein. Rather than achieving de novo sequencing, these devices may aim to identify target proteins by comparing such signatures to databases of known proteins. However, there are no broadly applicable methods for this identification problem. Here, we devise a hidden Markov model method to study the generalized problem of protein identification from noisy signature data. Based on a hypothetical sequencing device that can simulate several novel technologies, we show that on the human protein database (N = 20 181) our method has a good performance under many different operating conditions such as various levels of signal resolvability, different numbers of discriminated amino acids, sequence fragments, and insertion and deletion error rates. Our results demonstrate the possibility of protein identification with high accuracy on many early experimental devices. We anticipate our method to be applicable for a wide range of protein sequencing devices in the future.
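To make the identification problem in the abstract concrete, the following is a minimal toy sketch and not the authors' model. It assumes a hypothetical device that resolves only lysine (K) and cysteine (C), collapses every other residue to "X", and scores a noisy read against each database entry with a forward algorithm that sums over substitution, insertion and deletion events. The alphabet, the error rates, the names `signature`, `likelihood`, `identify` and `TOY_DB`, and the three short sequences are all illustrative assumptions.

```python
from typing import Dict

# Hypothetical device: resolves only K and C; every other residue reads as "X".
RESOLVED = {"K", "C"}
ALPHABET = ("K", "C", "X")

# Made-up toy sequences, not real proteins.
TOY_DB: Dict[str, str] = {
    "P1": "MKTCAAKLC",
    "P2": "MACDEFGHK",
    "P3": "KKKCCCAAA",
}


def signature(protein: str) -> str:
    """Collapse a full amino-acid sequence to the reduced alphabet."""
    return "".join(aa if aa in RESOLVED else "X" for aa in protein)


def likelihood(ref: str, obs: str, p_sub: float = 0.05,
               p_ins: float = 0.02, p_del: float = 0.02) -> float:
    """P(obs | ref), summed over all alignments (forward algorithm).

    The error rates are assumed values chosen for illustration only.
    """
    n, m, a = len(ref), len(obs), len(ALPHABET)
    p_match = 1.0 - p_ins - p_del
    # f[i][j]: probability of having consumed i reference symbols
    # while emitting the first j observed symbols.
    f = [[0.0] * (m + 1) for _ in range(n + 1)]
    f[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            cur = f[i][j]
            if cur == 0.0:
                continue
            if j < m:  # insertion: emit a uniformly random symbol, stay at i
                f[i][j + 1] += cur * p_ins / a
            if i < n:
                f[i + 1][j] += cur * p_del  # deletion: skip ref[i] silently
                if j < m:  # read ref[i], possibly misread as another symbol
                    emit = 1.0 - p_sub if ref[i] == obs[j] else p_sub / (a - 1)
                    f[i + 1][j + 1] += cur * p_match * emit
    return f[n][m] * (1.0 - p_ins)  # stop rather than insert again


def identify(obs: str, database: Dict[str, str]) -> str:
    """Return the database entry whose signature best explains the read."""
    return max(database,
               key=lambda name: likelihood(signature(database[name]), obs))


if __name__ == "__main__":
    noisy_read = "XKXCXKXC"  # signature of P1 with one "X" deleted
    print(identify(noisy_read, TOY_DB))
```

Plain probabilities are adequate for reads this short; for realistic protein lengths the same recursion would need log-space arithmetic or scaling to avoid underflow.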


Source journal

NAR Genomics and Bioinformatics Multiple-

CiteScore

8.00

Self-citation rate

2.20%

Articles published

Review time

15 weeks