Guilherme Miura Lavezzo, Marcelo de Souza Lauretto, Luiz Paulo Moura Andrioli, Ariane Machado-Lima
{"title":"位置权重矩阵或循环概率有限自动机:使用哪种模型?用于预测转录因子结合位点的决策规则推论。","authors":"Guilherme Miura Lavezzo, Marcelo de Souza Lauretto, Luiz Paulo Moura Andrioli, Ariane Machado-Lima","doi":"10.1590/1678-4685-GMB-2023-0048","DOIUrl":null,"url":null,"abstract":"<p><p>Prediction of transcription factor binding sites (TFBS) is an example of application of Bioinformatics where DNA molecules are represented as sequences of A, C, G and T symbols. The most used model in this problem is Position Weight Matrix (PWM). Notwithstanding the advantage of being simple, PWMs cannot capture dependency between nucleotide positions, which may affect prediction performance. Acyclic Probabilistic Finite Automata (APFA) is an alternative model able to accommodate position dependencies. However, APFA is a more complex model, which means more parameters have to be learned. In this paper, we propose an innovative method to identify when position dependencies influence preference for PWMs or APFAs. This implied using position dependency features extracted from 1106 sets of TFBS to infer a decision tree able to predict which is the best model - PWM or APFA - for a given set of TFBSs. According to our results, as few as three pinpointed features are able to choose the best model, providing a balance of performance (average precision) and model simplicity.</p>","PeriodicalId":12557,"journal":{"name":"Genetics and Molecular Biology","volume":"46 4","pages":"e20230048"},"PeriodicalIF":1.7000,"publicationDate":"2024-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10945726/pdf/","citationCount":"0","resultStr":"{\"title\":\"Position Weight Matrix or Acyclic Probabilistic Finite Automaton: Which model to use? A decision rule inferred for the prediction of transcription factor binding sites.\",\"authors\":\"Guilherme Miura Lavezzo, Marcelo de Souza Lauretto, Luiz Paulo Moura Andrioli, Ariane Machado-Lima\",\"doi\":\"10.1590/1678-4685-GMB-2023-0048\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>Prediction of transcription factor binding sites (TFBS) is an example of application of Bioinformatics where DNA molecules are represented as sequences of A, C, G and T symbols. The most used model in this problem is Position Weight Matrix (PWM). Notwithstanding the advantage of being simple, PWMs cannot capture dependency between nucleotide positions, which may affect prediction performance. Acyclic Probabilistic Finite Automata (APFA) is an alternative model able to accommodate position dependencies. However, APFA is a more complex model, which means more parameters have to be learned. In this paper, we propose an innovative method to identify when position dependencies influence preference for PWMs or APFAs. This implied using position dependency features extracted from 1106 sets of TFBS to infer a decision tree able to predict which is the best model - PWM or APFA - for a given set of TFBSs. According to our results, as few as three pinpointed features are able to choose the best model, providing a balance of performance (average precision) and model simplicity.</p>\",\"PeriodicalId\":12557,\"journal\":{\"name\":\"Genetics and Molecular Biology\",\"volume\":\"46 4\",\"pages\":\"e20230048\"},\"PeriodicalIF\":1.7000,\"publicationDate\":\"2024-01-19\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10945726/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Genetics and Molecular Biology\",\"FirstCategoryId\":\"99\",\"ListUrlMain\":\"https://doi.org/10.1590/1678-4685-GMB-2023-0048\",\"RegionNum\":4,\"RegionCategory\":\"生物学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2024/1/1 0:00:00\",\"PubModel\":\"eCollection\",\"JCR\":\"Q4\",\"JCRName\":\"BIOCHEMISTRY & MOLECULAR BIOLOGY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Genetics and Molecular Biology","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1590/1678-4685-GMB-2023-0048","RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/1/1 0:00:00","PubModel":"eCollection","JCR":"Q4","JCRName":"BIOCHEMISTRY & MOLECULAR BIOLOGY","Score":null,"Total":0}
Position Weight Matrix or Acyclic Probabilistic Finite Automaton: Which model to use? A decision rule inferred for the prediction of transcription factor binding sites.
Prediction of transcription factor binding sites (TFBS) is an example of application of Bioinformatics where DNA molecules are represented as sequences of A, C, G and T symbols. The most used model in this problem is Position Weight Matrix (PWM). Notwithstanding the advantage of being simple, PWMs cannot capture dependency between nucleotide positions, which may affect prediction performance. Acyclic Probabilistic Finite Automata (APFA) is an alternative model able to accommodate position dependencies. However, APFA is a more complex model, which means more parameters have to be learned. In this paper, we propose an innovative method to identify when position dependencies influence preference for PWMs or APFAs. This implied using position dependency features extracted from 1106 sets of TFBS to infer a decision tree able to predict which is the best model - PWM or APFA - for a given set of TFBSs. According to our results, as few as three pinpointed features are able to choose the best model, providing a balance of performance (average precision) and model simplicity.
期刊介绍:
Genetics and Molecular Biology (formerly named Revista Brasileira de Genética/Brazilian Journal of Genetics - ISSN 0100-8455) is published by the Sociedade Brasileira de Genética (Brazilian Society of Genetics).
The Journal considers contributions that present the results of original research in genetics, evolution and related scientific disciplines. Manuscripts presenting methods and applications only, without an analysis of genetic data, will not be considered.