Predicting PRDM9 Binding Sites by a Convolutional Neural Network and Verification Using Genetic Recombination Map
Takahiro Nakamura, T. Endo, N. Osada
IPSJ Transactions on Bioinformatics, 2022
DOI: https://doi.org/10.2197/ipsjtbio.15.9
PR domain-containing 9 (PRDM9) is a zinc-finger protein that binds to specific DNA motifs and induces crossing-over between chromosomes, resulting in a high recombination rate around its binding sites. Currently, PRDM9 binding sites are predicted with methods based on motif matching and the position-specific weight matrix (PWM). Meanwhile, convolutional neural networks (CNNs) have shown superior performance in recent studies on identifying protein-binding regions in general, and are expected to perform well in PRDM9 binding site prediction. In this study, we compared the performance of PWM and CNN for predicting PRDM9 binding sites using not only test data but also the correlation between the prediction score for a fragment and the local recombination rate, so that performance could be evaluated without overfitting effects. Approximately 170,000 genomic DNA fragments of the human genome containing PRDM9 peaks from chromatin immunoprecipitation with high-throughput sequencing (ChIP-seq) were used to construct the PWM and the CNN. We found that the CNN outperformed the PWM in terms of area under the ROC curve and other metrics. Furthermore, the prediction scores of the CNN correlated more strongly with the local recombination rate than those of the PWM. We suggest that the superior performance of the CNN is due in part to its ability to capture features of the sequences surrounding actual PRDM9-binding sites.
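
To make the PWM baseline and the ROC-based comparison concrete, the following is a minimal sketch, not the authors' implementation. It builds a log-odds PWM from a few aligned sites, scores each fragment by its best-matching window, and computes the area under the ROC curve by pairwise comparison. The function names, the pseudocount, the uniform background, the forward-strand-only scan, and all sequences are assumptions made for illustration; the toy motif is invented and is not the PRDM9 motif.

```python
import math

BASES = "ACGT"


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM from equal-length aligned sites (assumed uniform background)."""
    length = len(sites[0])
    pwm = []
    for pos in range(length):
        # Start every base at the pseudocount so unseen bases get a finite score.
        counts = {base: pseudocount for base in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2(counts[b] / total / background) for b in BASES})
    return pwm


def score_fragment(pwm, fragment):
    """Return the best log-odds score over all windows (forward strand only)."""
    width = len(pwm)
    return max(
        sum(pwm[i][fragment[start + i]] for i in range(width))
        for start in range(len(fragment) - width + 1)
    )


def roc_auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Toy data, invented for illustration only.
toy_sites = ["CCTCCCT", "CCGCCAT", "CCACCTT", "CCTCCGT"]
bound_fragments = ["AAGCCTCCCTAA", "TTCCGCCATGGA"]
unbound_fragments = ["AAAATTTTAAAA", "GATTACAGATTA"]

toy_pwm = build_pwm(toy_sites)
bound_scores = [score_fragment(toy_pwm, f) for f in bound_fragments]
unbound_scores = [score_fragment(toy_pwm, f) for f in unbound_fragments]
print("AUC on toy data:", roc_auc(bound_scores, unbound_scores))
```

The same `roc_auc` function could be applied to CNN output scores to compare the two models on equal terms. The recombination-rate check described in the abstract would additionally need a per-fragment recombination rate, which this sketch does not attempt to model.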