WAS IT A MATch I SAW? Approximate palindromes lead to overstated false match rates in benchmarks using reversed sequences
George Glidden-Handgis, Travis J Wheeler
Bioinformatics Advances (journal article), published 2024-04-08
DOI: https://doi.org/10.1093/bioadv/vbae052
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11099658/pdf/
Abstract
Background: Software for labeling biological sequences typically produces a theory-based statistic for each match (the E-value) that indicates the likelihood of seeing that match's score by chance. E-values accurately predict the false match rate for comparisons of random (shuffled) sequences, and thus provide a reasoned mechanism for setting score thresholds that enable high sensitivity with a low expected false match rate. This threshold-setting strategy is challenged by real biological sequences, which contain regions of local repetition and low sequence complexity that cause excess matches between non-homologous sequences. Knowing this, tool developers often build benchmarks that use realistic-seeming decoy sequences to explore empirical tradeoffs between sensitivity and false match rate. A recent trend has been to employ reversed biological sequences as realistic decoys, because these preserve the distribution of letters and the existence of local repeats while disrupting the original sequence's functional properties. However, we and others have observed that sequences appear to produce high-scoring alignments to their reversals with surprising frequency, leading to an overstatement of false match risk that may negatively affect downstream analysis.
Results: We demonstrate that an alignment between a sequence S and its (possibly mutated) reversal tends to produce higher scores than an alignment between truly unrelated sequences, even when S is a shuffled string with no notable repetitive or low-complexity regions. This phenomenon is due to the unintuitive fact that sequences, even randomly shuffled ones, contain palindromes that are on average longer than the longest common substrings (LCS) shared between permuted variants of the same sequence. Though the expected palindrome length is only slightly larger than the expected LCS length, the distribution of alignment scores involving reversed sequences is strongly right-shifted, leading to a greatly increased frequency of high-scoring alignments to reversed sequences.
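The comparison described in the Results can be explored with a small simulation. The sketch below is not the authors' code: it assumes uniformly distributed residues over the 20-letter amino acid alphabet, uses exact longest palindromic substrings as a stand-in for the approximate palindromes in the title, and the function names, sequence length, and trial count are all illustrative choices.

```python
import random


def longest_palindrome_length(s):
    """Length of the longest palindromic substring, by expanding around each center."""
    best = 0
    n = len(s)
    # 2n - 1 centers: each position (odd length) and each gap (even length).
    for center in range(2 * n - 1):
        lo = center // 2
        hi = lo + center % 2
        while lo >= 0 and hi < n and s[lo] == s[hi]:
            lo -= 1
            hi += 1
        best = max(best, hi - lo - 1)
    return best


def longest_common_substring_length(a, b):
    """Length of the longest contiguous substring shared by a and b (rolling-row DP)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0] * (len(b) + 1)
        for j, bj in enumerate(b, start=1):
            if ch == bj:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def compare(n=300, trials=50, alphabet="ACDEFGHIKLMNPQRSTVWY", seed=0):
    """Mean longest-palindrome length vs. mean LCS length against a shuffled copy.

    Assumption: residues are drawn uniformly at random, which real
    biological sequences do not follow.
    """
    rng = random.Random(seed)
    pal_total = 0
    lcs_total = 0
    for _ in range(trials):
        s = [rng.choice(alphabet) for _ in range(n)]
        shuffled = s[:]
        rng.shuffle(shuffled)  # permuted variant with the same composition
        pal_total += longest_palindrome_length(s)
        lcs_total += longest_common_substring_length(s, shuffled)
    return pal_total / trials, lcs_total / trials


if __name__ == "__main__":
    mean_pal, mean_lcs = compare()
    print(f"mean longest palindrome: {mean_pal:.2f}")
    print(f"mean LCS with shuffled copy: {mean_lcs:.2f}")
```

This only measures exact substring lengths. The right shift in alignment scores reported above concerns scored alignments, which this sketch does not compute.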
Impact: Overestimates of false match risk can motivate unnecessarily high score thresholds, potentially reducing true match sensitivity. Also, when tool sensitivity is only reported up to the score of the first matched decoy sequence, a large decoy set consisting of reversed sequences can obscure sensitivity differences between tools. As a result of these observations, we advise that reversed biological sequences be used as decoys only when care is taken to remove positive matches in the original (un-reversed) sequences, or when overstatement of false labeling is not a concern. Though the primary focus of the analysis is on sequence annotation, we also demonstrate that the prevalence of internal palindromes may lead to an overstatement of the rate of false labels in protein identification with mass spectrometry.
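To make the "sensitivity up to the first matched decoy" point concrete, here is a minimal sketch of that metric. The function name and all scores are invented toy values for illustration; they are not taken from the paper or from any real tool.

```python
def recall_before_first_decoy(hits, n_true):
    """Fraction of true matches scoring strictly above the best-scoring decoy.

    hits: iterable of (score, is_decoy) pairs; n_true: total number of true matches.
    """
    top_decoy = max(
        (score for score, is_decoy in hits if is_decoy),
        default=float("-inf"),  # no decoy matched: every true hit counts
    )
    found = sum(1 for score, is_decoy in hits if not is_decoy and score > top_decoy)
    return found / n_true


# Toy true-match scores for two hypothetical tools (made-up numbers).
tool_a = [(90, False), (80, False), (70, False), (60, False)]
tool_b = [(90, False), (80, False), (40, False), (30, False)]

# A low-scoring decoy leaves the difference between the tools visible ...
low_decoy = [(35, True)]
print(recall_before_first_decoy(tool_a + low_decoy, 4))   # 1.0
print(recall_before_first_decoy(tool_b + low_decoy, 4))   # 0.75

# ... while one high-scoring decoy hides it.
high_decoy = [(85, True)]
print(recall_before_first_decoy(tool_a + high_decoy, 4))  # 0.25
print(recall_before_first_decoy(tool_b + high_decoy, 4))  # 0.25
```

In this toy setting a single inflated decoy score caps both tools at the same recall, which is the masking effect described in the Impact paragraph.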