SSR_VibraProfiler: a Python package for accurate classification of varieties using SSRs with intra-variety specificity and inter-variety polymorphism.

Impact factor 4.4 · Zone 2 (Biology) · JCR Q1 (Biochemical Research Methods)
Chenhao Jiang, Chuan Dong, Zhenzhen Wu, Chenyi Shi, Qiannan Ye, Xiaopei Wu, Siyi Ma, Yuming Wen, Guoping Yu, Jiasheng Wu, Chengjun Zhang
Plant Methods 21(1):61, published 2025-05-16. DOI: https://doi.org/10.1186/s13007-025-01380-x. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12082954/pdf/

Abstract

Background: Simple sequence repeats (SSRs) are widely used as molecular markers; however, developing SSR markers has traditionally relied heavily on experimental methods. Advances in sequencing technology now make it possible to extract SSR characteristics directly from sequencing data and use them for variety identification.

Results: We developed a computational framework for variety identification that treats the presence or absence of each SSR in the sequencing data as a numerical characteristic, ignoring specific loci, flanking sequences, and occurrence counts. Variety identification is therefore performed directly on the numerical characteristic matrix and does not depend on experimental validation. Using a formula, we measure the variance of these characteristics within and among varieties and select the SSRs that exhibit intra-variety specificity and inter-variety polymorphism, forming a binary (0/1) matrix. We project this matrix onto a two-dimensional plane with t-SNE (t-distributed Stochastic Neighbor Embedding) and then cluster the individuals with K-means. Comparing the cluster labels with the true labels gives a preliminary assessment of how well the matrix separates varieties. Finally, we construct a recognition model based on the SSR matrix and apply it to variety identification.

The workflow is packaged as SSR_VibraProfiler, which can serve as a tool for constructing an SSR-based variety DNA fingerprint database. We tested the package on a Rhododendron dataset of 40 individuals from 8 varieties. t-SNE dimensionality reduction followed by K-means clustering achieved 100% accuracy. Leave-one-out validation further confirmed the reliability of the method in predicting varieties. The package is freely available at https://github.com/Olcat35412/SSR_VibraProfiler.
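The selection and evaluation steps described in the Results can be illustrated with a small standard-library sketch. It is not the package's code: the abstract does not give the selection formula, so the criterion used here (zero variance within every variety, non-zero variance among variety means) is an assumption, as are the function names, the majority-vote scoring of cluster labels, and the nearest-neighbour classifier used for leave-one-out.

```python
from collections import Counter, defaultdict
from statistics import pvariance


def select_ssrs(matrix, labels):
    """Return column indices of SSRs that are constant within every variety
    but differ among varieties (assumed criterion, not the paper's formula)."""
    by_variety = defaultdict(list)
    for row, label in zip(matrix, labels):
        by_variety[label].append(row)
    selected = []
    for j in range(len(matrix[0])):
        # Variance of the 0/1 values inside each variety.
        within = [pvariance([r[j] for r in rows]) for rows in by_variety.values()]
        # Variance of the per-variety means captures polymorphism among varieties.
        means = [sum(r[j] for r in rows) / len(rows) for rows in by_variety.values()]
        if max(within) == 0 and pvariance(means) > 0:
            selected.append(j)
    return selected


def cluster_accuracy(cluster_ids, labels):
    """Assign each cluster its majority true label, then score the agreement."""
    members = defaultdict(list)
    for cluster, label in zip(cluster_ids, labels):
        members[cluster].append(label)
    correct = sum(Counter(ys).most_common(1)[0][1] for ys in members.values())
    return correct / len(labels)


def leave_one_out_accuracy(matrix, labels):
    """Predict each held-out individual from its nearest neighbour (Hamming distance)."""
    hits = 0
    for i, row in enumerate(matrix):
        nearest = min(
            (k for k in range(len(matrix)) if k != i),
            key=lambda k: sum(a != b for a, b in zip(row, matrix[k])),
        )
        hits += labels[nearest] == labels[i]
    return hits / len(matrix)


# Toy presence/absence matrix: 4 individuals x 4 SSRs, two hypothetical varieties.
matrix = [
    [1, 0, 1, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 1],
    [0, 1, 0, 1],
]
labels = ["A", "A", "B", "B"]

kept = select_ssrs(matrix, labels)
reduced = [[row[j] for j in kept] for row in matrix]
print(kept)
print(leave_one_out_accuracy(reduced, labels))
print(cluster_accuracy([0, 0, 1, 1], labels))
```

In this sketch the SSRs are selected once on all individuals before the leave-one-out loop. A stricter estimate would repeat the selection inside each fold so the held-out individual never influences which SSRs are kept. The cluster ids passed to `cluster_accuracy` would come from K-means on the t-SNE coordinates in the workflow described above; here they are supplied by hand.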

Conclusion: We introduce SSR_VibraProfiler, a Python package that distinguishes individuals and predicts their variety without a reference genome by extracting SSR numerical characteristics from next-generation sequencing data. This tool will contribute to the development, identification, and protection of new varieties.
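To make "SSR numerical characteristics" concrete, the following is a minimal sketch of turning raw reads into a presence/absence matrix without a reference genome. It is not the package's implementation: the regular expression, the single repeat threshold (at least 4 copies of a 1 to 6 bp motif), the `(motif)count` naming of each SSR, and the function names are all assumptions made for illustration.

```python
import re

# Assumed definition of a perfect SSR: a motif of 1-6 bp repeated at least 4 times.
SSR_PATTERN = re.compile(r"([ACGT]{1,6}?)\1{3,}")


def ssr_set(reads):
    """Collect the distinct SSRs seen in a sample, ignoring position and how often each occurs."""
    found = set()
    for read in reads:
        for match in SSR_PATTERN.finditer(read.upper()):
            motif = match.group(1)
            repeats = len(match.group(0)) // len(motif)
            found.add(f"({motif}){repeats}")
    return found


def presence_matrix(samples):
    """Build a 0/1 row per sample over the union of SSRs found in any sample."""
    per_sample = {name: ssr_set(reads) for name, reads in samples.items()}
    columns = sorted(set().union(*per_sample.values()))
    rows = {
        name: [int(ssr in found) for ssr in columns]
        for name, found in per_sample.items()
    }
    return columns, rows


# Hypothetical reads for two individuals.
samples = {
    "ind1": ["GGATATATATCC", "TTTTTTGCA"],
    "ind2": ["CCATATATATGG"],
}
columns, rows = presence_matrix(samples)
print(columns)
print(rows)
```

The sketch treats shifted or reverse-complemented motifs (for example `AT` and `TA`) as different SSRs and applies the same repeat threshold to every motif length. A real pipeline would likely canonicalise motifs and use length-specific thresholds.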

Source journal: Plant Methods (Biology, Plant Sciences)
CiteScore: 9.20
Self-citation rate: 3.90%
Articles published: 121
Review time: 2 months
About the journal: Plant Methods is an open access, peer-reviewed, online journal for the plant research community that encompasses all aspects of technological innovation in the plant sciences. There is no doubt that we have entered an exciting new era in plant biology. The completion of the Arabidopsis genome sequence and the rapid progress being made in other plant genomics projects are providing unparalleled opportunities for progress in all areas of plant science. Nevertheless, enormous challenges lie ahead if we are to understand the function of every gene in the genome, and how the individual parts work together to make the whole organism. Achieving these goals will require an unprecedented collaborative effort, combining high-throughput, system-wide technologies with more focused approaches that integrate traditional disciplines such as cell biology, biochemistry and molecular genetics. Technological innovation is probably the most important catalyst for progress in any scientific discipline. Plant Methods’ goal is to stimulate the development and adoption of new and improved techniques and research tools and, where appropriate, to promote consistency of methodologies for better integration of data from different laboratories.