{"title":"STRAS:a snakemake pipeline for genome-wide short tandem repeats annotation and score.","authors":"Mengna Zhang","doi":"10.1007/s00439-024-02662-5","DOIUrl":null,"url":null,"abstract":"<p><p>High-throughput whole genome sequencing (WGS) is clinically used in finding single nucleotide variants and small indels. Several bioinformatics tools are developed to call short tandem repeats (STRs) copy numbers from WGS data, such as ExpansionHunter denovo, GangSTR and HipSTR. However, expansion disorders are rare and it is hard to find candidate expansions in single patient sequencing data with ~ 800,000 STRs calls. In this paper I describe a snakemake pipeline for genome-wide STRs Annotation and Score (STRAS) using a Random Forest (RF) model to predict pathogenicity. The predictor was validated by benchmark data from Clinvar and PUBMED. True positive rate was 93.8%. True negative rate was 98.0%.Precision was 98.6% and recall rate was 93.8%. F1-score was 0.961. Sensitivity was 93.8% and specificity was 99.6%. These results showed STRAS could be a useful tool for clinical researchers to find STR loci of interest and filter out neutral STRs. STRAS is freely available at https://github.com/fancheyu5/STRAS .</p>","PeriodicalId":13175,"journal":{"name":"Human Genetics","volume":null,"pages":null},"PeriodicalIF":3.8000,"publicationDate":"2024-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Human Genetics","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1007/s00439-024-02662-5","RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/3/20 0:00:00","PubModel":"Epub","JCR":"Q2","JCRName":"GENETICS & HEREDITY","Score":null,"Total":0}
引用次数: 0
Abstract
High-throughput whole genome sequencing (WGS) is clinically used in finding single nucleotide variants and small indels. Several bioinformatics tools are developed to call short tandem repeats (STRs) copy numbers from WGS data, such as ExpansionHunter denovo, GangSTR and HipSTR. However, expansion disorders are rare and it is hard to find candidate expansions in single patient sequencing data with ~ 800,000 STRs calls. In this paper I describe a snakemake pipeline for genome-wide STRs Annotation and Score (STRAS) using a Random Forest (RF) model to predict pathogenicity. The predictor was validated by benchmark data from Clinvar and PUBMED. True positive rate was 93.8%. True negative rate was 98.0%.Precision was 98.6% and recall rate was 93.8%. F1-score was 0.961. Sensitivity was 93.8% and specificity was 99.6%. These results showed STRAS could be a useful tool for clinical researchers to find STR loci of interest and filter out neutral STRs. STRAS is freely available at https://github.com/fancheyu5/STRAS .
期刊介绍:
Human Genetics is a monthly journal publishing original and timely articles on all aspects of human genetics. The Journal particularly welcomes articles in the areas of Behavioral genetics, Bioinformatics, Cancer genetics and genomics, Cytogenetics, Developmental genetics, Disease association studies, Dysmorphology, ELSI (ethical, legal and social issues), Evolutionary genetics, Gene expression, Gene structure and organization, Genetics of complex diseases and epistatic interactions, Genetic epidemiology, Genome biology, Genome structure and organization, Genotype-phenotype relationships, Human Genomics, Immunogenetics and genomics, Linkage analysis and genetic mapping, Methods in Statistical Genetics, Molecular diagnostics, Mutation detection and analysis, Neurogenetics, Physical mapping and Population Genetics. Articles reporting animal models relevant to human biology or disease are also welcome. Preference will be given to those articles which address clinically relevant questions or which provide new insights into human biology.
Unless reporting entirely novel and unusual aspects of a topic, clinical case reports, cytogenetic case reports, papers on descriptive population genetics, articles dealing with the frequency of polymorphisms or additional mutations within genes in which numerous lesions have already been described, and papers that report meta-analyses of previously published datasets will normally not be accepted.
The Journal typically will not consider for publication manuscripts that report merely the isolation, map position, structure, and tissue expression profile of a gene of unknown function unless the gene is of particular interest or is a candidate gene involved in a human trait or disorder.