Genome research | Pub Date: 2024-10-15 | DOI: 10.1101/gr.279351.124
Niccolo Tesi, Alex Salazar, Yaran Zhang, Sven van der Lee, Marc Hulsman, Lydian Knoop, Sanduni Wijesekera, Jana Krizova, Anne-Fleur Schneider, Maartje Pennings, Kristel Sleegers, Erik-Jan Kamsteeg, Marcel Reinders, Henne Holstege
{"title":"Characterising tandem repeat complexities across long-read sequencing platforms with TREAT and otter","authors":"Niccolo Tesi, Alex Salazar, Yaran Zhang, Sven van der Lee, Marc Hulsman, Lydian Knoop, Sanduni Wijesekera, Jana Krizova, Anne-Fleur Schneider, Maartje Pennings, Kristel Sleegers, Erik-Jan Kamsteeg, Marcel Reinders, Henne Holstege","doi":"10.1101/gr.279351.124","DOIUrl":"https://doi.org/10.1101/gr.279351.124","url":null,"abstract":"Tandem repeats (TR) play important roles in genomic variation and disease risk in humans. Long-read sequencing allows for the accurate characterization of TRs; however, the underlying bioinformatics perspectives remain challenging. We present otter and TREAT: otter is a fast targeted local assembler, cross-compatible across different sequencing platforms. It is integrated in TREAT, an end-to-end workflow for TR characterization, visualization and analysis across multiple genomes. In a comparison with existing tools based on long-read sequencing data from both Oxford Nanopore Technology (ONT, Simplex and Duplex) and PacBio (Sequel 2 and Revio), otter and TREAT achieved state-of-the-art genotyping and motif characterisation accuracy. Applied to clinically relevant TRs, TREAT/otter significantly identified individuals with pathogenic TR expansions. When applied to a case-control setting, we significantly replicated previously reported associations of TRs with Alzheimer's Disease, including those near or within <em>APOC1</em> (p=2.63×10⁻⁹), <em>SPI1</em> (p=6.5×10⁻³) and <em>ABCA7</em> (p=0.04) genes. We used TREAT/otter to systematically evaluate potential biases when genotyping TRs using diverse ONT and PacBio long-read sequencing datasets. We showed that, in rare cases (0.06%), long-read sequencing suffers from coverage drops in TRs, including the disease-associated TRs in <em>ABCA7</em> and <em>RFC1</em> genes. Such coverage drops can lead to TR misgenotyping, hampering the accurate characterization of TR alleles. Taken together, our tools can accurately genotype TR across different sequencing technologies and with minimal requirements, allowing end-to-end analysis and comparisons of TR in human genomes, with broad applications in research and clinical fields.","PeriodicalId":12678,"journal":{"name":"Genome research","volume":"59 1","pages":""},"PeriodicalIF":7.0,"publicationDate":"2024-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142439682","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Genome research | Pub Date: 2024-10-15 | DOI: 10.1101/gr.278960.124
Haotian Teng, Marcus Stoiber, Ziv Bar-Joseph, Carl Kingsford
{"title":"Detecting m6A RNA modification from nanopore sequencing using a semi-supervised learning framework","authors":"Haotian Teng, Marcus Stoiber, Ziv Bar-Joseph, Carl Kingsford","doi":"10.1101/gr.278960.124","DOIUrl":"https://doi.org/10.1101/gr.278960.124","url":null,"abstract":"Direct nanopore-based RNA sequencing can be used to detect post-transcriptional base modifications, such as m6A methylation, based on the electric current signals produced by the distinct chemical structures of modified bases. A key challenge is the scarcity of adequate training data with known methylation modifications. We present Xron, a hybrid encoder-decoder framework that delivers a direct methylation-distinguishing basecaller by training on synthetic RNA data and immunoprecipitation-based experimental data in two steps. First, we generate data with more diverse modification combinations through in silico cross-linking. Second, we use this dataset to train an end-to-end neural network basecaller followed by fine-tuning on immunoprecipitation-based experimental data with label-smoothing. The trained neural network basecaller outperforms existing methylation detection methods on both read-level and site-level prediction scores. Xron is a standalone, end-to-end m6A-distinguishing basecaller capable of detecting methylated bases directly from raw sequencing signals, enabling de novo methylome assembly.","PeriodicalId":12678,"journal":{"name":"Genome research","volume":"1 1","pages":""},"PeriodicalIF":7.0,"publicationDate":"2024-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142439677","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Genome research | Pub Date: 2024-10-11 | DOI: 10.1101/gr.279131.124
Kaiyuan Zhu, Matthew G Jones, Jens Luebeck, Xinxin Bu, Hyerim Yi, King L Hung, Ivy Tsz-Lo Wong, Shu Zhang, Paul S Mischel, Howard Y Chang, Vineet Bafna
{"title":"CoRAL accurately resolves extrachromosomal DNA genome structures with long-read sequencing.","authors":"Kaiyuan Zhu, Matthew G Jones, Jens Luebeck, Xinxin Bu, Hyerim Yi, King L Hung, Ivy Tsz-Lo Wong, Shu Zhang, Paul S Mischel, Howard Y Chang, Vineet Bafna","doi":"10.1101/gr.279131.124","DOIUrl":"10.1101/gr.279131.124","url":null,"abstract":"Extrachromosomal DNA (ecDNA) is a central mechanism for focal oncogene amplification in cancer, occurring in ∼15% of early-stage cancers and ∼30% of late-stage cancers. ecDNAs drive tumor formation, evolution, and drug resistance by dynamically modulating oncogene copy number and rewiring gene-regulatory networks. Elucidating the genomic architecture of ecDNA amplifications is critical for understanding tumor pathology and developing more effective therapies. Paired-end short-read (Illumina) sequencing and mapping have been utilized to represent ecDNA amplifications using a breakpoint graph, in which the inferred architecture of ecDNA is encoded as a cycle in the graph. Traversals of breakpoint graphs have been used to successfully predict ecDNA presence in cancer samples. However, short-read technologies are intrinsically limited in the identification of breakpoints, phasing together complex rearrangements and internal duplications, and deconvolution of cell-to-cell heterogeneity of ecDNA structures. Long-read technologies, such as from Oxford Nanopore Technologies, have the potential to improve inference as the longer reads are better at mapping structural variants and are more likely to span rearranged or duplicated regions. Here, we propose Complete Reconstruction of Amplifications with Long reads (CoRAL) for reconstructing ecDNA architectures using long-read data. CoRAL reconstructs likely cyclic architectures using quadratic programming that simultaneously optimizes parsimony of reconstruction, explained copy number, and consistency of long-read mapping. CoRAL substantially improves reconstructions in extensive simulations and 10 data sets from previously characterized cell lines compared with previous short- and long-read-based tools. As long-read usage becomes widespread, we anticipate that CoRAL will be a valuable tool for profiling the landscape and evolution of focal amplifications in tumors.","PeriodicalId":12678,"journal":{"name":"Genome research","volume":" ","pages":"1344-1354"},"PeriodicalIF":6.2,"publicationDate":"2024-10-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11529860/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141563231","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Genome research | Pub Date: 2024-10-11 | DOI: 10.1101/gr.279117.124
Enakshi Saha, Viola Fanfani, Panagiotis Mandros, Marouen Ben Guebila, Jonas Fischer, Katherine H Shutta, Dawn L DeMeo, Camila M Lopes-Ramos, John Quackenbush
{"title":"Bayesian inference of sample-specific coexpression networks.","authors":"Enakshi Saha, Viola Fanfani, Panagiotis Mandros, Marouen Ben Guebila, Jonas Fischer, Katherine H Shutta, Dawn L DeMeo, Camila M Lopes-Ramos, John Quackenbush","doi":"10.1101/gr.279117.124","DOIUrl":"10.1101/gr.279117.124","url":null,"abstract":"Gene regulatory networks (GRNs) are effective tools for inferring complex interactions between molecules that regulate biological processes and hence can provide insights into drivers of biological systems. Inferring coexpression networks is a critical element of GRN inference, as the correlation between expression patterns may indicate that genes are coregulated by common factors. However, methods that estimate coexpression networks generally derive an aggregate network representing the mean regulatory properties of the population and so fail to fully capture population heterogeneity. Bayesian optimized networks obtained by assimilating omic data (BONOBO) is a scalable Bayesian model for deriving individual sample-specific coexpression matrices that recognizes variations in molecular interactions across individuals. For each sample, BONOBO assumes a Gaussian distribution on the log-transformed centered gene expression and a conjugate prior distribution on the sample-specific coexpression matrix constructed from all other samples in the data. Combining the sample-specific gene coexpression with the prior distribution, BONOBO yields a closed-form solution for the posterior distribution of the sample-specific coexpression matrices, thus allowing the analysis of large data sets. We demonstrate BONOBO's utility in several contexts, including analyzing gene regulation in yeast transcription factor knockout studies, the prognostic significance of miRNA-mRNA interaction in human breast cancer subtypes, and sex differences in gene regulation within human thyroid tissue. We find that BONOBO outperforms other methods that have been used for sample-specific coexpression network inference and provides insight into individual differences in the drivers of biological processes.","PeriodicalId":12678,"journal":{"name":"Genome research","volume":" ","pages":"1397-1410"},"PeriodicalIF":6.2,"publicationDate":"2024-10-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11529861/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141970984","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Genome research | Pub Date: 2024-10-11 | DOI: 10.1101/gr.279143.124
Ghanshyam Chandra, Daniel Gibney, Chirag Jain
{"title":"Haplotype-aware sequence alignment to pangenome graphs.","authors":"Ghanshyam Chandra, Daniel Gibney, Chirag Jain","doi":"10.1101/gr.279143.124","DOIUrl":"10.1101/gr.279143.124","url":null,"abstract":"Modern pangenome graphs are built using haplotype-resolved genome assemblies. When mapping reads to a pangenome graph, prioritizing alignments that are consistent with the known haplotypes improves genotyping accuracy. However, the existing rigorous formulations for colinear chaining and alignment problems do not consider the haplotype paths in a pangenome graph. This often leads to spurious read alignments to those paths that are unlikely recombinations of the known haplotypes. In this paper, we develop novel formulations and algorithms for sequence-to-graph alignment and chaining problems. Inspired by the genotype imputation models, we assume that a query sequence is an imperfect mosaic of reference haplotypes. Accordingly, we introduce a recombination penalty in the scoring functions for each haplotype switch. First, we solve haplotype-aware sequence-to-graph alignment in [Formula: see text] time, where <i>Q</i> is the query sequence, <i>E</i> is the set of edges, and H is the set of haplotypes represented in the graph. To complement our solution, we prove that an algorithm significantly faster than [Formula: see text] is impossible under the strong exponential time hypothesis (SETH). Second, we propose a haplotype-aware chaining algorithm that runs in [Formula: see text] time after graph preprocessing, where <i>N</i> is the count of input anchors. We then establish that a chaining algorithm significantly faster than [Formula: see text] is impossible under SETH. As a proof-of-concept, we implemented our chaining algorithm in the Minichain aligner. By aligning sequences sampled from the human major histocompatibility complex (MHC) to a pangenome graph of 60 MHC haplotypes, we demonstrate that our algorithm achieves better consistency with ground-truth recombinations compared with a haplotype-agnostic algorithm.","PeriodicalId":12678,"journal":{"name":"Genome research","volume":" ","pages":"1265-1275"},"PeriodicalIF":6.2,"publicationDate":"2024-10-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11529843/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141626498","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Genome research | Pub Date: 2024-09-27 | DOI: 10.1101/gr.279252.124
Daniel Sens, Liubov Shilova, Ludwig Gräf, Maria Grebenshchikova, Bjoern M. Eskofier, Francesco Paolo Casale
{"title":"Genetics-driven risk predictions leveraging the Mendelian randomization framework","authors":"Daniel Sens, Liubov Shilova, Ludwig Gräf, Maria Grebenshchikova, Bjoern M. Eskofier, Francesco Paolo Casale","doi":"10.1101/gr.279252.124","DOIUrl":"https://doi.org/10.1101/gr.279252.124","url":null,"abstract":"Accurate predictive models of future disease onset are crucial for effective preventive healthcare, yet longitudinal data sets linking early risk factors to subsequent health outcomes are limited. To overcome this challenge, we introduce a novel framework, Predictive Risk modeling using Mendelian Randomization (PRiMeR), which utilizes genetic effects as supervisory signals to learn disease risk predictors without relying on longitudinal data. To do so, PRiMeR leverages risk factors and genetic data from a healthy cohort, along with results from genome-wide association studies of diseases of interest. After training, the learned predictor can be used to assess risk for new patients solely based on risk factors. We validate PRiMeR through comprehensive simulations and in future type 2 diabetes predictions in UK Biobank participants without diabetes, using follow-up onset labels for validation. Moreover, we apply PRiMeR to predict future Alzheimer's disease onset from brain imaging biomarkers and future Parkinson's disease onset from accelerometer-derived traits. Overall, with PRiMeR we offer a new perspective in predictive modeling, showing it is possible to learn risk predictors leveraging genetics rather than longitudinal data.","PeriodicalId":12678,"journal":{"name":"Genome research","volume":"120 1","pages":""},"PeriodicalIF":7.0,"publicationDate":"2024-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142329221","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Genome research | Pub Date: 2024-09-26 | DOI: 10.1101/gr.279526.124
Marcin P Sajek, Danielle Y Bilodeau, Michael A Beer, Emma Horton, Yukiko Miyamoto, Katrina B Velle, Lars Eckmann, Lillian Fritz-Laylin, Olivia S Rissland, Neelanjan Mukherjee
{"title":"Evolutionary dynamics of polyadenylation signals and their recognition strategies in protists","authors":"Marcin P Sajek, Danielle Y Bilodeau, Michael A Beer, Emma Horton, Yukiko Miyamoto, Katrina B Velle, Lars Eckmann, Lillian Fritz-Laylin, Olivia S Rissland, Neelanjan Mukherjee","doi":"10.1101/gr.279526.124","DOIUrl":"https://doi.org/10.1101/gr.279526.124","url":null,"abstract":"The poly(A) signal, together with auxiliary elements, directs cleavage of a pre-mRNA and thus determines the 3' end of the mature transcript. In many species, including humans, the poly(A) signal is an AAUAAA hexamer, but we recently found that the deeply branching eukaryote <em>Giardia lamblia</em> uses a distinct hexamer (AGURAA) and lacks any known auxiliary elements. Our discovery prompted us to explore the evolutionary dynamics of poly(A) signals and auxiliary elements in the eukaryotic kingdom. We used direct RNA sequencing to determine poly(A) signals for four protists within the Metamonada clade (which also contains <em>Giardia lamblia</em>) and two outgroup protists. These experiments revealed that the AAUAAA hexamer serves as the poly(A) signal in at least four different eukaryotic clades, indicating that it is likely the ancestral signal, whereas the unusual <em>Giardia</em> version is derived. We found that the use and relative strengths of auxiliary elements are also surprisingly plastic; in fact, within Metamonada, species like <em>Giardia lamblia</em> make use of a previously unrecognized auxiliary element where nucleotides flanking the poly(A) signal itself specify genuine cleavage sites. Thus, despite the fundamental nature of pre-mRNA cleavage for the expression of all protein-coding genes, the motifs controlling this process are dynamic on evolutionary timescales, providing motivation for future biochemical and structural studies as well as new therapeutic angles to target eukaryotic pathogens.","PeriodicalId":12678,"journal":{"name":"Genome research","volume":"31 1","pages":""},"PeriodicalIF":7.0,"publicationDate":"2024-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142325574","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Genome research | Pub Date: 2024-09-26 | DOI: 10.1101/gr.278588.123
Taishan Hu, Timothy L. Mosbruger, Nikolaos G. Tairis, Amalia Dinou, Pushkala Jayaraman, Mahdi Sarmady, Kingham Brewster, Yang Li, Tristan J. Hayeck, Jamie L. Duke, Dimitri S. Monos
{"title":"Targeted and complete genomic sequencing of the Major Histocompatibility Complex in haplotypic form of individual heterozygous samples","authors":"Taishan Hu, Timothy L. Mosbruger, Nikolaos G. Tairis, Amalia Dinou, Pushkala Jayaraman, Mahdi Sarmady, Kingham Brewster, Yang Li, Tristan J. Hayeck, Jamie L. Duke, Dimitri S. Monos","doi":"10.1101/gr.278588.123","DOIUrl":"https://doi.org/10.1101/gr.278588.123","url":null,"abstract":"The human Major Histocompatibility Complex (MHC) is an approximately 4 Mb genomic segment on Chromosome 6 that plays a pivotal role in the immune response. Despite its importance in various traits and diseases, its complex nature makes it challenging to accurately characterize on a routine basis. We present a novel approach allowing targeted sequencing and de novo haplotypic assembly of the MHC region in heterozygous samples, using long-read sequencing technologies. Our approach is validated using two reference samples, two family trios, and an African-American sample. We achieved excellent coverage (96.6-99.9% with at least 30× depth) and high accuracy (99.89-99.99%) for the different haplotypes. This methodology offers a reliable and cost-effective method for sequencing and fully characterizing the MHC without the need for whole-genome sequencing, facilitating broader studies on this important genomic segment and having significant implications in immunology, genetics and medicine.","PeriodicalId":12678,"journal":{"name":"Genome research","volume":"217 1","pages":""},"PeriodicalIF":7.0,"publicationDate":"2024-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142325575","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Genome research | Pub Date: 2024-09-25 | DOI: 10.1101/gr.279142.124
Avantika Lal, David Garfield, Tommaso Biancalani, Gokcen Eraslan
{"title":"Designing realistic regulatory DNA with autoregressive language models","authors":"Avantika Lal, David Garfield, Tommaso Biancalani, Gokcen Eraslan","doi":"10.1101/gr.279142.124","DOIUrl":"https://doi.org/10.1101/gr.279142.124","url":null,"abstract":"<em>Cis</em>-regulatory elements (CREs), such as promoters and enhancers, are DNA sequences that regulate the expression of genes. The activity of a CRE is influenced by the order, composition, and spacing of sequence motifs that are bound by proteins called transcription factors (TFs). Synthetic CREs with specific properties are needed for biomanufacturing as well as for many therapeutic applications including cell and gene therapy. Here, we present regLM, a framework to design synthetic CREs with desired properties, such as high, low, or cell type–specific activity, using autoregressive language models in conjunction with supervised sequence-to-function models. We used our framework to design synthetic yeast promoters and cell type–specific human enhancers. We demonstrate that the synthetic CREs generated by our approach are not only predicted to have the desired functionality but also contain biological features similar to experimentally validated CREs. regLM thus facilitates the design of realistic regulatory DNA elements while providing insights into the <em>cis</em>-regulatory code.","PeriodicalId":12678,"journal":{"name":"Genome research","volume":"65 1","pages":""},"PeriodicalIF":7.0,"publicationDate":"2024-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142321736","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Genome research | Pub Date: 2024-09-25 | DOI: 10.1101/gr.279415.124
James L Shepherdson, David M Granas, Jie Li, Zara Shariff, Stephen P Plassmeyer, Alex S Holehouse, Michael A White, Barak A Cohen
{"title":"Mutational scanning of CRX classifies clinical variants and reveals biochemical properties of the transcriptional effector domain","authors":"James L Shepherdson, David M Granas, Jie Li, Zara Shariff, Stephen P Plassmeyer, Alex S Holehouse, Michael A White, Barak A Cohen","doi":"10.1101/gr.279415.124","DOIUrl":"https://doi.org/10.1101/gr.279415.124","url":null,"abstract":"The transcription factor (TF) cone-rod homeobox (CRX) is essential for the differentiation and maintenance of photoreceptor cell identity. Several human CRX variants cause degenerative retinopathies, but most are variants of uncertain significance (VUS). We performed a deep mutational scan (DMS) of nearly all possible single amino acid substitutions in CRX using a cell-based transcriptional reporter assay, curating a high-confidence list of nearly 2,000 variants with altered transcriptional activity. In the structured homeodomain, activity scores closely aligned to a predicted structure and demonstrated position-specific constraints on amino acid substitution. By contrast, the intrinsically disordered transcriptional effector domain displayed a qualitatively different pattern of substitution effects, following compositional constraints without specific residue position requirements in the peptide chain. These compositional constraints were consistent with the acidic exposure model of transcriptional activation. We evaluated the performance of the DMS assay as a clinical variant classification tool using gold-standard classified human variants from ClinVar, identifying pathogenic variants with high specificity and moderate sensitivity. That this performance could be achieved using a synthetic reporter assay in a foreign cell type, even for a highly cell type-specific TF like CRX, suggests that this approach shows promise for DMS of other TFs that function in cell types that are not easily accessible. Together, the results of the CRX DMS identify molecular features of the CRX effector domain and demonstrate utility for integration into the clinical variant classification pipeline.","PeriodicalId":12678,"journal":{"name":"Genome research","volume":"2 1","pages":""},"PeriodicalIF":7.0,"publicationDate":"2024-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142321735","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}