Latest articles from GigaScience

scShapes: a statistical framework for identifying distribution shapes in single-cell RNA-sequencing data.
IF 11.8 Zone 2 Biology
GigaScience Pub Date : 2022-12-28 Epub Date: 2023-01-24 DOI: 10.1093/gigascience/giac126
Malindrie Dharmaratne, Ameya S Kulkarni, Atefeh Taherian Fard, Jessica C Mar
Abstract
Background: Single-cell RNA sequencing (scRNA-seq) methods have been advantageous for quantifying cell-to-cell variation by profiling the transcriptomes of individual cells. For scRNA-seq data, variability in gene expression reflects the degree of variation in gene expression from one cell to another. Analyses that focus on cell-cell variability therefore are useful for going beyond changes based on average expression and, instead, identifying genes with homogeneous expression versus those that vary widely from cell to cell.
Results: We present a novel statistical framework, scShapes, for identifying differential distributions in single-cell RNA-sequencing data using generalized linear models. Most approaches for differential gene expression detect shifts in the mean value. However, as single-cell data are driven by overdispersion and dropouts, moving beyond means and using distributions that can handle excess zeros is critical. scShapes quantifies gene-specific cell-to-cell variability by testing for differences in the expression distribution while flexibly adjusting for covariates if required. We demonstrate that scShapes identifies subtle variations that are independent of altered mean expression and detects biologically relevant genes that were not discovered through standard approaches.
Conclusions: This analysis also draws attention to genes that switch distribution shapes from a unimodal distribution to a zero-inflated distribution and raises open questions about the plausible biological mechanisms that may give rise to this, such as transcriptional bursting. Overall, the results from scShapes help to expand our understanding of the role that gene expression plays in the transcriptional regulation of a specific perturbation or cellular phenotype. Our framework scShapes is incorporated into a Bioconductor R package (https://www.bioconductor.org/packages/release/bioc/html/scShapes.html).
Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9871437/pdf/
Citations: 0
MuLan-Methyl - multiple transformer-based language models for accurate DNA methylation prediction
IF 9.2 Zone 2 Biology
GigaScience Pub Date : 2022-12-28 DOI: 10.1101/2023.01.04.522704
Wenhuan Zeng, A. Gautam, D. Huson
Abstract: Transformer-based language models are successfully used to address massive text-related tasks. DNA methylation is an important epigenetic mechanism and its analysis provides valuable insights into gene regulation and biomarker identification. Several deep learning-based methods have been proposed to identify DNA methylation and each seeks to strike a balance between computational effort and accuracy. Here, we introduce MuLan-Methyl, a deep-learning framework for predicting DNA methylation sites, which is based on five popular transformer-based language models. The framework identifies methylation sites for three different types of DNA methylation, namely N6-adenine, N4-cytosine, and 5-hydroxymethylcytosine. Each of the employed language models is adapted to the task using the "pre-train and fine-tune" paradigm. Pre-training is performed on a custom corpus of DNA fragments and taxonomy lineages using self-supervised learning. Fine-tuning aims at predicting the DNA-methylation status of each type. The five models are used to collectively predict the DNA methylation status. We report excellent performance of MuLan-Methyl on a benchmark dataset. Moreover, we argue that the model captures characteristic differences between different species that are relevant for methylation. This work demonstrates that language models can be successfully adapted to applications in biological sequence analysis and that joint utilization of different language models improves model performance. MuLan-Methyl is open source and we provide a web server that implements the approach.
Key points: MuLan-Methyl aims at identifying three types of DNA-methylation sites. It uses an ensemble of five transformer-based language models, which were pre-trained and fine-tuned on a custom corpus. The self-attention mechanism of transformers gives rise to importance scores, which can be used to extract motifs. The method performs favorably in comparison to existing methods. The implementation can be applied to chromosomal sequences to predict methylation sites.
Citations: 5
Finding haplotypic signatures in proteins
IF 9.2 Zone 2 Biology
GigaScience Pub Date : 2022-12-28 DOI: 10.1101/2022.11.21.517096
J. Vašíček, Dafni Skiadopoulou, K. Kuznetsova, Bo Wen, S. Johansson, P. Njølstad, Stefan Bruckner, L. Käll, Marc Vaudel
Abstract: The non-random distribution of alleles of common genomic variants produces haplotypes, which are fundamental in medical and population genetic studies. Consequently, protein-coding genes with different co-occurring sets of alleles can encode different amino acid sequences: protein haplotypes. These protein haplotypes are present in biological samples, and detectable by mass spectrometry, but are not accounted for in proteomic searches. Consequently, the impact of haplotypic variation on the results of proteomic searches, and the discoverability of peptides specific to haplotypes remain unknown. Here, we study how common genetic haplotypes influence the proteomic search space and investigate the possibility to match peptides containing multiple amino acid substitutions to a publicly available data set of mass spectra. We found that for 9.96% of the discoverable amino acid substitutions encoded by common haplotypes, two or more substitutions may co-occur in the same peptide after tryptic digestion of the protein haplotypes. We identified 342 spectra that matched to such multi-variant peptides, and out of the 4,251 amino acid substitutions identified, 6.63% were covered by multi-variant peptides. However, the evaluation of the reliability of these matches remains challenging, suggesting that refined error rate estimation procedures are needed for such complex proteomic searches. As these become available and the ability to analyze protein haplotypes increases, we anticipate that proteomics will provide new information on the consequences of common variation, across tissues and time.
Citations: 0
Parvovirus dark matter in the cloaca of wild birds.
IF 11.8 Zone 2 Biology
GigaScience Pub Date : 2022-12-28 Epub Date: 2023-02-03 DOI: 10.1093/gigascience/giad001
Ziyuan Dai, Haoning Wang, Haisheng Wu, Qing Zhang, Likai Ji, Xiaochun Wang, Quan Shen, Shixing Yang, Xiao Ma, Tongling Shan, Wen Zhang
Abstract: With the development of viral metagenomics and next-generation sequencing technology, more and more novel parvoviruses have been identified in recent years, including even entirely new lineages. The Parvoviridae family includes a different group of viruses that can infect a wide variety of animals. In this study, systematic analysis was performed to identify the "dark matter" (datasets that cannot be easily attributed to known viruses) of parvoviruses and to explore their genetic diversity from wild birds' cloacal swab samples. We have tentatively defined this parvovirus "dark matter" as a highly divergent lineage in the Parvoviridae family. All parvoviruses showed several characteristics, including 2 major protein-coding genes and similar genome lengths. Moreover, we observed that the novel parvo-like viruses share similar genome organizations to most viruses in Parvoviridae but could not be clustered with the established subfamilies in phylogenetic analysis. We also found some new members associated with the Bidnaviridae family, which may be derived from parvovirus. This suggests that systematic analysis of domestic and wild animal samples is necessary to explore the genetic diversity of parvoviruses and to mine for more of this potential dark matter.
Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9896142/pdf/
Citations: 0
DivBrowse-interactive visualization and exploratory data analysis of variant call matrices.
IF 11.8 Zone 2 Biology
GigaScience Pub Date : 2022-12-28 Epub Date: 2023-04-21 DOI: 10.1093/gigascience/giad025
Patrick König, Sebastian Beier, Martin Mascher, Nils Stein, Matthias Lange, Uwe Scholz
Abstract
Background: The sequencing of whole genomes is becoming increasingly affordable. In this context, large-scale sequencing projects are generating ever larger datasets of species-specific genomic diversity. As a consequence, more and more genomic data need to be made easily accessible and analyzable to the scientific community.
Findings: We present DivBrowse, a web application for interactive visualization and exploratory analysis of genomic diversity data stored in Variant Call Format (VCF) files of any size. By seamlessly combining BLAST as an entry point together with interactive data analysis features such as principal component analysis in one graphical user interface, DivBrowse provides a novel and unique set of exploratory data analysis capabilities for genomic biodiversity datasets. The capability to integrate DivBrowse into existing web applications supports interoperability between different web applications. Built-in interactive computation of principal component analysis allows users to perform ad hoc analysis of the population structure based on specific genetic elements such as genes and exons. Data interoperability is supported by the ability to export genomic diversity data in VCF and General Feature Format 3 files.
Conclusion: DivBrowse offers a novel approach for interactive visualization and analysis of genomic diversity data and optionally also gene annotation data by including features like interactive calculation of variant frequencies and principal component analysis. The use of established standard file formats for data input supports interoperability and seamless deployment of application instances based on the data output of established bioinformatics pipelines.
Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10120423/pdf/
Citations: 0
The Regulatory Mendelian Mutation score for GRCh38.
IF 11.8 Zone 2 Biology
GigaScience Pub Date : 2022-12-28 Epub Date: 2023-04-21 DOI: 10.1093/gigascience/giad024
Max Schubach, Lusiné Nazaretyan, Martin Kircher
Abstract
Background: Genome sequencing efforts for individuals with rare Mendelian disease have increased the research focus on the noncoding genome and the clinical need for methods that prioritize potentially disease causal noncoding variants. Some tools for assessment of variant pathogenicity as well as annotations are not available for the current human genome build (GRCh38), for which the adoption in databases, software, and pipelines was slow.
Results: Here, we present an updated version of the Regulatory Mendelian Mutation (ReMM) score, retrained on features and variants derived from the GRCh38 genome build. Like its GRCh37 version, it achieves good performance on its highly imbalanced data. To improve accessibility and provide users with a toolbox to score their variant files and look up scores in the genome, we developed a website and API for easy score lookup.
Conclusions: Scores of the GRCh38 genome build are highly correlated to the prior release with a performance increase due to the better coverage of features. For prioritization of noncoding mutations in imbalanced datasets, the ReMM score performed much better than other variation scores. Prescored whole-genome files of GRCh37 and GRCh38 genome builds are cited in the article and the website; UCSC genome browser tracks, and an API are available at https://remm.bihealth.org.
Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10120424/pdf/
Citations: 0
Chromosome-level genome assembly of goose provides insight into the adaptation and growth of local goose breeds.
IF 11.8 Zone 2 Biology
GigaScience Pub Date : 2022-12-28 Epub Date: 2023-02-03 DOI: 10.1093/gigascience/giad003
Qiqi Zhao, Zhenping Lin, Junpeng Chen, Zi Xie, Jun Wang, Keyu Feng, Wencheng Lin, Hongxin Li, Zezhong Hu, Weiguo Chen, Feng Chen, Muhammad Junaid, Huanmin Zhang, Qingmei Xie, Xinheng Zhang
Abstract
Background: Anatidae contains numerous waterfowl species with great economic value, but the genetic diversity basis remains insufficiently investigated. Here, we report a chromosome-level genome assembly of Lion-head goose (Anser cygnoides), a native breed in South China, through the combination of PacBio, Bionano, and Hi-C technologies.
Findings: The assembly had a total genome size of 1.19 Gb, consisting of 1,859 contigs with an N50 length of 20.59 Mb, generating 40 pseudochromosomes, representing 97.27% of the assembled genome, and identifying 21,208 protein-coding genes. Comparative genomic analysis revealed that geese and ducks diverged approximately 28.42 million years ago, and geese have undergone massive gene family expansion and contraction. To identify genetic markers associated with body weight in different geese breeds, including Wuzong goose, Huangzong goose, Magang goose, and Lion-head goose, a genome-wide association study was performed, yielding an average of 1,520.6 Mb of raw data that detected 44,858 single-nucleotide polymorphisms (SNPs). Genome-wide association study showed that 6 SNPs were significantly associated with body weight and 25 were potentially associated. The significantly associated SNPs were annotated as LDLRAD4, GPR180, and OR, enriching in growth factor receptor regulation pathways.
Conclusions: We present the first chromosome-level assembly of the Lion-head goose genome, which will expand the genomic resources of the Anatidae family, providing a basis for adaptation and evolution. Candidate genes significantly associated with different goose breeds may serve to understand the underlying mechanisms of weight differences.
Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9896136/pdf/
Citations: 0
Genome assembly of 3 Amazonian Morpho butterfly species reveals Z-chromosome rearrangements between closely related species living in sympatry.
IF 11.8 Zone 2 Biology
GigaScience Pub Date : 2022-12-28 Epub Date: 2023-05-22 DOI: 10.1093/gigascience/giad033
Héloïse Bastide, Manuela López-Villavicencio, David Ogereau, Joanna Lledo, Anne-Marie Dutrillaux, Vincent Debat, Violaine Llaurens
Abstract: The genomic processes enabling speciation and species coexistence in sympatry are still largely unknown. Here we describe the whole-genome sequencing and assembly of 3 closely related species from the butterfly genus Morpho: Morpho achilles (Linnaeus, 1758), Morpho helenor (Cramer, 1776), and Morpho deidamia (Hübner, 1819). These large blue butterflies are emblematic species of the Amazonian rainforest. They live in sympatry in a wide range of their geographical distribution and display parallel diversification of dorsal wing color pattern, suggesting local mimicry. By sequencing, assembling, and annotating their genomes, we aim at uncovering prezygotic barriers preventing gene flow between these sympatric species. We found a genome size of 480 Mb for the 3 species and a chromosomal number ranging from 2n = 54 for M. deidamia to 2n = 56 for M. achilles and M. helenor. We also detected inversions on the sex chromosome Z that were differentially fixed between species, suggesting that chromosomal rearrangements may contribute to their reproductive isolation. The annotation of their genomes allowed us to recover in each species at least 12,000 protein-coding genes and to discover duplications of genes potentially involved in prezygotic isolation like genes controlling color discrimination (L-opsin). Altogether, the assembly and the annotation of these 3 new reference genomes open new research avenues into the genomic architecture of speciation and reinforcement in sympatry, establishing Morpho butterflies as a new eco-evolutionary model.
Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10202424/pdf/
Citations: 0
MuLan-Methyl-multiple transformer-based language models for accurate DNA methylation prediction.
IF 11.8 Zone 2 Biology
GigaScience Pub Date : 2022-12-28 Epub Date: 2023-07-25 DOI: 10.1093/gigascience/giad054
Wenhuan Zeng, Anupam Gautam, Daniel H Huson
Abstract: Transformer-based language models are successfully used to address massive text-related tasks. DNA methylation is an important epigenetic mechanism, and its analysis provides valuable insights into gene regulation and biomarker identification. Several deep learning-based methods have been proposed to identify DNA methylation, and each seeks to strike a balance between computational effort and accuracy. Here, we introduce MuLan-Methyl, a deep learning framework for predicting DNA methylation sites, which is based on 5 popular transformer-based language models. The framework identifies methylation sites for 3 different types of DNA methylation: N6-adenine, N4-cytosine, and 5-hydroxymethylcytosine. Each of the employed language models is adapted to the task using the "pretrain and fine-tune" paradigm. Pretraining is performed on a custom corpus of DNA fragments and taxonomy lineages using self-supervised learning. Fine-tuning aims at predicting the DNA methylation status of each type. The 5 models are used to collectively predict the DNA methylation status. We report excellent performance of MuLan-Methyl on a benchmark dataset. Moreover, we argue that the model captures characteristic differences between different species that are relevant for methylation. This work demonstrates that language models can be successfully adapted to applications in biological sequence analysis and that joint utilization of different language models improves model performance. MuLan-Methyl is open source, and we provide a web server that implements the approach.
Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10367125/pdf/
Citations: 0
ARA: a flexible pipeline for automated exploration of NCBI SRA datasets.
IF 11.8 Zone 2 Biology
GigaScience Pub Date : 2022-12-28 Epub Date: 2023-08-17 DOI: 10.1093/gigascience/giad067
Anand Maurya, Maciej Szymanski, Wojciech M Karlowski
Abstract
Background: One of the most effective and useful methods to explore the content of biological databases is searching with nucleotide or protein sequences as a query. However, especially in the case of nucleic acids, due to the large volume of data generated by the next-generation sequencing (NGS) technologies, this approach is often not available. The hierarchical organization of the NGS records is primarily designed for browsing or text-based searches of the information provided in metadata-related keywords, limiting the efficiency of database exploration.
Findings: We developed an automated pipeline that incorporates the well-established NGS data-processing tools and procedures to allow easy and effective sampling of the NCBI SRA database records. Given a file with query nucleotide sequences, our tool estimates the matching content of SRA accessions by probing only a user-defined fraction of a record's sequences. Based on the selected parameters, it allows performing a full mapping experiment with records that meet the required criteria. The pipeline is designed to be easy to operate: it offers a fully automatic setup procedure and is fixed on tested supporting tools. The modular design and implemented usage modes allow a user to scale up the analyses into complex computational infrastructure.
Conclusions: We present an easy-to-operate and automated tool that expands the way a user can access and explore the information contained within the records deposited in the NCBI SRA database.
Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10433097/pdf/
Citations: 0