Robel Kahsay, Urnisha Bhuiyan, Cyrus Chun Hong Au, Nathan Edwards, Luke Johnson, Sujeet Kulkarni, Karina Martinez, Rene Ranzinger, K Vijay-Shanker, Jeet Vora, Kate Warner, Michael Tiemeyer, Raja Mazumder
Glycobiology. Published 2025-06-02. DOI: 10.1093/glycob/cwaf030
GlycoSiteMiner: an ML/AI-assisted literature mining-based pipeline for extracting glycosylation sites from PubMed abstracts.
Over 50% of human proteins are estimated to be glycosylated, making glycosylation one of the most common post-translational modifications (PTMs) of proteins. A glycoinformatics resource such as the GlyGen knowledgebase, consisting of experimentally verified sequence-specific glycosylation sites, is critical for advancing research in glycobiology. Unfortunately, most experimental studies report glycosylation sites in free text format in scientific literature, mentioning gene names and amino acid positions without providing protein sequence identifiers, making it difficult to mine reported sites that can be mapped onto specific protein sequences. We have developed GlycoSiteMiner, which is an automated literature mining-based pipeline that extracts experimentally verified protein sequence-specific glycosylation sites from PubMed abstracts. The pipeline employs ML/AI algorithms to filter out incorrectly identified sites and has been applied to 33 million PubMed abstracts, identifying 1118 new sequence-specific glycosylation sites that were not previously present in the GlyGen resource.
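The mapping problem described in the abstract (a residue and position mentioned in free text must be reconciled with a specific protein sequence) can be illustrated with a small sketch. This is not the GlycoSiteMiner implementation: the regular expression, the function names (`extract_site_mentions`, `residue_matches`, `map_sites`), and the toy sequence are all assumptions made for illustration, and the ML/AI filtering step is not shown.

```python
import re

# Three-letter to one-letter codes for residues commonly reported as glycosylated.
# (Illustrative subset only; an assumption, not taken from the paper.)
THREE_TO_ONE = {"Asn": "N", "Ser": "S", "Thr": "T"}

# Matches mentions such as "Asn-297", "Asn297", "Ser 5" or "N297".
SITE_PATTERN = re.compile(r"\b(Asn|Ser|Thr|[NST])[- ]?(\d{1,5})\b")


def extract_site_mentions(text):
    """Return (one_letter_residue, position) pairs found in free text."""
    mentions = []
    for match in SITE_PATTERN.finditer(text):
        residue, position = match.group(1), int(match.group(2))
        mentions.append((THREE_TO_ONE.get(residue, residue), position))
    return mentions


def residue_matches(sequence, residue, position):
    """Check that a 1-based position in the sequence holds the reported residue."""
    return 1 <= position <= len(sequence) and sequence[position - 1] == residue


def map_sites(text, sequence):
    """Keep only the mentions that are consistent with the candidate sequence."""
    return [
        (residue, position)
        for residue, position in extract_site_mentions(text)
        if residue_matches(sequence, residue, position)
    ]


if __name__ == "__main__":
    toy_sequence = "MKNGTLSA"  # hypothetical sequence, not a real protein
    sentence = "Glycosylation was detected at Asn-3 and Thr5, but not at Asn-6."
    print(map_sites(sentence, toy_sequence))
```

A pattern this simple would also match strings that are not sites at all (for example, a protein name such as "S100"), which gives a sense of why a pipeline of this kind needs a later step to filter out incorrectly identified sites.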
About the journal:
Established as the leading journal in the field, Glycobiology provides a unique forum dedicated to research into the biological functions of glycans, including glycoproteins, glycolipids, proteoglycans and free oligosaccharides, and on proteins that specifically interact with glycans (including lectins, glycosyltransferases, and glycosidases).
Glycobiology is essential reading for researchers in biomedicine, basic science, and the biotechnology industries. By providing a single forum, the journal aims to improve communication between glycobiologists working in different disciplines and to increase the overall visibility of the field.