Pre-Meta: Priors-augmented Retrieval for LLM-based Metadata Generation

Impact Factor: 5.4
Phil Tinn, Sondre Sørbø, Shanshan Jiang, Konstantinos Voutetakis, Sotiris Moudouris Giounis, Eleftherios Pilalis, Olga Papadodima, Dumitru Roman
Journal: Bioinformatics (Oxford, England)
DOI: 10.1093/bioinformatics/btaf519
Published: 2025-09-18
Citations: 0

Abstract


Motivation: While high-throughput sequencing technologies have dramatically accelerated genomic data generation, the manual processes required for dataset annotation and metadata creation impede the efficient discovery and publication of these resources across disparate public repositories. Large Language Models (LLMs) have the potential to streamline dataset profiling and discovery. However, their current limitations in generalizing across specialized knowledge domains, particularly in fields such as biomedical genomics, prevent them from fully realizing this potential. This paper presents Pre-Meta, an LLM-agnostic and domain-independent data annotation pipeline with an enriched retrieval procedure that leverages related priors, such as pre-generated metadata tags and ontologies, as auxiliary information to improve the accuracy of automated metadata generation.

Results: Validated on five selected metadata fields sampled across 1500 papers, the Pre-Meta assisted annotation experiment, without fine-tuning or prompt optimization, demonstrates a systematic improvement in the annotation task: a 23%, 72%, and 75% accuracy gain over conventional RAG adoptions of GPT-4o mini, Llama 8B, and Mistral 7B, respectively.

Availability: The code, data access, and scripts are available at https://github.com/SINTEF-SE/LLMDap.
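The core idea sketched in the abstract is that retrieval for an annotation prompt can be conditioned on known priors for each metadata field (pre-generated tags, ontology terms) rather than on the raw query alone. The following is a minimal, self-contained illustration of that pattern; all names (`TAG_PRIORS`, `retrieve_passages`, `build_prompt`) and the naive overlap-based retrieval are illustrative assumptions, not the paper's actual pipeline or API.

```python
# Illustrative priors-augmented retrieval: before asking an LLM for a
# metadata field, rank document passages by overlap with that field's
# known prior values, then include both priors and passages in the prompt.

TAG_PRIORS = {
    "organism": ["Homo sapiens", "Mus musculus"],
    "assay": ["RNA-seq", "ChIP-seq", "ATAC-seq"],
}

def retrieve_passages(document: str, field: str, k: int = 2) -> list[str]:
    """Naive retrieval: rank sentences by overlap with the field's priors."""
    priors = {t.lower() for t in TAG_PRIORS.get(field, [])}
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    scored = sorted(
        sentences,
        key=lambda s: sum(p in s.lower() for p in priors),
        reverse=True,
    )
    return scored[:k]

def build_prompt(document: str, field: str) -> str:
    """Assemble an annotation prompt whose context includes the priors."""
    passages = retrieve_passages(document, field)
    priors = ", ".join(TAG_PRIORS.get(field, []))
    return (
        f"Known values for '{field}' include: {priors}.\n"
        f"Context: {' '.join(passages)}\n"
        f"Question: what is the '{field}' of this dataset?"
    )

doc = ("We profiled liver tissue. Samples were taken from Mus musculus. "
       "Libraries were prepared for RNA-seq. Reads were aligned with STAR.")
prompt = build_prompt(doc, "assay")
print(prompt)
```

In a realistic setting the overlap scoring would be replaced by embedding similarity and the priors drawn from curated ontologies, but the prompt-assembly step (priors plus retrieved context) follows the same shape.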
