生物文献中标题图挖掘的结构化对应主题模型。

KDD : proceedings. International Conference on Knowledge Discovery & Data Mining Pub Date : 2009-01-01 DOI:10.1145/1557019.1557031

Amr Ahmed, Eric P Xing, William W Cohen, Robert F Murphy

{"title":"生物文献中标题图挖掘的结构化对应主题模型。","authors":"Amr Ahmed, Eric P Xing, William W Cohen, Robert F Murphy","doi":"10.1145/1557019.1557031","DOIUrl":null,"url":null,"abstract":"A major source of information (often the most crucial and informative part) in scholarly articles from scientific journals, proceedings and books are the figures that directly provide images and other graphical illustrations of key experimental results and other scientific contents. In biological articles, a typical figure often comprises multiple panels, accompanied by either scoped or global captioned text. Moreover, the text in the caption contains important semantic entities such as protein names, gene ontology, tissues labels, etc., relevant to the images in the figure. Due to the avalanche of biological literature in recent years, and increasing popularity of various bio-imaging techniques, automatic retrieval and summarization of biological information from literature figures has emerged as a major unsolved challenge in computational knowledge extraction and management in the life science. We present a new structured probabilistic topic model built on a realistic figure generation scheme to model the structurally annotated biological figures, and we derive an efficient inference algorithm based on collapsed Gibbs sampling for information retrieval and visualization. The resulting program constitutes one of the key IR engines in our SLIF system that has recently entered the final round (4 out 70 competing systems) of the Elsevier Grand Challenge on Knowledge Enhancement in the Life Science. Here we present various evaluations on a number of data mining tasks to illustrate our method.","PeriodicalId":74037,"journal":{"name":"KDD : proceedings. International Conference on Knowledge Discovery & Data Mining","volume":"2009 ","pages":"39-48"},"PeriodicalIF":0.0000,"publicationDate":"2009-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1145/1557019.1557031","citationCount":"38","resultStr":"{\"title\":\"Structured Correspondence Topic Models for Mining Captioned Figures in Biological Literature.\",\"authors\":\"Amr Ahmed, Eric P Xing, William W Cohen, Robert F Murphy\",\"doi\":\"10.1145/1557019.1557031\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"A major source of information (often the most crucial and informative part) in scholarly articles from scientific journals, proceedings and books are the figures that directly provide images and other graphical illustrations of key experimental results and other scientific contents. In biological articles, a typical figure often comprises multiple panels, accompanied by either scoped or global captioned text. Moreover, the text in the caption contains important semantic entities such as protein names, gene ontology, tissues labels, etc., relevant to the images in the figure. Due to the avalanche of biological literature in recent years, and increasing popularity of various bio-imaging techniques, automatic retrieval and summarization of biological information from literature figures has emerged as a major unsolved challenge in computational knowledge extraction and management in the life science. We present a new structured probabilistic topic model built on a realistic figure generation scheme to model the structurally annotated biological figures, and we derive an efficient inference algorithm based on collapsed Gibbs sampling for information retrieval and visualization. The resulting program constitutes one of the key IR engines in our SLIF system that has recently entered the final round (4 out 70 competing systems) of the Elsevier Grand Challenge on Knowledge Enhancement in the Life Science. Here we present various evaluations on a number of data mining tasks to illustrate our method.\",\"PeriodicalId\":74037,\"journal\":{\"name\":\"KDD : proceedings. International Conference on Knowledge Discovery & Data Mining\",\"volume\":\"2009 \",\"pages\":\"39-48\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2009-01-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://sci-hub-pdf.com/10.1145/1557019.1557031\",\"citationCount\":\"38\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"KDD : proceedings. International Conference on Knowledge Discovery & Data Mining\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/1557019.1557031\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"KDD : proceedings. International Conference on Knowledge Discovery & Data Mining","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/1557019.1557031","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 38

摘要

来自科学期刊、会议记录和书籍的学术文章中的主要信息来源(通常是最关键和最翔实的部分)是直接提供关键实验结果和其他科学内容的图像和其他图形插图的数字。在生物学文章中，一个典型的图形通常由多个面板组成，并附有范围或全局标题文本。此外，标题中的文本包含与图中图像相关的重要语义实体，如蛋白质名称、基因本体、组织标签等。近年来，由于生物文献的大量涌现和各种生物成像技术的日益普及，从文献数据中自动检索和总结生物信息已成为生命科学计算知识提取和管理的一个重大挑战。本文提出了一种基于真实图形生成方案的结构化概率主题模型，用于结构化注释生物图形的建模，并推导了一种基于折叠Gibbs抽样的高效推理算法，用于信息检索和可视化。由此产生的程序构成了我们SLIF系统的关键IR引擎之一，该系统最近进入了爱思唯尔生命科学知识增强大挑战的最后一轮(70个竞争系统中的4个)。在这里，我们展示了对许多数据挖掘任务的各种评估，以说明我们的方法。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Structured Correspondence Topic Models for Mining Captioned Figures in Biological Literature.

A major source of information (often the most crucial and informative part) in scholarly articles from scientific journals, proceedings and books are the figures that directly provide images and other graphical illustrations of key experimental results and other scientific contents. In biological articles, a typical figure often comprises multiple panels, accompanied by either scoped or global captioned text. Moreover, the text in the caption contains important semantic entities such as protein names, gene ontology, tissues labels, etc., relevant to the images in the figure. Due to the avalanche of biological literature in recent years, and increasing popularity of various bio-imaging techniques, automatic retrieval and summarization of biological information from literature figures has emerged as a major unsolved challenge in computational knowledge extraction and management in the life science. We present a new structured probabilistic topic model built on a realistic figure generation scheme to model the structurally annotated biological figures, and we derive an efficient inference algorithm based on collapsed Gibbs sampling for information retrieval and visualization. The resulting program constitutes one of the key IR engines in our SLIF system that has recently entered the final round (4 out 70 competing systems) of the Elsevier Grand Challenge on Knowledge Enhancement in the Life Science. Here we present various evaluations on a number of data mining tasks to illustrate our method.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

KDD : proceedings. International Conference on Knowledge Discovery & Data Mining

自引率

0.00%

发文量