PhyloParser: A Hybrid Algorithm for Extracting Phylogenies from Dendrograms

2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR) Pub Date : 2017-11-01 DOI:10.1109/ICDAR.2017.180

Po-Shen Lee, Sean T. Yang, Jevin D. West, Bill Howe


Citations: 7

Abstract

We consider a new approach to extracting information from dendrograms that represent phylogenetic trees in the biological literature. Existing algorithmic approaches to extracting these relationships rely on tracing tree contours and are very sensitive to image quality, while manual approaches require significant human effort and cannot be used at scale. We introduce PhyloParser, a fully automated, end-to-end system that extracts species relationships from phylogenetic tree diagrams, using a multi-modal approach to handle diverse tree styles. Our approach identifies phylogenetic tree figures in the scientific literature, extracts the key components of tree structure, reconstructs the tree, and recovers the species relationships. We use multiple methods to extract tree components with high recall, then filter false positives by applying topological heuristics about how these components fit together. We present an evaluation on a real-world dataset to demonstrate the efficacy of our approach both quantitatively and qualitatively. Our classifier achieves 89% recall and 99% precision, with a low average error rate relative to previous approaches. We aim to use PhyloParser to build a linked, open, comprehensive database of phylogenetic information that covers the historical literature as well as current data, and then to use this resource to identify areas of disagreement and poor coverage in the biological literature.
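The abstract's "extract with high recall, then filter with topological heuristics" idea can be illustrated with a minimal sketch. This is not the authors' implementation: the `Segment` type, the two detector outputs, and the specific rule (a horizontal branch is kept only if one of its endpoints touches a vertical line) are all assumptions chosen for illustration, and the sketch assumes axis-aligned segments only.

```python
from typing import Iterable, NamedTuple, Set


class Segment(NamedTuple):
    """Hypothetical axis-aligned line segment in pixel coordinates."""
    x1: int
    y1: int
    x2: int
    y2: int

    @property
    def is_horizontal(self) -> bool:
        return self.y1 == self.y2


def touches(h: Segment, v: Segment, tol: int = 2) -> bool:
    """True if an endpoint of horizontal h lies on vertical v, within tol pixels."""
    y_ok = min(v.y1, v.y2) - tol <= h.y1 <= max(v.y1, v.y2) + tol
    x_ok = any(abs(x - v.x1) <= tol for x in (h.x1, h.x2))
    return y_ok and x_ok


def merge_candidates(*detections: Iterable[Segment]) -> Set[Segment]:
    """Union the outputs of several detectors to favour recall over precision."""
    merged: Set[Segment] = set()
    for segments in detections:
        merged.update(segments)
    return merged


def filter_by_topology(segments: Iterable[Segment], tol: int = 2) -> Set[Segment]:
    """Drop candidates that do not connect to the rest of the tree (assumed rule)."""
    segments = list(segments)
    horizontals = [s for s in segments if s.is_horizontal]
    verticals = [s for s in segments if not s.is_horizontal]
    kept_h = [h for h in horizontals if any(touches(h, v, tol) for v in verticals)]
    kept_v = [v for v in verticals if any(touches(h, v, tol) for h in kept_h)]
    return set(kept_h) | set(kept_v)


# Two hypothetical detectors: the second contributes a stray line (for
# example, an underline beneath a label) that is not part of the tree.
detector_a = [
    Segment(10, 10, 10, 50),  # vertical line joining two branches
    Segment(10, 10, 40, 10),  # upper branch
    Segment(10, 50, 40, 50),  # lower branch
]
detector_b = [
    Segment(10, 10, 40, 10),  # duplicate of the upper branch
    Segment(70, 80, 90, 80),  # stray line, disconnected from the tree
]

merged = merge_candidates(detector_a, detector_b)
kept = filter_by_topology(merged)
```

Taking the union first means a component missed by one detector can still be recovered by another, and the connectivity check then removes candidates that cannot belong to a tree. A real system would need more than this single rule, for instance to handle slanted or curved branches.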
