Automatic taxonomic identification based on the Fossil Image Dataset (>415,000 images) and deep convolutional neural networks

IF 2.6, Zone 2 (Earth Sciences), Q2 (Biodiversity Conservation)

Paleobiology Pub Date : 2022-06-17 DOI:10.1017/pab.2022.14

Xiaokang Liu, Shouyi Jiang, Ruiwei Wu, Wenchao Shu, Jie Hou, Y. Sun, Jiarui Sun, Daoliang Chu, Yuyang Wu, Haijun Song

Paleobiology 49 (1): 1-22. Journal Article.

Citations: 9

Abstract

The rapid and accurate taxonomic identification of fossils is of great significance in paleontology, biostratigraphy, and other fields. However, taxonomic identification is often labor-intensive and tedious, and acquiring the extensive prior knowledge needed for a taxonomic group requires long-term training. Moreover, identification results are often inconsistent across researchers and communities. Accordingly, in this study, we used deep learning to support taxonomic identification. We used web crawlers to collect the Fossil Image Dataset (FID) via the Internet, obtaining 415,339 images belonging to 50 fossil clades. Then we trained three powerful convolutional neural networks on a high-performance workstation. The Inception-ResNet-v2 architecture achieved an average accuracy of 0.90 in the test dataset when transfer learning was applied. The clades of microfossils and vertebrate fossils exhibited the highest identification accuracies of 0.95 and 0.90, respectively. In contrast, clades of sponges, bryozoans, and trace fossils with various morphologies or with few samples in the dataset exhibited a performance below 0.80. Visual explanation methods further highlighted the discrepancies among different fossil clades and suggested similarities between the identifications made by machine classifiers and taxonomists. Collecting large paleontological datasets from various sources, such as the literature, digitization of dark data, citizen-science data, and public data from the Internet, may further enhance deep learning methods and their adoption. Such developments may also lead to image-based systematic taxonomy being replaced by machine-aided classification in the future. Pioneering studies can include microfossils and some invertebrate fossils. To contribute to this development, we deployed our model on a server for public access at www.ai-fossil.com.
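The abstract reports both an average accuracy over the test set and separate accuracies for individual clades. As a minimal sketch of how such figures can be derived from predicted and true labels, the Python below computes per-class accuracy (the fraction of each class's samples labeled correctly), overall accuracy, and an unweighted mean over classes. The function names and the toy labels are hypothetical, and the abstract does not say which averaging the authors used, so this is an illustration of the general idea rather than the paper's evaluation code.

```python
from collections import Counter


def per_class_accuracy(y_true, y_pred):
    """Map each true label to the fraction of its samples predicted correctly."""
    totals = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / totals[label] for label in totals}


def overall_accuracy(y_true, y_pred):
    """Fraction of all samples predicted correctly (large classes weigh more)."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_accuracy(y_true, y_pred):
    """Unweighted mean of the per-class accuracies (every class counts equally)."""
    per_class = per_class_accuracy(y_true, y_pred)
    return sum(per_class.values()) / len(per_class)


# Toy, made-up labels: a well-sampled class and a sparsely sampled one.
y_true = ["foraminifer"] * 4 + ["sponge"] * 2
y_pred = ["foraminifer", "foraminifer", "foraminifer", "sponge",
          "sponge", "bryozoan"]

print(per_class_accuracy(y_true, y_pred))
print(overall_accuracy(y_true, y_pred))
print(macro_accuracy(y_true, y_pred))
```

When class sizes are unbalanced, as with the few-sample clades mentioned above, the overall and class-averaged figures can differ, which is why it helps to report per-clade values alongside a single average.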


Source journal

Paleobiology (Earth Sciences: Paleontology)

CiteScore: 5.30

Self-citation rate: 3.70%

Review time: >12 weeks

About the journal: Paleobiology publishes original contributions of any length (but normally 10-50 manuscript pages) dealing with any aspect of biological paleontology. Emphasis is placed on biological or paleobiological processes and patterns, including macroevolution, extinction, diversification, speciation, functional morphology, biogeography, phylogeny, paleoecology, molecular paleontology, taphonomy, natural selection and patterns of variation, abundance, and distribution in space and time, among others. Taxonomic papers are welcome if they have significant and broad applications. Papers concerning research on recent organisms and systems are appropriate if they are of particular interest to paleontologists. Papers should typically interest readers from more than one specialty. Proposals for symposium volumes should be discussed in advance with the editors.