Data Fusion for Integrative Species Identification Using Deep Learning

IF 5.7 | CAS Zone 1, Biology | JCR Q1, Evolutionary Biology

Systematic Biology | Pub Date: 2025-06-13 | DOI: 10.1093/sysbio/syaf026

Lara M Kösters, Kevin Karbstein, Martin Hofmann, Ladislav Hodač, Patrick Mäder, Jana Wäldchen


Citations: 0

Abstract


Data Fusion for Integrative Species Identification Using Deep Learning

DNA analyses have revolutionized species identification and taxonomic work. Yet, persistent challenges arise from little differentiation among and considerable variation within species, particularly among closely related groups. While images are commonly used as an alternative modality for automated identification tasks, their usability is limited by the same concerns. An integrative strategy, fusing molecular and image data through machine learning, holds significant promise for fine-grained species identification. However, a systematic overview and rigorous statistical testing concerning molecular and image preprocessing and fusion techniques, including practical advice for biologists, are missing so far. We introduce a machine learning scheme that integrates both molecular and image data for species identification. Initially, we systematically assess and compare three different DNA arrangements (aligned, unaligned, SNP-reduced) and two encoding methods (fractional, ordinal). Additionally, artificial neural networks are used to extract visual and molecular features, and we propose strategies for fusing this information. Specifically, we investigate three strategies: I) fusing directly after feature extraction, II) fusing features that passed through a fully connected layer after feature extraction, and III) fusing the output scores of both unimodal models. We systematically and statistically evaluate these strategies for four eukaryotic datasets, including two plant (Asteraceae, Poaceae) and two animal families (Lycaenidae, Coccinellidae) using Leave-One-Out Cross-Validation (LOOCV). In addition, we developed an approach to understand molecular- and image-specific identification failure. Aligned sequences with nucleotides encoded as decimal number vectors achieved the highest identification accuracy among DNA data preprocessing techniques in all four datasets. 
Fusing molecular and visual features directly after feature extraction yielded the best results for three out of four datasets (52-99%). Overall, combining DNA with image data significantly increased accuracy in three out of four datasets, with plant datasets showing the most substantial improvement (Asteraceae: +19%, Poaceae: +13.6%). Even for Lycaenidae with high identification accuracy based on molecular data (>96%), a statistically significant improvement (+2.1%) was observed. Detailed analysis of confusion rates between and within genera shows that DNA alone tends to identify the genus correctly, but often fails to recognize the species. The failure to resolve species is alleviated by including image data in the training. This increase in resolution hints at a hierarchical role of modalities in which molecular data coarsely groups the specimens to then be guided towards a more fine-grained identification by the connected image. We systematically showed and explained, for the first time, that optimizing the preprocessing and integration of molecular and image data offers significant benefits, particularly for genetically similar and morphologically indistinguishable species, enhancing species identification by reducing modality-specific failure rates and information gaps. Our results can inform integration efforts for various organism groups, improving automated identification across a wide range of eukaryotic species.
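
As a rough illustration of the ideas named in the abstract, the following Python sketch encodes a DNA sequence as a vector of decimal numbers and contrasts feature-level fusion (in the spirit of strategy I) with score-level fusion (in the spirit of strategy III). It is not the authors' implementation. The nucleotide-to-number mapping, the function names, and the use of a simple average for combining scores are all assumptions made for illustration; the abstract does not specify the encoding values, the network architectures, or the fusion operator.

```python
import numpy as np

# Hypothetical mapping: one decimal number per nucleotide.
# The abstract does not state the actual values used in the study.
BASE_TO_NUMBER = {"A": 0.25, "C": 0.50, "G": 0.75, "T": 1.00}


def encode_decimal(sequence):
    """Encode a DNA string as a numeric vector; gaps and unknown bases map to 0."""
    return np.array([BASE_TO_NUMBER.get(base, 0.0) for base in sequence.upper()])


def softmax(logits):
    """Convert raw class scores to probabilities in a numerically stable way."""
    exp = np.exp(logits - np.max(logits))
    return exp / exp.sum()


def fuse_features(dna_features, image_features):
    """Feature-level fusion: concatenate both feature vectors into one."""
    return np.concatenate([dna_features, image_features])


def fuse_scores(dna_logits, image_logits):
    """Score-level fusion: average the class probabilities of both unimodal models.

    Averaging is an assumed choice; other combination rules are possible.
    """
    return (softmax(dna_logits) + softmax(image_logits)) / 2.0
```

In this sketch, the concatenated vector from `fuse_features` would feed a shared classifier, whereas `fuse_scores` combines two classifiers that were each trained on one modality. Strategy II from the abstract would sit between the two, passing each feature vector through its own fully connected layer before concatenation.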


Source journal

Systematic Biology (Biology: Evolutionary Biology)

CiteScore: 13.00

Self-citation rate: 7.70%

Articles published: not listed

Review time: 6-12 weeks

About the journal: Systematic Biology is the bimonthly journal of the Society of Systematic Biologists. Papers for the journal are original contributions to the theory, principles, and methods of systematics as well as phylogeny, evolution, morphology, biogeography, paleontology, genetics, and the classification of all living things. A Points of View section offers a forum for discussion, while book reviews and announcements of general interest are also featured.