{"title":"利用半自动化管道从昆虫馆藏大规模数字化中提取高通量印刷标本标签信息","authors":"Margot Belot, Leonardo Preuss, Joël Tuberosa, Magdalena Claessen, Olha Svezhentseva, Franziska Schuster, Christian Bölling, Théo Léger","doi":"10.3897/biss.7.112466","DOIUrl":null,"url":null,"abstract":"Insects account for half of the total described living organisms on Earth, with a vast number of species awaiting description. Insects play a major role in ecosystems but are yet threatened by habitat destruction, intensive farming, and climate change. Museum collections around the world house millions of insect specimens and large-scale digitization initiatives, such as the digitization street digitize! at the Museum für Naturkunde, have been undertaken recently to unlock this data. Accurate and efficient extraction of insect specimen label information is vital for building comprehensive databases and facilitating scientific investigations, sustainability of the collected data, and efficient knowledge transfer. Despite the advancements in high-throughput imaging techniques for specimens and their labels, the process of transcribing label information remains mostly manual and lags behind the pace of digitization efforts. In order to address this issue, we propose a three step semi-automated pipeline that focuses on extracting and processing information from individual insect labels. Our solution is primarily designed for printed insect labels, as the OCR (optical character recognition) technology performs well for printed text while handwritten texts still yield mixed results. The pipeline incorporates computer vision (CV) techniques, OCR, and a clustering algorithm. The initial stage of our pipeline involves image analysis using a convolutional neural network (CNN) model. The model was trained using 2100 images from three distinct insect label datasets, namely AntWeb (ant specimen labels from various collections), Bees & Bytes (bee specimen labels from the Museum für Naturkunde), and LEP_PHIL (Lepidoptera specimen labels from the Museum für Naturkunde). The first model enables the identification and isolation of single labels within an image, effectively segmenting the label region from the rest of the image, and crops them into multiple new, single-label image files. It also assigns the labels to different classes, i.e., printed text or handwritten, with handwritten labels sorted out from the printed ones. In the second step, labels classified as “printed” are then parsed by an OCR engine to extract the text information from the labels. Tesseract and Google Vision OCRs were both tested to assess their performance. While Google Vision OCR is a cloud-based service with limited configurability, Tesseract provides the flexibility to fine-tune settings and enhance its performance for our specific use cases. In the third step, the OCR outputs are aggregated by similarity using a clustering algorithm. This step allows for the identification and formation of clusters that consist of labels sharing identical or highly similar content. Ultimately, these clusters are compared against a curated database of labels and are assigned to a known label or highlighted as new and manually added to the database. In order to assess the efficiency of our pipeline, we performed benchmarking experiments using a set of images similar to those the models were trained on, as well as additional image sets obtained from various museum collections. 
Our pipeline offers several advantages, streamlining the data entry process, and reducing manual extraction time and effort, while also minimizing potential human errors and inconsistencies in label transcription. The pipeline holds the promise of accelerating metadata extraction from insect specimens, promoting scientific research and enabling large-scale analyses to achieve a more profound understanding of the collections.","PeriodicalId":9011,"journal":{"name":"Biodiversity Information Science and Standards","volume":"11 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"High Throughput Information Extraction of Printed Specimen Labels from Large-Scale Digitization of Entomological Collections using a Semi-Automated Pipeline\",\"authors\":\"Margot Belot, Leonardo Preuss, Joël Tuberosa, Magdalena Claessen, Olha Svezhentseva, Franziska Schuster, Christian Bölling, Théo Léger\",\"doi\":\"10.3897/biss.7.112466\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Insects account for half of the total described living organisms on Earth, with a vast number of species awaiting description. Insects play a major role in ecosystems but are yet threatened by habitat destruction, intensive farming, and climate change. Museum collections around the world house millions of insect specimens and large-scale digitization initiatives, such as the digitization street digitize! at the Museum für Naturkunde, have been undertaken recently to unlock this data. Accurate and efficient extraction of insect specimen label information is vital for building comprehensive databases and facilitating scientific investigations, sustainability of the collected data, and efficient knowledge transfer. Despite the advancements in high-throughput imaging techniques for specimens and their labels, the process of transcribing label information remains mostly manual and lags behind the pace of digitization efforts. In order to address this issue, we propose a three step semi-automated pipeline that focuses on extracting and processing information from individual insect labels. Our solution is primarily designed for printed insect labels, as the OCR (optical character recognition) technology performs well for printed text while handwritten texts still yield mixed results. The pipeline incorporates computer vision (CV) techniques, OCR, and a clustering algorithm. The initial stage of our pipeline involves image analysis using a convolutional neural network (CNN) model. The model was trained using 2100 images from three distinct insect label datasets, namely AntWeb (ant specimen labels from various collections), Bees & Bytes (bee specimen labels from the Museum für Naturkunde), and LEP_PHIL (Lepidoptera specimen labels from the Museum für Naturkunde). The first model enables the identification and isolation of single labels within an image, effectively segmenting the label region from the rest of the image, and crops them into multiple new, single-label image files. It also assigns the labels to different classes, i.e., printed text or handwritten, with handwritten labels sorted out from the printed ones. In the second step, labels classified as “printed” are then parsed by an OCR engine to extract the text information from the labels. Tesseract and Google Vision OCRs were both tested to assess their performance. 
While Google Vision OCR is a cloud-based service with limited configurability, Tesseract provides the flexibility to fine-tune settings and enhance its performance for our specific use cases. In the third step, the OCR outputs are aggregated by similarity using a clustering algorithm. This step allows for the identification and formation of clusters that consist of labels sharing identical or highly similar content. Ultimately, these clusters are compared against a curated database of labels and are assigned to a known label or highlighted as new and manually added to the database. In order to assess the efficiency of our pipeline, we performed benchmarking experiments using a set of images similar to those the models were trained on, as well as additional image sets obtained from various museum collections. Our pipeline offers several advantages, streamlining the data entry process, and reducing manual extraction time and effort, while also minimizing potential human errors and inconsistencies in label transcription. The pipeline holds the promise of accelerating metadata extraction from insect specimens, promoting scientific research and enabling large-scale analyses to achieve a more profound understanding of the collections.\",\"PeriodicalId\":9011,\"journal\":{\"name\":\"Biodiversity Information Science and Standards\",\"volume\":\"11 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-09-11\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Biodiversity Information Science and Standards\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.3897/biss.7.112466\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Biodiversity Information Science and Standards","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3897/biss.7.112466","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
摘要
昆虫占地球上已被描述的生物总数的一半,还有大量的物种有待描述。昆虫在生态系统中发挥着重要作用,但仍受到栖息地破坏、集约化农业和气候变化的威胁。世界各地的博物馆收藏了数以百万计的昆虫标本,并开展了大规模的数字化举措,如数字化街道!最近在自然之光博物馆(Museum r Naturkunde)进行了一项研究,以解锁这些数据。准确、高效地提取昆虫标本标签信息对于建立全面的数据库、促进科学调查、收集数据的可持续性和有效的知识转移至关重要。尽管标本及其标签的高通量成像技术取得了进步,但转录标签信息的过程仍然主要是手动的,落后于数字化工作的步伐。为了解决这一问题,我们提出了一个三步半自动化流水线,重点从单个昆虫标签中提取和处理信息。我们的解决方案主要是为印刷昆虫标签设计的,因为OCR(光学字符识别)技术对印刷文本表现良好,而手写文本仍然产生混合结果。该管道结合了计算机视觉(CV)技术、OCR和聚类算法。我们的管道的初始阶段包括使用卷积神经网络(CNN)模型进行图像分析。该模型使用来自三个不同昆虫标签数据集的2100幅图像进行训练,即AntWeb(来自各种收集的蚂蚁标本标签),Bees和amp;Bytes(来自Museum fr Naturkunde的蜜蜂标本标签)和LEP_PHIL(来自Museum fr Naturkunde的鳞翅目标本标签)。第一个模型能够识别和隔离图像中的单个标签,有效地将标签区域与图像的其余部分分割开来,并将它们裁剪成多个新的单标签图像文件。它还将标签分配给不同的类别,即印刷文本或手写文本,手写标签从印刷标签中分类。在第二步中,分类为“打印”的标签随后由OCR引擎解析,以从标签中提取文本信息。我们对Tesseract和Google Vision ocr进行了测试,以评估它们的性能。虽然谷歌视觉OCR是一种基于云的服务,可配置性有限,但Tesseract提供了微调设置的灵活性,并为我们的特定用例提高了性能。在第三步中,使用聚类算法根据相似性对OCR输出进行聚合。这一步允许识别和形成由共享相同或高度相似内容的标签组成的集群。最后,将这些集群与精心策划的标签数据库进行比较,并将它们分配给已知的标签,或者突出显示为新的标签,并手动添加到数据库中。为了评估我们的管道的效率,我们使用一组与模型训练的图像相似的图像进行基准测试实验,以及从各种博物馆藏品中获得的额外图像集。我们的管道提供了几个优势,简化了数据输入过程,减少了人工提取的时间和精力,同时也最大限度地减少了标签转录中潜在的人为错误和不一致。该管道有望加速从昆虫标本中提取元数据,促进科学研究,并使大规模分析能够更深入地了解这些标本。
High Throughput Information Extraction of Printed Specimen Labels from Large-Scale Digitization of Entomological Collections using a Semi-Automated Pipeline
Insects account for half of all described living organisms on Earth, with a vast number of species still awaiting description. Insects play a major role in ecosystems but are nonetheless threatened by habitat destruction, intensive farming, and climate change. Museum collections around the world house millions of insect specimens, and large-scale digitization initiatives, such as the digitization street "digitize!" at the Museum für Naturkunde, have recently been undertaken to unlock these data. Accurate and efficient extraction of insect specimen label information is vital for building comprehensive databases, facilitating scientific investigation, sustaining the collected data, and transferring knowledge efficiently. Despite advances in high-throughput imaging of specimens and their labels, transcription of label information remains mostly manual and lags behind the pace of digitization.

To address this issue, we propose a three-step semi-automated pipeline that extracts and processes information from individual insect labels. Our solution is primarily designed for printed labels, since OCR (optical character recognition) performs well on printed text, while handwritten text still yields mixed results. The pipeline combines computer vision (CV) techniques, OCR, and a clustering algorithm; illustrative sketches of each step are given after the abstract.

The initial stage of the pipeline is image analysis with a convolutional neural network (CNN) model trained on 2,100 images from three distinct insect label datasets: AntWeb (ant specimen labels from various collections), Bees & Bytes (bee specimen labels from the Museum für Naturkunde), and LEP_PHIL (Lepidoptera specimen labels from the Museum für Naturkunde). This model identifies and isolates single labels within an image, segmenting each label region from the rest of the image and cropping it into a new, single-label image file. It also assigns each label to a class, printed or handwritten, so that handwritten labels are sorted out from the printed ones.

In the second step, labels classified as "printed" are parsed by an OCR engine to extract their text. Tesseract and Google Vision OCR were both tested to assess their performance: Google Vision OCR is a cloud-based service with limited configurability, whereas Tesseract can be fine-tuned to enhance its performance for our specific use cases.

In the third step, the OCR outputs are aggregated by similarity using a clustering algorithm, which identifies and forms clusters of labels with identical or highly similar content. Each cluster is then compared against a curated database of labels and either assigned to a known label or flagged as new and added to the database manually.

To assess the efficiency of the pipeline, we performed benchmarking experiments on a set of images similar to those the models were trained on, as well as on additional image sets obtained from various museum collections. Our pipeline streamlines the data entry process and reduces manual extraction time and effort, while also minimizing potential human error and inconsistency in label transcription.
The pipeline holds the promise of accelerating metadata extraction from insect specimens, supporting scientific research, and enabling large-scale analyses that yield a deeper understanding of the collections.
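
The abstract does not include implementation code. As a rough illustration of the first step, the sketch below uses a torchvision Faster R-CNN detector to locate labels, crop them to single-label files, and separate printed from handwritten crops. The checkpoint name `label_detector.pt`, the class indices, and the score threshold are hypothetical, not the authors' published model.

```python
# Sketch of step 1: detect individual labels in a specimen image, classify
# each as printed or handwritten, and crop them into single-label files.
# Assumes a Faster R-CNN checkpoint fine-tuned on the AntWeb, Bees & Bytes,
# and LEP_PHIL datasets; "label_detector.pt" and the class map are hypothetical.
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

CLASSES = {1: "printed", 2: "handwritten"}  # assumed class indices

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(
    weights=None, num_classes=3)  # background + two label classes
model.load_state_dict(torch.load("label_detector.pt", map_location="cpu"))
model.eval()

def crop_labels(image_path: str, score_threshold: float = 0.7) -> list[str]:
    """Detect labels, save each crop to its own file, return printed-label paths."""
    image = Image.open(image_path).convert("RGB")
    with torch.no_grad():
        pred = model([to_tensor(image)])[0]
    printed_crops = []
    for i, (box, label, score) in enumerate(
            zip(pred["boxes"], pred["labels"], pred["scores"])):
        if score < score_threshold:
            continue  # discard low-confidence detections
        crop = image.crop(tuple(int(c) for c in box.tolist()))
        out_path = f"{image_path}_label{i}_{CLASSES[int(label)]}.png"
        crop.save(out_path)
        if CLASSES[int(label)] == "printed":
            printed_crops.append(out_path)
    return printed_crops
```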
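For the second step, a minimal sketch of the OCR pass with Tesseract via the pytesseract wrapper follows. The abstract notes that Tesseract was fine-tuned for the authors' use cases but does not give the settings, so the page-segmentation mode and language pack choice here are assumptions.

```python
# Sketch of step 2: run Tesseract OCR over the cropped, printed-label images.
# Requires a local Tesseract installation; the --psm mode and the eng+deu
# language choice (plausible for Museum für Naturkunde labels) are assumptions.
import pytesseract
from PIL import Image

def ocr_labels(crop_paths: list[str]) -> dict[str, str]:
    """Return a mapping from crop file path to its OCR transcription."""
    texts = {}
    for path in crop_paths:
        img = Image.open(path)
        # PSM 6 treats the crop as a single uniform block of text,
        # which suits small rectangular specimen labels.
        texts[path] = pytesseract.image_to_string(
            img, lang="eng+deu", config="--psm 6").strip()
    return texts
```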
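For the third step, the abstract names only "a clustering algorithm". One plausible realization is sketched below: character n-gram TF-IDF vectors clustered with DBSCAN, then fuzzy matching of each cluster's representative transcript against a curated label list. The algorithm choice, the eps value, and the match cutoff are all illustrative.

```python
# Sketch of step 3: group OCR transcriptions by similarity, then match each
# cluster against a curated label database. DBSCAN over character n-gram
# TF-IDF vectors is one plausible choice; the abstract does not name the
# algorithm, and the thresholds here are illustrative.
import difflib
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_and_match(transcripts: list[str],
                      curated_labels: list[str]) -> dict[str, str]:
    """Map each cluster's representative transcript to a curated label or flag it as new."""
    vectors = TfidfVectorizer(
        analyzer="char_wb", ngram_range=(2, 4)).fit_transform(transcripts)
    # min_samples=1 keeps every transcript, so no label is discarded as noise;
    # eps is tuned so near-identical transcriptions fall into one cluster.
    clustering = DBSCAN(eps=0.25, min_samples=1, metric="cosine").fit(vectors)
    results = {}
    for cluster_id in set(clustering.labels_):
        members = [t for t, c in zip(transcripts, clustering.labels_)
                   if c == cluster_id]
        representative = max(members, key=len)  # longest transcript as proxy
        match = difflib.get_close_matches(
            representative, curated_labels, n=1, cutoff=0.8)
        results[representative] = match[0] if match else "NEW LABEL (review manually)"
    return results
```

Taking the longest member as the cluster representative is a simplification; a consensus string built across cluster members would likely be more robust to per-image OCR noise.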