AI-Accelerated Digitisation of Insect Collections: The next generation of Angled Label Image Capture Equipment (ALICE)
Arianna Salili-James, Ben Scott, Laurence Livermore, Ben Price, Steen Dupont, Helen Hardy, Vincent Smith
Biodiversity Information Science and Standards, published 2023-09-15
DOI: 10.3897/biss.7.112742 (https://doi.org/10.3897/biss.7.112742)
Abstract
The digitisation of natural science specimens is a shared ambition of many of the largest collections, but the scale of these collections, estimated to total at least 1.1 billion specimens (Johnson et al. 2023), continues to challenge even the most resource-rich organisations. The Natural History Museum, London (NHM) has been pioneering work to accelerate the digitisation of its 80 million specimens. Since the inception of the NHM Digital Collection Programme in 2014, more than 5.5 million specimen records have been made digitally accessible. This has enabled the museum to deliver a tenfold increase in its digitisation rate compared with 2008, when the NHM first measured it. Even with this investment, it will take circa 150 years to digitise the remaining collections, leading the museum to pursue technology-led solutions alongside increased funding to deliver the next increase in digitisation rate.

Insects comprise approximately half of all described species and represent more than one-third (c. 30 million specimens) of the NHM's overall collection. Their most common preservation method, in which the specimen is attached to a pin alongside a series of labels carrying metadata, makes insect specimens challenging to digitise. Early Artificial Intelligence (AI)-led innovations (Price et al. 2018) resulted in the development of ALICE, the museum's Angled Label Image Capture Equipment, a multi-camera setup that captures a series of partial views of a pinned specimen and its labels. Centred around the pin, these images can be digitally combined and reconstructed, using the accompanying ALICE software, to provide a clean image of each label. To do this, a Convolutional Neural Network (CNN) model first locates all labels within the images. Image processing tools then transform each label view into a flat, two-dimensional viewpoint, align the views of the same label, and merge them into a single label image.
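The final merge step described above can be illustrated with a minimal sketch. It assumes the views of one label have already been located, flattened, and aligned to a common pixel grid, and that pixels hidden by the pin, the specimen, or a neighbouring label are marked as `None`. The per-pixel median used here is an assumption chosen for illustration; the abstract does not state how the ALICE software merges views, and `merge_label_views` is a hypothetical name.

```python
from statistics import median


def merge_label_views(views):
    """Merge aligned views of one label into a single image.

    Each view is a 2-D list of grey values (0-255). None marks a pixel
    that was hidden in that view. Hypothetical helper, not ALICE code.
    """
    if not views:
        raise ValueError("need at least one view")
    height, width = len(views[0]), len(views[0][0])
    for view in views:
        if len(view) != height or any(len(row) != width for row in view):
            raise ValueError("all views must share the same shape")

    merged = []
    for y in range(height):
        row = []
        for x in range(width):
            # Collect the samples from every camera that saw this pixel.
            visible = [v[y][x] for v in views if v[y][x] is not None]
            # Median of the visible samples; None if no camera saw it.
            row.append(median(visible) if visible else None)
        merged.append(row)
    return merged


# Toy example: a 2x3 "label" seen by three cameras, each partly occluded.
truth = [[10, 200, 30], [40, 50, 220]]
view_a = [[None, 200, 30], [40, 50, 220]]
view_b = [[10, None, 30], [40, 52, 220]]   # one slightly noisy pixel
view_c = [[10, 200, None], [40, 50, None]]

merged = merge_label_views([view_a, view_b, view_c])
print(merged == truth)
```

A median tolerates one outlying sample per pixel when three views overlap, which is why the noisy value in `view_b` does not reach the merged image. With only two visible samples the median is their mean, so noise is averaged rather than rejected.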
The processed label images allow users to extract label data manually or computationally, e.g., using Optical Character Recognition (OCR) tools (Salili-James et al. 2022). With the ALICE setup, a user can image an average of 800 specimens per day and, exceptionally, up to 1,300. This compares with an average of 250 specimens or fewer per day using more traditional methods, which involve removing the labels from the pin and photographing them separately. Despite this, our original version of ALICE was only suited to a small subset of the collection: when the specimen is very large, there are too many labels, or the labels are too close together, ALICE fails (Dupont and Price 2019).

Using a combination of updated AI processing tools, we present ALICE version 2. This new version provides faster rates, improved software accuracy, and a more streamlined pipeline. It includes the following updates:

Hardware: after conducting various tests, we have optimised the camera setup. Further hardware updates include a Light-Emitting Diode (LED) ring light, as well as modifications to the camera mounting.
Software: our latest software incorporates machine learning and other computer vision tools to segment labels from ALICE images and stitch them together more quickly and with a higher level of accuracy, significantly reducing the image processing failure rate. These processed label images can be combined with the latest OCR tools for automatic transcription and data segmentation.
Buildkit: we aim to provide a toolkit that any individual or institution can incorporate into their digitisation pipeline. This includes hardware instructions, an extensive guide detailing the pipeline, and new software code accessible via GitHub.

We provide test data and workflows to demonstrate the potential of ALICE version 2 as an effective, accessible, and cost-saving solution to digitising pinned insect specimens. We also describe potential modifications that would enable it to work with other types of specimens.
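To put the quoted daily rates in perspective, the following sketch compares them as back-of-the-envelope arithmetic. The rates (250, 800, and 1,300 specimens per day) and the figure of c. 30 million insect specimens come from the abstract. The 220 working days per year, the single imaging station, and the function name `station_years` are assumptions made only for illustration. The abstract also notes that ALICE suits only part of the collection, so these numbers are an upper-level illustration and not a projection.

```python
def station_years(specimens, per_day, working_days_per_year=220, stations=1):
    """Years needed to image `specimens` at a steady daily rate.

    working_days_per_year and stations are illustrative assumptions.
    """
    if per_day <= 0 or working_days_per_year <= 0 or stations <= 0:
        raise ValueError("rates, days and station count must be positive")
    return specimens / (per_day * working_days_per_year * stations)


# Daily rates quoted in the abstract.
RATES = {
    "traditional (upper bound)": 250,
    "ALICE average": 800,
    "ALICE exceptional": 1300,
}
INSECT_SPECIMENS = 30_000_000  # c. 30 million, from the abstract

baseline = RATES["traditional (upper bound)"]
for name, rate in RATES.items():
    speedup = rate / baseline
    years = station_years(INSECT_SPECIMENS, rate)
    print(f"{name:26s} {rate:5d}/day  x{speedup:.1f}  {years:7.1f} station-years")
```

Under these assumptions the average ALICE rate is 3.2 times the traditional upper bound, and adding stations divides the required time proportionally.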