Biodiversity Information Science and Standards: Latest Articles

Meeting Report for the Phenoscape TraitFest 2023 with Comments on Organising Interdisciplinary Meetings
Biodiversity Information Science and Standards Pub Date : 2024-03-06 DOI: 10.3897/biss.8.115232
Jennifer C. Girón Duque, Meghan Balk, W. Dahdul, H. Lapp, István Mikó, Elie Alhajjar, Brenen M. Wynd, Sergei Tarasov, Christopher Lawrence, Basanta Khakurel, Arthur Porto, Lin Yan, Isadora E Fluck, D. Porto, Joseph Keating, I. Borokini, Katja Seltmann, G. Montanaro, Paula M. Mabee
{"title":"Meeting Report for the Phenoscape TraitFest 2023 with Comments on Organising Interdisciplinary Meetings","authors":"Jennifer C. Girón Duque, Meghan Balk, W. Dahdul, H. Lapp, István Mikó, Elie Alhajjar, Brenen M. Wynd, Sergei Tarasov, Christopher Lawrence, Basanta Khakurel, Arthur Porto, Lin Yan, Isadora E Fluck, D. Porto, Joseph Keating, I. Borokini, Katja Seltmann, G. Montanaro, Paula M. Mabee","doi":"10.3897/biss.8.115232","DOIUrl":"https://doi.org/10.3897/biss.8.115232","url":null,"abstract":"The Phenoscape project has developed ontology-based tools and a knowledge base that enables the integration and discovery of phenotypes across species from the scientific literature. The Phenoscape TraitFest 2023 event aimed to promote innovative applications that adopt the capabilities supported by the data in the Phenoscape Knowledgebase and its corresponding semantics-enabled tools, algorithms and infrastructure. The event brought together 26 participants, including domain experts in biodiversity informatics, taxonomy and phylogenetics and software developers from various life-sciences programming toolkits and phylogenetic software projects, for an intense four-day collaborative software coding event. The event was designed as a hands-on workshop, based on the Open Space Technology methodology, in which participants self-organise into subgroups to collaboratively plan and work on their shared research interests. 
We describe how the workshop was organised, the projects developed and outcomes resulting from the workshop, as well as the challenges in bringing together a diverse group of participants to engage productively in a collaborative environment.","PeriodicalId":9011,"journal":{"name":"Biodiversity Information Science and Standards","volume":"15 8","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-03-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140263231","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Implementation Experience Report for the Developing Latimer Core Standard: The DiSSCo Flanders use-case
Biodiversity Information Science and Standards Pub Date : 2023-11-29 DOI: 10.3897/biss.7.113766
Lissa Breugelmans, Maarten Trekels
{"title":"Implementation Experience Report for the Developing Latimer Core Standard: The DiSSCo Flanders use-case","authors":"Lissa Breugelmans, Maarten Trekels","doi":"10.3897/biss.7.113766","DOIUrl":"https://doi.org/10.3897/biss.7.113766","url":null,"abstract":"","PeriodicalId":9011,"journal":{"name":"Biodiversity Information Science and Standards","volume":"605 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2023-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139213641","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
The Future of Natural History Transcription: Navigating AI advancements with VoucherVision and the Specimen Label Transcription Project (SLTP)
Biodiversity Information Science and Standards Pub Date : 2023-09-21 DOI: 10.3897/biss.7.113067
William Weaver, Kyle Lough, Stephen Smith, Brad Ruhfel
{"title":"The Future of Natural History Transcription: Navigating AI advancements with VoucherVision and the Specimen Label Transcription Project (SLTP)","authors":"William Weaver, Kyle Lough, Stephen Smith, Brad Ruhfel","doi":"10.3897/biss.7.113067","DOIUrl":"https://doi.org/10.3897/biss.7.113067","url":null,"abstract":"Natural history collections are critical reservoirs of biodiversity information but collections staff are constantly grappling with substantial backlogs and limited resources. The task of transcribing specimen label text into searchable databases requires a significant amount of time, manual labor, and funding. To address this challenge, we introduce VoucherVision, a tool harnessing the capabilities of several Large Language Models (LLMs; Naveed et al. 2023) to augment specimen label transcription. The VoucherVision tool automates laborious components of the transcription process, leveraging an Optical Character Recognition (OCR) system and LLMs to convert unstructured label text into appropriate data formats compatible with database ingestion. VoucherVision uses a combination of structured output parsers and recursive re-prompting strategies to ensure consistency and quality of the LLM-formatted text, significantly reducing errors. Integration of VoucherVision with the University of Michigan Herbarium’s transcription workflow resulted in a significant reduction in per-image transcription time, suggesting significant potential advantages for collections workflows. VoucherVision offers promising strides towards efficient digitization, with curatorial staff playing critical roles in data quality assurance and process oversight. 
Emphasizing the importance of knowledge sharing, the University of Michigan Herbarium is backing the Specimen Label Transcription Project (SLTP), which will provide open access to benchmarking datasets, fine-tuned models, and validation tools to rank the performance of different methodologies, LLMs, and prompting strategies. In the rapidly evolving landscape of Artificial Intelligence (AI) development, we recognize the profound potential of diverse contributions and innovative methodologies to redefine and advance the transformation of curatorial practices, catalyzing an era of accelerated digitization in natural history collections. An early, public version of VoucherVision is available to try here: https://vouchervision.azurewebsites.net/","PeriodicalId":9011,"journal":{"name":"Biodiversity Information Science and Standards","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136235172","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
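The VoucherVision abstract above mentions structured output parsers combined with re-prompting. The general pattern might look like the minimal sketch below, which assumes a JSON target format. `call_llm`, `REQUIRED_FIELDS`, and the prompt wording are hypothetical stand-ins, not VoucherVision's actual interface or schema, and a plain retry loop stands in for whatever re-prompting strategy the tool really uses.

```python
import json
from typing import Callable, Optional

# Hypothetical target fields; the abstract does not list VoucherVision's schema.
REQUIRED_FIELDS = ("scientificName", "recordedBy", "eventDate")


def parse_structured(raw: str) -> Optional[dict]:
    """Return a dict if `raw` is a JSON object with every required field, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    if any(field not in data for field in REQUIRED_FIELDS):
        return None
    return data


def transcribe_label(
    ocr_text: str,
    call_llm: Callable[[str], str],  # hypothetical: takes a prompt, returns model text
    max_attempts: int = 3,
) -> Optional[dict]:
    """Prompt, parse, and re-prompt with feedback until the output parses."""
    prompt = (
        "Convert this specimen label text to a JSON object with the keys "
        f"{', '.join(REQUIRED_FIELDS)}.\nLabel text: {ocr_text}"
    )
    for _ in range(max_attempts):
        raw = call_llm(prompt)
        record = parse_structured(raw)
        if record is not None:
            return record
        # Feed the failed output back so the next attempt can correct it.
        prompt = (
            f"{prompt}\nYour previous answer could not be parsed or was missing keys:\n"
            f"{raw}\nReply with the JSON object only."
        )
    return None
```

Returning `None` after the last attempt leaves the label for manual transcription, which matches the abstract's point that curatorial staff remain responsible for quality assurance.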
No Pain No Gain: Standards mapping in Latimer Core development
Biodiversity Information Science and Standards Pub Date : 2023-09-21 DOI: 10.3897/biss.7.113053
Matt Woodburn, Jutta Buschbom, Sharon Grant, Janeen Jones, Ben Norton, Maarten Trekels, Sarah Vincent, Kate Webbink
{"title":"No Pain No Gain: Standards mapping in Latimer Core development","authors":"Matt Woodburn, Jutta Buschbom, Sharon Grant, Janeen Jones, Ben Norton, Maarten Trekels, Sarah Vincent, Kate Webbink","doi":"10.3897/biss.7.113053","DOIUrl":"https://doi.org/10.3897/biss.7.113053","url":null,"abstract":"Latimer Core (LtC) is a new proposed Biodiversity Information Standards (TDWG) data standard that supports the representation and discovery of natural science collections by structuring data about the groups of objects that those collections and their subcomponents encompass (Woodburn et al. 2022). It is designed to be applicable to a range of use cases that include high level collection registries, rich textual narratives and semantic networks of collections, as well as more granular, quantitative breakdowns of collections to aid collection discovery and digitisation planning. As a standard that is (in this first version) focused on natural science collections, LtC has significant intersections with existing data standards and models (Fig. 1) that represent individual natural science objects and occurrences and their associated data (e.g., Darwin Core (DwC), Access to Biological Collection Data (ABCD), Conceptual Reference Model of the International Committee on Documentation (CIDOC-CRM)). LtC’s scope also overlaps with standards for more generic concepts like metadata, organisations, people and activities (i.e., Dublin Core, World Wide Web Consortium (W3C) ORG Ontology and PROV Ontology, Schema.org). LtC represents just an element of this extended network of data standards for the natural sciences and related concepts. Mapping between LtC and intersecting standards is therefore crucial for avoiding duplication of effort in the standard development process, and ensuring that data stored using the different standards are as interoperable as possible in alignment with FAIR (Findable, Accessible, Interoperable, Reusable) principles. 
In particular, it is vital to make robust associations between records representing groups of objects in LtC and records (where available) that represent the objects within those groups. During LtC development, efforts were made to identify and align with relevant standards and vocabularies, and adopt existing terms from them where possible. During expert review, a more structured approach was proposed and implemented using the Simple Knowledge Organization System (SKOS) mappingRelation vocabulary. This exercise helped to better describe the nature of the mappings between new LtC terms and related terms in other standards, and to validate decisions around the borrowing of existing terms for LtC. A further exercise also used elements of the Simple Standard for Sharing Ontological Mappings (SSSOM) to start to develop a more comprehensive set of metadata around these mappings. At present, these mappings (Suppl. material 1 and Suppl. material 2) are provisional and not considered to be comprehensive, but should be further refined and expanded over time. Even with the support provided by the SKOS and SSSOM standards, the LtC experience has proven the mapping process to be far from straightforward. Different standards vary in how they are structured, for example, DwC is a ‘bag of terms’, with informal classes and no structura","PeriodicalId":9011,"journal":{"name":"Biodiversity Information Science and Standards","volume":"77 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136236194","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
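The Latimer Core abstract above describes typing term mappings with the SKOS mapping vocabulary and recording mapping metadata with elements of SSSOM. The sketch below shows one way such a mapping table could be represented and serialised. The column names follow SSSOM slot names; the term identifiers are placeholders and do not represent any actual LtC or Darwin Core mapping, and the `Mapping` class and `to_tsv` helper are illustrative names only.

```python
import csv
import io
from dataclasses import asdict, dataclass

# The five SKOS mapping properties, all of which fall under skos:mappingRelation.
SKOS_MAPPING_PREDICATES = {
    "skos:exactMatch",
    "skos:closeMatch",
    "skos:broadMatch",
    "skos:narrowMatch",
    "skos:relatedMatch",
}


@dataclass(frozen=True)
class Mapping:
    subject_id: str
    predicate_id: str
    object_id: str
    mapping_justification: str

    def __post_init__(self):
        # Reject predicates outside the SKOS mapping vocabulary.
        if self.predicate_id not in SKOS_MAPPING_PREDICATES:
            raise ValueError(f"not a SKOS mapping property: {self.predicate_id}")


def to_tsv(mappings):
    """Serialise mappings as a tab-separated table, one row per mapping."""
    buffer = io.StringIO()
    fields = ["subject_id", "predicate_id", "object_id", "mapping_justification"]
    writer = csv.DictWriter(buffer, fieldnames=fields, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    for mapping in mappings:
        writer.writerow(asdict(mapping))
    return buffer.getvalue()


# Placeholder identifiers only; these are not actual LtC or Darwin Core mappings.
example = [
    Mapping(
        "ltc:placeholderTermA",
        "skos:closeMatch",
        "dwc:placeholderTermB",
        "semapv:ManualMappingCuration",
    ),
]
print(to_tsv(example))
```

Validating the predicate at construction time keeps loosely defined relations out of the table, so each row states explicitly how close the two terms are judged to be.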
Structuring Information from Plant Morphological Descriptions using Open Information Extraction
Biodiversity Information Science and Standards Pub Date : 2023-09-21 DOI: 10.3897/biss.7.113055
Maria Mora-Cross, William Ulate, Brandon Retana Chacón, María Biarreta Portillo, Josué David Castro Ramírez, Jose Chavarria Madriz
{"title":"Structuring Information from Plant Morphological Descriptions using Open Information Extraction","authors":"Maria Mora-Cross, William Ulate, Brandon Retana Chacón, María Biarreta Portillo, Josué David Castro Ramírez, Jose Chavarria Madriz","doi":"10.3897/biss.7.113055","DOIUrl":"https://doi.org/10.3897/biss.7.113055","url":null,"abstract":"Taxonomic literature keeps records of the planet's biodiversity and gives access to the knowledge needed for research and sustainable management. The number of publications generated is quite large: the corpus of biodiversity literature includes tens of millions of figures and taxonomic treatments. Unfortunately, most of the taxonomic descriptions are from scientific publications in text format. With more than 61 million digitized pages in the Biodiversity Heritage Library (BHL), only 467,265 taxonomic treatments are available in the Biodiversity Literature Repository. To obtain highly structured texts from digitized text has been shown to be complex and very expensive (Cui et al. 2021). The scientific community has described over 1.2 million species, but studies suggest that 86% of existing species on Earth and 91% of species in the ocean still await description (Mora et al. 2011). The published descriptions synthesize observations made by taxonomists over centuries of research and include detailed morphological aspects (i.e., shape and structure) of species useful to identify specimens, to improve information search mechanisms, to perform data analysis of species having particular characteristics, and to compare species descriptions. To take full advantage of this information and to work towards integrating it with repositories of biodiversity knowledge, the biodiversity informatics community first needs to convert plain text into a machine-processable format. More precisely, there is a need to identify structures and substructure names and the characters that describe them (Fig. 1). 
Open information extraction (OIE) is a research area of Natural Language Processing (NLP), which aims to automatically extract structured, machine-readable representations of data available in unstructured text; usually the result is handled as n-ary propositions, for instance, triples of the form <noun phrase, relation phrase, noun phrase> (Shen et al. 2022). OIE is continuously evolving with advancements in NLP and machine learning techniques. The state of the art in OIE involves the use of neural approaches, pre-trained language models, and integration of dependency parsing and semantic role labeling. Neural solutions mainly formulate OIE as a sequence tagging problem or a sequence generation problem. Ongoing research focuses on improving extraction accuracy; handling complex linguistic phenomena, for instance, addressing challenges like coreference resolution; and more open information extraction, because most existing neural solutions work in English texts (Zhou et al. 2022). The main objective of this project is to evaluate and compare the results of automatic data extraction from plant morphological descriptions using pre-trained language models (PLM) and a language model trained on data from plant morphological descriptions written in Spanish. The research data for this study were sourced from the species records database of the National Biodi","PeriodicalId":9011,"journal":{"name":"Biodiversity Information Science and Standards","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136155252","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Comparative Study: Evaluating the effects of class balancing on transformer performance in the PlantNet-300k image dataset
Biodiversity Information Science and Standards Pub Date : 2023-09-21 DOI: 10.3897/biss.7.113057
José Chavarría Madriz, Maria Mora-Cross, William Ulate
{"title":"Comparative Study: Evaluating the effects of class balancing on transformer performance in the PlantNet-300k image dataset","authors":"José Chavarría Madriz, Maria Mora-Cross, William Ulate","doi":"10.3897/biss.7.113057","DOIUrl":"https://doi.org/10.3897/biss.7.113057","url":null,"abstract":"Image-based identification of plant specimens plays a crucial role in various fields such as agriculture, ecology, and biodiversity conservation. The growing interest in deep learning has led to remarkable advancements in image classification techniques, particularly with the utilization of convolutional neural networks (CNNs). Since 2015, in the context of the PlantCLEF (Conference and Labs of the Evaluation Forum) challenge (Joly et al. 2015), deep learning models, specifically CNNs, have consistently achieved the most impressive results in this field (Carranza-Rojas 2018). However, recent developments have introduced transformer-based models, such as ViT (Vision Transformer) (Dosovitskiy et al. 2020) and CvT (Convolutional vision Transformer) (Wu et al. 2021), as a promising alternative for image classification tasks. Transformers offer unique advantages such as capturing global context and handling long-range dependencies (Vaswani et al. 2017), which make them suitable for complex recognition tasks like plant identification. In this study, we focus on the image classification task using the PlantNet-300k dataset (Garcin et al. 2021a). The dataset consists of a large collection of 306,146 plant images representing 1,081 distinct species. These images were selected from the Pl@ntNet citizen observatory database. The dataset has two prominent characteristics that pose challenges for classification. First, there is a significant class imbalance, meaning that a small subset of species dominates the majority of the images. This imbalance creates bias and affects the accuracy of classification models. 
Second, many species exhibit visual similarities, making it tough, even for experts, to accurately identify them. These characteristics are referred to by the dataset authors as long-tailed distribution and high intrinsic ambiguity, respectively (Garcin et al. 2021b). In order to address the inherent challenges of the PlantNet-300k dataset, we employed a two-fold approach. Firstly, we leveraged transformer-based models to tackle the dataset's intrinsic ambiguity and effectively capture the complex visual patterns present in plant images. Secondly, we focused on mitigating the class imbalance issue through various data preprocessing techniques, specifically class balancing methods. By implementing these techniques, we aimed to ensure fair representation of all plant species in order to improve the overall performance of image classification models. Our objective is to assess the effects of data preprocessing techniques, specifically class balancing, on the classification performance of the PlantNet-300k dataset. By exploring different preprocessing methods, we addressed the class imbalance issue and through precise evaluation, conducted a comparison of the performance of transformer-based models with and without class balancing techniques. Through these efforts, our ultimate goal is to assert if these techniques allow us to achieve more accurate and rel","PeriodicalId":9011,"journal":{"name":"Biodiversity Information Science and Standards","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136236192","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
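The PlantNet-300k abstract above refers to class balancing methods for a long-tailed dataset without naming a specific one. As an illustration only, the sketch below computes inverse-frequency class weights, one common balancing technique that may or may not be among those the authors evaluated. The label list is made up and is not PlantNet-300k data.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by N / (K * n_c).

    N is the number of samples, K the number of classes,
    and n_c the number of samples in class c.
    """
    counts = Counter(labels)
    total = len(labels)
    n_classes = len(counts)
    return {cls: total / (n_classes * n) for cls, n in counts.items()}


# Toy long-tailed label list (made-up class names, not real dataset labels).
labels = ["common"] * 8 + ["uncommon"] * 3 + ["rare"] * 1
weights = inverse_frequency_weights(labels)
for cls, weight in sorted(weights.items(), key=lambda item: item[1]):
    print(f"{cls}: {weight:.2f}")
```

With this weighting every class contributes the same total weight (here 4.0 each), so a loss function scaled by these weights no longer favours the dominant species. Resampling the training set is an alternative route to the same goal.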
Filling Gaps in Earthworm Digital Diversity in Northern Eurasia from Russian-language Literature
Biodiversity Information Science and Standards Pub Date : 2023-09-20 DOI: 10.3897/biss.7.112957
Maxim Shashkov, Natalya Ivanova, Sergey Ermolov
{"title":"Filling Gaps in Earthworm Digital Diversity in Northern Eurasia from Russian-language Literature","authors":"Maxim Shashkov, Natalya Ivanova, Sergey Ermolov","doi":"10.3897/biss.7.112957","DOIUrl":"https://doi.org/10.3897/biss.7.112957","url":null,"abstract":"Data availability for certain groups of organisms (ecosystem engineers, invasive or protected species, etc.) is important for monitoring and making predictions in changing environments. One of the most promising directions for research on the impact of changes is species distribution modelling. Such technologies are highly dependent on occurrence data of high quality (Van Eupen et al. 2021). Earthworms (order Crassiclitellata) are a key group of organisms (Lavelle 2014), but their distribution around the globe is underrepresented in digital resources. Dozens of earthworm species, both widespread and endemic, inhabit the territory of Northern Eurasia (Perel 1979), but extremely poor data on them is available through global biodiversity repositories (Cameron 2018). There are two main obstacles to data mobilisation. Firstly, studies of the diversity of earthworms in Northern Eurasia have a long history (since the end of the nineteenth century) and were conducted by several generations of Soviet and Russian researchers. Most of the collected data have been published in \"grey literature\", now stored only in a few libraries. Until recently, most of these remained largely undigitised, and some are probably irretrievably lost. The second problem is the difference in the taxonomic checklists used by Soviet and European researchers. Not all species and synonyms are included in the GBIF (Global Biodiversity Information Facility) Backbone Taxonomy. As a result, existing earthworm species distribution models (Phillips 2019) potentially miss a significant amount of data and may underestimate biodiversity, and predict distributions inaccurately. 
To fill this gap, we collected occurrence data from the Russian language literature (published by Soviet and Russian researchers) and digitised species checklists, keeping the original scientific names. To find relevant literature, we conducted a keyword search for \"earthworms\" and \"Lumbricidae\" through the Russian national scientific online library eLibrary and screened reference lists from the monographs of leading Soviet and Russian soil zoologist Tamara Perel (Vsevolodova-Perel 1997, Perel 1979). As a result, about 1,000 references were collected, of which 330 papers had titles indicating the potential to contain data on earthworm occurrences. Among these, 219 were found as PDF files or printed papers. For dataset compilation, 159 papers were used; the others had no exact location data or duplicated data contained in other papers. Most of the sources were peer-reviewed articles (Table 1). A reference list is available through Zenodo (Ivanova et al. 2023). The earliest publication we could find dates back to 1899, by Wilhelm Michaelsen. The most recent publication is 2023. About a third of the sources were written by systematists Iosif Malevich and Tamara Perel. Occurrence data were extracted and structured according to the Darwin Core standard (Wieczorek et al. 2012). During the data digitisation process, we tried to","PeriodicalId":9011,"journal":{"name":"Biodiversity Information Science and Standards","volume":"187 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136308970","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Robot-in-the-loop: Prototyping robotic digitisation at the Natural History Museum
Biodiversity Information Science and Standards Pub Date : 2023-09-20 DOI: 10.3897/biss.7.112947
Ben Scott, Arianna Salili-James, Vincent Smith
{"title":"Robot-in-the-loop: Prototyping robotic digitisation at the Natural History Museum","authors":"Ben Scott, Arianna Salili-James, Vincent Smith","doi":"10.3897/biss.7.112947","DOIUrl":"https://doi.org/10.3897/biss.7.112947","url":null,"abstract":"The Natural History Museum, London (NHM) is home to an impressive collection of over 80 million specimens, of which just 5.5 million have been digitised. Like all similar collections, digitisation of these specimens is very labour intensive, requiring time-consuming manual handling. Each specimen is extracted from its curatorial unit, placed for imaging, labels are manually manipulated, and then returned to storage. Thanks to the NHM’s team of digitisers, workflows are becoming more efficient as they are refined. However, many of these workflows are highly repetitive and ideally suited to automation. The museum is now exploring integrating robots into the digitisation process. The NHM has purchased a Techman TM5 900 robotic arm, equipped with integrated Artificial Intelligence (AI) software and additional features such as custom grippers and a 3D scanner. This robotic arm combines advanced imaging technologies, machine learning algorithms, and robotic manipulation capabilities to capture high-quality specimen data, making it possible to digitise vast collections efficiently (Fig. 1). We showcase the NHM's application of robotics for digitisation, outlining the use cases developed for implementation and the prototypical workflows already in place at the museum. 
We will explore our invasive and non-invasive digitisation experiments, the many challenges, and the initial results of our early experiments with this transformative technology.","PeriodicalId":9011,"journal":{"name":"Biodiversity Information Science and Standards","volume":"85 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136308760","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
What Can You Do With 200 Million Newspaper Articles: Exploring GLAM data in the Humanities
Biodiversity Information Science and Standards Pub Date : 2023-09-19 DOI: 10.3897/biss.7.112935
Tim Sherratt
{"title":"What Can You Do With 200 Million Newspaper Articles: Exploring GLAM data in the Humanities","authors":"Tim Sherratt","doi":"10.3897/biss.7.112935","DOIUrl":"https://doi.org/10.3897/biss.7.112935","url":null,"abstract":"I’m a historian who works with data from the GLAM sector (galleries, libraries, archives and museums). When I talk about GLAM data, I’m usually talking about things like newspapers, government documents, photographs, letters, websites, and books. Some of it is well-described, structured, and easily accessible, and some is not. All of it offers us the chance to ask new questions of our past, to see things differently. But what tools, what examples, what documentation, and what support are needed to encourage researchers to explore these possibilities—to engage with collections as data? In this talk, I’ll be describing some of my own adventures amidst GLAM data, before focusing on questions of access, infrastructure, and skills development. In particular, I’ll be introducing the GLAM Workbench—a collection of tools, tutorials, examples, and hacks aimed at helping humanities researchers navigate the world of data. What pathways do we need, and how can we build them?","PeriodicalId":9011,"journal":{"name":"Biodiversity Information Science and Standards","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135061374","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Using ChatGPT with Confidence for Biodiversity-Related Information Tasks
Biodiversity Information Science and Standards Pub Date : 2023-09-19 DOI: 10.3897/biss.7.112926
Michael Elliott, José Fortes
{"title":"Using ChatGPT with Confidence for Biodiversity-Related Information Tasks","authors":"Michael Elliott, José Fortes","doi":"10.3897/biss.7.112926","DOIUrl":"https://doi.org/10.3897/biss.7.112926","url":null,"abstract":"Recent advancements in conversational Artificial Intelligence (AI), such as OpenAI's Chat Generative Pre-Trained Transformer (ChatGPT), present the possibility of using large language models (LLMs) as tools for retrieving, analyzing, and transforming scientific information. We have found that ChatGPT (GPT 3.5) can provide accurate biodiversity knowledge in response to questions about species descriptions, occurrences, and taxonomy, as well as structure information according to data sharing standards such as Darwin Core. A rigorous evaluation of ChatGPT's capabilities in biodiversity-related tasks may help to inform viable use cases for today's LLMs in research and information workflows. In this work, we test the extent of ChatGPT's biodiversity knowledge, characterize its mistakes, and suggest how LLM-based systems might be designed to complete knowledge-based tasks with confidence. To test ChatGPT's biodiversity knowledge, we compiled a question-and-answer test set derived from Darwin Core records available in Integrated Digitized Biocollections (iDigBio). Each question focuses on one or more Darwin Core terms to test the model’s ability to recall species occurrence information and its understanding of the standard. The test set covers a range of locations, taxonomic groups, and both common and rare species (defined by the number of records in iDigBio). The results of the tests will be presented. We also tested ChatGPT on generative tasks, such as creating species occurrence maps. A visual comparison of the maps with iDigBio data shows that for some species, ChatGPT can generate fairly accurate representations of their geographic ranges (Fig. 1). ChatGPT's incorrect responses in our tests show several patterns of mistakes. 
First, responses can be self-conflicting. For example, when asked \"Does Acer saccharum naturally occur in Benton, Oregon?\", ChatGPT responded \"YES, Acer saccharum DOES NOT naturally occur in Benton, Oregon\". ChatGPT can also be misled by semantics in species names. For Rafinesquia neomexicana, the word \"neomexicana\" leads ChatGPT to believe that the species primarily occurs in New Mexico, USA. ChatGPT may also confuse species, such as when attempting to describe a lesser-known species (e.g., a rare bee) within the same genus as a better-known species. Other causes of mistakes include hallucination (Ji et al. 2023), memorization (Chang and Bergen 2023), and user deception (Li et al. 2023). Some mistakes may be avoided by prompt engineering, e.g., few-shot prompting (Chang and Bergen 2023) and chain-of-thought prompting (Wei et al. 2022). These techniques assist Large Language Models (LLMs) by clarifying expectations or by guiding recollection. However, such methods cannot help when LLMs lack required knowledge. In these cases, alternative approaches are needed. A desired reliability can be theoretically guaranteed if responses that contain mistakes are discarded or corrected. This requires either detecting or predicting mistake","PeriodicalId":9011,"journal":{"name":"Biodiversity Information Science and Standards","volume":"178 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135061369","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
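The ChatGPT abstract above gives a self-conflicting reply as one mistake pattern and notes that reliability depends on detecting mistakes so that faulty responses can be discarded or corrected. The sketch below is a toy heuristic for that single pattern, assuming yes/no questions answered in English. It is not the detection method of the authors, and the function name is illustrative.

```python
import re

# Matches "not" as a whole word, or a contracted "n't" (e.g. "doesn't").
NEGATION = re.compile(r"\bnot\b|n't\b", re.IGNORECASE)


def is_self_conflicting(response: str) -> bool:
    """Flag replies that open with YES but negate the claim in the first sentence."""
    text = response.strip()
    opens_with_yes = re.match(r"yes\b", text, re.IGNORECASE) is not None
    # Split after sentence-ending punctuation; abbreviated names such as
    # "A. saccharum" would cut the sentence short, a known limit of this sketch.
    first_sentence = re.split(r"(?<=[.!?])\s", text, maxsplit=1)[0]
    return opens_with_yes and NEGATION.search(first_sentence) is not None


reply = "YES, Acer saccharum DOES NOT naturally occur in Benton, Oregon"
print(is_self_conflicting(reply))
```

A filter like this catches only the surface form quoted in the abstract. Replies that are wrong but internally consistent pass through, which is why the abstract also discusses predicting mistakes.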