From Web to RheumaLpack: Creating a Linguistic Corpus for Exploitation and Knowledge Discovery in Rheumatology

medRxiv - Rheumatology. Pub Date: 2024-04-27. DOI: 10.1101/2024.04.26.24306269

Alfredo Madrid-García, Beatriz Merino-Barbancho, Dalifer Freites-Núñez, Luis Rodríguez-Rodríguez, Ernestina Menasalvas-Ruíz, Alejandro Rodríguez-González, Anselmo Peñas



Abstract

This study introduces RheumaLinguisticpack (RheumaLpack), the first specialised linguistic web corpus designed for the field of musculoskeletal disorders. By combining web mining (i.e., web scraping) and natural language processing (NLP) techniques with clinical expertise, RheumaLpack systematically captures and curates data across a spectrum of web sources, including clinical trials registers (i.e., ClinicalTrials.gov), bibliographic databases (i.e., PubMed), medical agencies (i.e., EMA), social media (i.e., Reddit), and accredited health websites (i.e., MedlinePlus, Harvard Health Publishing, and Cleveland Clinic). Given the complexity of rheumatic and musculoskeletal diseases (RMDs) and their significant impact on quality of life, this resource is proposed as a useful tool for training algorithms that could mitigate the diseases' effects. The corpus therefore aims to improve the training of artificial intelligence (AI) algorithms and to facilitate knowledge discovery in RMDs. The development of RheumaLpack followed a systematic six-step methodology covering data identification, characterisation, selection, collection, processing, and corpus description. The result is a non-annotated, monolingual, and dynamic corpus of almost 3 million records spanning 2000 to 2023. RheumaLpack represents a pioneering contribution to rheumatology research, providing a useful resource for the development of advanced AI and NLP applications. The corpus highlights the value of web data in addressing the challenges posed by musculoskeletal diseases and illustrates its potential to improve research and treatment paradigms in rheumatology. Finally, the methodology can be replicated to obtain data from other medical specialities. The code and details on how to build RheumaL(inguistic)pack are also provided to facilitate the dissemination of this resource.
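To make the selection and corpus-description steps more concrete, the following is a minimal Python sketch of how records from several web sources might share one schema and be summarised. It is not the authors' code: the `CorpusRecord` fields, the function names, and the default language code `"en"` are all assumptions introduced for illustration. The abstract states only that the corpus is non-annotated, monolingual, and spans 2000 to 2023.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CorpusRecord:
    """One non-annotated text record. Field names are illustrative assumptions."""
    source: str       # e.g. "PubMed", "Reddit", "EMA"
    doc_id: str       # identifier within the source
    published: date   # publication or posting date
    language: str     # language code; "en" below is an assumption
    text: str         # raw text, with no annotation layer


def select_records(records, first_year=2000, last_year=2023, language="en"):
    """Selection step: keep records in one language and inside the year window."""
    return [
        r for r in records
        if r.language == language and first_year <= r.published.year <= last_year
    ]


def describe_corpus(records):
    """Description step: total count, count per source, and overall year span."""
    years = [r.published.year for r in records]
    return {
        "n_records": len(records),
        "per_source": dict(Counter(r.source for r in records)),
        "year_span": (min(years), max(years)) if years else None,
    }
```

Calling `describe_corpus(select_records(records))` on a list of `CorpusRecord` objects would give a per-source breakdown of the kind a corpus description typically reports. Because the corpus is described as dynamic, such a summary would need to be recomputed whenever new records are collected.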

