{"title":"LSD600:首个注释了生活方式与疾病关系的生物医学摘要语料库","authors":"Esmaeil Nourani, Evangelia-Mantelena Makri, Xiqing Mao, Sampo Pyysalo, Søren Brunak, Katerina Nastou, Lars Juhl Jensen","doi":"10.1101/2024.08.30.24312862","DOIUrl":null,"url":null,"abstract":"Lifestyle factors (LSFs) are increasingly recognized as instrumental in both the development and control of diseases. Despite their importance, there is a lack of methods to extract relations between LSFs and diseases from the literature, a step necessary to consolidate the currently available knowledge into a structured form. As simple co-occurrence-based relation extraction (RE) approaches are unable to distinguish between the different types of LSF-disease relations, context-aware transformer-based models are required to extract and classify these relations into specific relation types. No comprehensive LSF–disease RE system existed, primarily due to the lack of a suitable corpus for developing it. We present LSD600, the first corpus specifically designed for LSF-disease RE, comprising 600 abstracts with 1900 relations of eight distinct types between 5,027 diseases and 6,930 LSF entities. We evaluated LSD600’s quality by training a RoBERTa model on the corpus, achieving an F-score of 68.5% for the multi-label RE task on the held-out test set. We further validated LSD600 by using the trained model on the two Nutrition-Disease and FoodDisease datasets, where it achieved F-scores of 70.7% and 80.7%, respectively. Building on these performance results, LSD600 and the RE system trained on it can be valuable resources to fill the existing gap in this area and pave the way for downstream applications.","PeriodicalId":501454,"journal":{"name":"medRxiv - Health Informatics","volume":"20 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-08-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"LSD600: the first corpus of biomedical abstracts annotated with lifestyle–disease relations\",\"authors\":\"Esmaeil Nourani, Evangelia-Mantelena Makri, Xiqing Mao, Sampo Pyysalo, Søren Brunak, Katerina Nastou, Lars Juhl Jensen\",\"doi\":\"10.1101/2024.08.30.24312862\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Lifestyle factors (LSFs) are increasingly recognized as instrumental in both the development and control of diseases. Despite their importance, there is a lack of methods to extract relations between LSFs and diseases from the literature, a step necessary to consolidate the currently available knowledge into a structured form. As simple co-occurrence-based relation extraction (RE) approaches are unable to distinguish between the different types of LSF-disease relations, context-aware transformer-based models are required to extract and classify these relations into specific relation types. No comprehensive LSF–disease RE system existed, primarily due to the lack of a suitable corpus for developing it. We present LSD600, the first corpus specifically designed for LSF-disease RE, comprising 600 abstracts with 1900 relations of eight distinct types between 5,027 diseases and 6,930 LSF entities. We evaluated LSD600’s quality by training a RoBERTa model on the corpus, achieving an F-score of 68.5% for the multi-label RE task on the held-out test set. We further validated LSD600 by using the trained model on the two Nutrition-Disease and FoodDisease datasets, where it achieved F-scores of 70.7% and 80.7%, respectively. Building on these performance results, LSD600 and the RE system trained on it can be valuable resources to fill the existing gap in this area and pave the way for downstream applications.\",\"PeriodicalId\":501454,\"journal\":{\"name\":\"medRxiv - Health Informatics\",\"volume\":\"20 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-08-31\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"medRxiv - Health Informatics\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1101/2024.08.30.24312862\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"medRxiv - Health Informatics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1101/2024.08.30.24312862","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
摘要
人们日益认识到,生活方式因素(LSFs)在疾病的发生和控制中起着重要作用。尽管生活方式因素非常重要,但目前还缺乏从文献中提取生活方式因素与疾病之间关系的方法,而这是将现有知识整合成结构化形式的必要步骤。由于简单的基于共现的关系提取(RE)方法无法区分 LSF-疾病关系的不同类型,因此需要基于上下文感知转换器的模型来提取这些关系并将其分类为特定的关系类型。目前还没有全面的 LSF-疾病 RE 系统,主要原因是缺乏合适的语料库来开发该系统。我们提出了 LSD600,这是第一个专门为 LSF-疾病 RE 设计的语料库,由 600 个摘要组成,包含 5,027 种疾病和 6,930 个 LSF 实体之间八种不同类型的 1900 种关系。我们在该语料库上训练了一个 RoBERTa 模型,对 LSD600 的质量进行了评估,在测试集上的多标签 RE 任务中取得了 68.5% 的 F-score。我们还在营养疾病和食品疾病两个数据集上使用训练好的模型进一步验证了 LSD600,其 F 分数分别达到了 70.7% 和 80.7%。在这些性能结果的基础上,LSD600 及其训练的 RE 系统可以成为填补该领域现有空白的宝贵资源,并为下游应用铺平道路。
LSD600: the first corpus of biomedical abstracts annotated with lifestyle–disease relations
Lifestyle factors (LSFs) are increasingly recognized as instrumental in both the development and control of diseases. Despite their importance, there is a lack of methods to extract relations between LSFs and diseases from the literature, a step necessary to consolidate the currently available knowledge into a structured form. As simple co-occurrence-based relation extraction (RE) approaches are unable to distinguish between the different types of LSF-disease relations, context-aware transformer-based models are required to extract and classify these relations into specific relation types. No comprehensive LSF–disease RE system existed, primarily due to the lack of a suitable corpus for developing it. We present LSD600, the first corpus specifically designed for LSF-disease RE, comprising 600 abstracts with 1900 relations of eight distinct types between 5,027 diseases and 6,930 LSF entities. We evaluated LSD600’s quality by training a RoBERTa model on the corpus, achieving an F-score of 68.5% for the multi-label RE task on the held-out test set. We further validated LSD600 by using the trained model on the two Nutrition-Disease and FoodDisease datasets, where it achieved F-scores of 70.7% and 80.7%, respectively. Building on these performance results, LSD600 and the RE system trained on it can be valuable resources to fill the existing gap in this area and pave the way for downstream applications.