使用半监督学习识别精确肿瘤学的关键句子

Workshop on Biomedical Natural Language Processing Pub Date : 1900-01-01 DOI:10.18653/v1/W18-2305

J. Seva, Martin Wackerbauer, U. Leser

{"title":"使用半监督学习识别精确肿瘤学的关键句子","authors":"J. Seva, Martin Wackerbauer, U. Leser","doi":"10.18653/v1/W18-2305","DOIUrl":null,"url":null,"abstract":"We present a machine learning pipeline that identifies key sentences in abstracts of oncological articles to aid evidence-based medicine. This problem is characterized by the lack of gold standard datasets, data imbalance and thematic differences between available silver standard corpora. Additionally, available training and target data differs with regard to their domain (professional summaries vs. sentences in abstracts). This makes supervised machine learning inapplicable. We propose the use of two semi-supervised machine learning approaches: To mitigate difficulties arising from heterogeneous data sources, overcome data imbalance and create reliable training data we propose using transductive learning from positive and unlabelled data (PU Learning). For obtaining a realistic classification model, we propose the use of abstracts summarised in relevant sentences as unlabelled examples through Self-Training. The best model achieves 84% accuracy and 0.84 F1 score on our dataset","PeriodicalId":200974,"journal":{"name":"Workshop on Biomedical Natural Language Processing","volume":"18 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":"{\"title\":\"Identifying Key Sentences for Precision Oncology Using Semi-Supervised Learning\",\"authors\":\"J. Seva, Martin Wackerbauer, U. Leser\",\"doi\":\"10.18653/v1/W18-2305\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"We present a machine learning pipeline that identifies key sentences in abstracts of oncological articles to aid evidence-based medicine. This problem is characterized by the lack of gold standard datasets, data imbalance and thematic differences between available silver standard corpora. Additionally, available training and target data differs with regard to their domain (professional summaries vs. sentences in abstracts). This makes supervised machine learning inapplicable. We propose the use of two semi-supervised machine learning approaches: To mitigate difficulties arising from heterogeneous data sources, overcome data imbalance and create reliable training data we propose using transductive learning from positive and unlabelled data (PU Learning). For obtaining a realistic classification model, we propose the use of abstracts summarised in relevant sentences as unlabelled examples through Self-Training. The best model achieves 84% accuracy and 0.84 F1 score on our dataset\",\"PeriodicalId\":200974,\"journal\":{\"name\":\"Workshop on Biomedical Natural Language Processing\",\"volume\":\"18 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"1900-01-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"4\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Workshop on Biomedical Natural Language Processing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.18653/v1/W18-2305\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Workshop on Biomedical Natural Language Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.18653/v1/W18-2305","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 4

摘要

我们提出了一个机器学习管道，可以识别肿瘤学文章摘要中的关键句子，以帮助循证医学。这一问题的特点是缺乏黄金标准数据集、数据不平衡以及可用的白银标准语料库之间的主题差异。此外，可用的训练数据和目标数据在其领域(专业摘要与摘要中的句子)方面有所不同。这使得监督式机器学习不适用。我们建议使用两种半监督机器学习方法:为了减轻异构数据源带来的困难，克服数据不平衡并创建可靠的训练数据，我们建议使用来自正数据和未标记数据的转导学习(PU学习)。为了获得一个真实的分类模型，我们建议使用相关句子中总结的摘要作为通过自我训练的无标记示例。在我们的数据集上，最好的模型达到了84%的准确率和0.84的F1分数

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Identifying Key Sentences for Precision Oncology Using Semi-Supervised Learning

We present a machine learning pipeline that identifies key sentences in abstracts of oncological articles to aid evidence-based medicine. This problem is characterized by the lack of gold standard datasets, data imbalance and thematic differences between available silver standard corpora. Additionally, available training and target data differs with regard to their domain (professional summaries vs. sentences in abstracts). This makes supervised machine learning inapplicable. We propose the use of two semi-supervised machine learning approaches: To mitigate difficulties arising from heterogeneous data sources, overcome data imbalance and create reliable training data we propose using transductive learning from positive and unlabelled data (PU Learning). For obtaining a realistic classification model, we propose the use of abstracts summarised in relevant sentences as unlabelled examples through Self-Training. The best model achieves 84% accuracy and 0.84 F1 score on our dataset

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Workshop on Biomedical Natural Language Processing

自引率

0.00%

发文量