Semantic Textual Similarity in Japanese Clinical Domain Texts Using BERT.

IF 1.3 · CAS Q4 (Medicine) · JCR Q3, Computer Science, Information Systems
Methods of Information in Medicine · Pub Date: 2021-06-01 · Epub Date: 2021-07-08 · DOI: 10.1055/s-0041-1731390
Faith Wavinya Mutinda, Shuntaro Yada, Shoko Wakamiya, Eiji Aramaki
{"title":"Semantic Textual Similarity in Japanese Clinical Domain Texts Using BERT.","authors":"Faith Wavinya Mutinda, Shuntaro Yada, Shoko Wakamiya, Eiji Aramaki","doi":"10.1055/s-0041-1731390","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Semantic textual similarity (STS) captures the degree of semantic similarity between texts. It plays an important role in many natural language processing applications such as text summarization, question answering, machine translation, information retrieval, dialog systems, plagiarism detection, and query ranking. STS has been widely studied in the general English domain. However, there exists few resources for STS tasks in the clinical domain and in languages other than English, such as Japanese.</p><p><strong>Objective: </strong>The objective of this study is to capture semantic similarity between Japanese clinical texts (Japanese clinical STS) by creating a Japanese dataset that is publicly available.</p><p><strong>Materials: </strong>We created two datasets for Japanese clinical STS: (1) Japanese case reports (CR dataset) and (2) Japanese electronic medical records (EMR dataset). The CR dataset was created from publicly available case reports extracted from the CiNii database. The EMR dataset was created from Japanese electronic medical records.</p><p><strong>Methods: </strong>We used an approach based on bidirectional encoder representations from transformers (BERT) to capture the semantic similarity between the clinical domain texts. BERT is a popular approach for transfer learning and has been proven to be effective in achieving high accuracy for small datasets. We implemented two Japanese pretrained BERT models: a general Japanese BERT and a clinical Japanese BERT. The general Japanese BERT is pretrained on Japanese Wikipedia texts while the clinical Japanese BERT is pretrained on Japanese clinical texts.</p><p><strong>Results: </strong>The BERT models performed well in capturing semantic similarity in our datasets. The general Japanese BERT outperformed the clinical Japanese BERT and achieved a high correlation with human score (0.904 in the CR dataset and 0.875 in the EMR dataset). It was unexpected that the general Japanese BERT outperformed the clinical Japanese BERT on clinical domain dataset. This could be due to the fact that the general Japanese BERT is pretrained on a wide range of texts compared with the clinical Japanese BERT.</p>","PeriodicalId":49822,"journal":{"name":"Methods of Information in Medicine","volume":"60 S 01","pages":"e56-e64"},"PeriodicalIF":1.3000,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_pdf/79/46/10-1055-s-0041-1731390.PMC8294940.pdf","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Methods of Information in Medicine","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1055/s-0041-1731390","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2021/7/8 0:00:00","PubModel":"Epub","JCR":"Q3","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 0

Abstract

Background: Semantic textual similarity (STS) captures the degree of semantic similarity between texts. It plays an important role in many natural language processing applications such as text summarization, question answering, machine translation, information retrieval, dialog systems, plagiarism detection, and query ranking. STS has been widely studied in the general English domain. However, few resources exist for STS tasks in the clinical domain and in languages other than English, such as Japanese.

Objective: The objective of this study is to capture semantic similarity between Japanese clinical texts (Japanese clinical STS) by creating a Japanese dataset that is publicly available.

Materials: We created two datasets for Japanese clinical STS: (1) Japanese case reports (CR dataset) and (2) Japanese electronic medical records (EMR dataset). The CR dataset was created from publicly available case reports extracted from the CiNii database. The EMR dataset was created from Japanese electronic medical records.

Methods: We used an approach based on bidirectional encoder representations from transformers (BERT) to capture the semantic similarity between the clinical domain texts. BERT is a popular approach for transfer learning and has proven effective in achieving high accuracy on small datasets. We implemented two pretrained Japanese BERT models: a general Japanese BERT and a clinical Japanese BERT. The general Japanese BERT is pretrained on Japanese Wikipedia texts, while the clinical Japanese BERT is pretrained on Japanese clinical texts.
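To make the approach concrete, the following is a minimal sketch (not the authors' released code) of treating STS as BERT sentence-pair regression with the Hugging Face Transformers library. The checkpoint name cl-tohoku/bert-base-japanese is an assumption standing in for the "general Japanese BERT," and fine-tuning details (loss, optimizer, epochs) are omitted.

```python
# Minimal sketch of BERT-based STS as sentence-pair regression.
# Assumptions (not from the paper): the Hugging Face Transformers API and
# the cl-tohoku/bert-base-japanese checkpoint as the "general Japanese BERT"
# (its tokenizer requires the fugashi and ipadic packages).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "cl-tohoku/bert-base-japanese"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# num_labels=1 gives a single-output head, i.e., a regressor that can be
# fine-tuned with MSE loss against human-annotated similarity scores.
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=1)
model.eval()

def predict_similarity(text_a: str, text_b: str) -> float:
    """Score one pair; BERT encodes it as [CLS] text_a [SEP] text_b [SEP]."""
    inputs = tokenizer(text_a, text_b, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return model(**inputs).logits.squeeze().item()

# Illustrative pair of clinical-style sentences.
print(predict_similarity("患者は頭痛を訴えた。", "患者は頭が痛いと述べた。"))
```

Note that the regression head is randomly initialized at load time, so its outputs only become meaningful after fine-tuning on sentence pairs labeled with human similarity scores.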

Results: The BERT models performed well in capturing semantic similarity in our datasets. The general Japanese BERT outperformed the clinical Japanese BERT, achieving a high correlation with human scores (0.904 on the CR dataset and 0.875 on the EMR dataset). It was unexpected that the general Japanese BERT outperformed the clinical Japanese BERT on clinical domain datasets. This may be because the general Japanese BERT is pretrained on a wider range of texts than the clinical Japanese BERT.
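For the evaluation step, here is a hedged sketch of comparing model predictions with human annotations. It assumes Pearson correlation (the standard STS metric; the abstract does not name the coefficient), and the scores below are illustrative placeholders, not data from the paper.

```python
# Sketch of evaluating STS predictions against human annotations with
# Pearson correlation; all numbers are invented for illustration.
from scipy.stats import pearsonr

human_scores = [0.0, 1.5, 3.0, 4.5, 5.0]  # gold similarity annotations
model_scores = [0.4, 1.2, 3.3, 4.1, 4.8]  # fine-tuned model predictions

r, p = pearsonr(human_scores, model_scores)
print(f"Pearson correlation: {r:.3f}")  # values near 1.0 track human judgment
```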

Source Journal
Methods of Information in Medicine (Medicine – Computer Science, Information Systems)
CiteScore: 3.70
Self-citation rate: 11.80%
Articles per year: 33
Review turnaround: 6-12 weeks
Journal description: Good medicine and good healthcare demand good information. Since the journal's founding in 1962, Methods of Information in Medicine has stressed the methodology and scientific fundamentals of organizing, representing and analyzing data, information and knowledge in biomedicine and health care. Covering publications in the fields of biomedical and health informatics, medical biometry, and epidemiology, the journal publishes original papers, reviews, reports, opinion papers, editorials, and letters to the editor. From time to time, the journal publishes articles on particular focus themes as part of a journal's issue.