Research on text structuralization in medical field

Xiangwu Ding, Xihua Zhang
Published in: 2016 2nd International Conference on Cloud Computing and Internet of Things (CCIOT)
Publication date: 2016-10-01
DOI: 10.1109/CCIOT.2016.7868324 (https://doi.org/10.1109/CCIOT.2016.7868324)
Citations: 0

Abstract

Transforming unstructured medical text into structured data is the basis for processing and analyzing medical data. General-purpose word segmentation tools recognize domain terminology poorly, which greatly reduces segmentation accuracy and in turn degrades the result of text structuralization. To address these problems, this paper proposes a method for discovering new words based on word embeddings. It uses word2vec, Google's open-source word vector tool, to train on the text and map words into an abstract n-dimensional vector space, which exposes latent semantic relations between words in the corpus. New words are then identified by combining information entropy and word frequency. Finally, information extraction rules are designed to obtain key information based on the new words and organize it into structured data. Experimental results on real medical data show that, compared with the traditional method, accuracy improves by 10% and processing time is reduced by 18%.
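The abstract does not say how information entropy and word frequency are combined, so the following is only a minimal sketch of one common approach, not the authors' method. It assumes that a candidate new word is a character n-gram that occurs often and whose left and right neighboring characters are varied (high "branch" entropy). The function names, the thresholds `min_freq` and `min_entropy`, and the toy corpus are all hypothetical, and the word2vec step is omitted.

```python
import math
from collections import Counter, defaultdict


def entropy(counter):
    """Shannon entropy (in bits) of a Counter of neighboring characters."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counter.values())


def candidate_new_words(sentences, max_len=4, min_freq=3, min_entropy=1.0):
    """Return {ngram: (frequency, left_entropy, right_entropy)}.

    Hypothetical criterion: keep character n-grams (2..max_len) that occur
    at least min_freq times and whose left AND right neighbor entropies
    are both at least min_entropy.
    """
    freq = Counter()
    left = defaultdict(Counter)   # ngram -> counts of the preceding character
    right = defaultdict(Counter)  # ngram -> counts of the following character
    for s in sentences:
        for n in range(2, max_len + 1):
            for i in range(len(s) - n + 1):
                gram = s[i:i + n]
                freq[gram] += 1
                if i > 0:
                    left[gram][s[i - 1]] += 1
                if i + n < len(s):
                    right[gram][s[i + n]] += 1

    result = {}
    for gram, f in freq.items():
        if f < min_freq:
            continue
        left_h, right_h = entropy(left[gram]), entropy(right[gram])
        if min(left_h, right_h) >= min_entropy:
            result[gram] = (f, left_h, right_h)
    return result


if __name__ == "__main__":
    # Toy corpus: "AB" stands in for a term that appears in varied contexts.
    corpus = ["xABy", "zABw", "qABr", "pABs"]
    print(candidate_new_words(corpus, max_len=3))
```

In a fuller pipeline, candidates that pass this filter could be added to the segmenter's dictionary before the extraction rules run. How the paper uses the word2vec vectors at this stage is not stated in the abstract.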