Research on text structuralization in medical field

2016 2nd International Conference on Cloud Computing and Internet of Things (CCIOT) Pub Date: 2016-10-01 DOI: 10.1109/CCIOT.2016.7868324

Xiangwu Ding, Xihua Zhang

Abstract

Transforming unstructured medical text into structured data is the basis for processing and analyzing medical data. General-purpose word segmentation tools recognize domain terminology poorly, which lowers segmentation accuracy and in turn degrades the result of text structuralization. To address these problems, this paper proposes a method for discovering new words based on word embeddings. It uses word2vec, Google's open-source word vector tool, to train on the text and map words into an abstract n-dimensional vector space, from which the latent semantic relations between words in the corpus can be obtained. New words are then identified by combining information entropy and word frequency. Finally, information extraction rules are designed to extract key information according to the new words and organize it into structured data. Experimental results on real medical data show that, compared with the traditional method, accuracy improves by 10% and processing time is reduced by 18%.
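
The abstract does not give the formulas used to combine information entropy and word frequency, so the following is only a minimal sketch of one common formulation, not the authors' implementation. It assumes that a candidate string counts as a likely new word when it occurs often enough and when the characters on both its left and right sides are varied (high "branching" entropy). The function names, the thresholds `min_freq` and `min_entropy`, and the toy input are all hypothetical, and the word2vec step is omitted.

```python
import math
from collections import Counter, defaultdict


def entropy(counter):
    """Shannon entropy (in nats) of the distribution described by a Counter."""
    total = sum(counter.values())
    return -sum((c / total) * math.log(c / total) for c in counter.values())


def candidate_new_words(text, max_len=4, min_freq=3, min_entropy=0.5):
    """Score character n-grams by frequency and left/right neighbor entropy.

    Returns {ngram: (frequency, left_entropy, right_entropy)} for n-grams of
    length 2..max_len that pass both (assumed) thresholds.
    """
    freq = Counter()
    left = defaultdict(Counter)   # characters seen immediately before each n-gram
    right = defaultdict(Counter)  # characters seen immediately after each n-gram
    n = len(text)
    for i in range(n):
        for k in range(2, max_len + 1):
            if i + k > n:
                break
            gram = text[i:i + k]
            freq[gram] += 1
            if i > 0:
                left[gram][text[i - 1]] += 1
            if i + k < n:
                right[gram][text[i + k]] += 1

    result = {}
    for gram, f in freq.items():
        if f < min_freq:
            continue  # too rare to be a reliable candidate
        if not left[gram] or not right[gram]:
            continue  # no neighbor on one side, so entropy is undefined
        left_h, right_h = entropy(left[gram]), entropy(right[gram])
        # A fragment of a longer word always has the same neighbor on one
        # side, so its entropy there is zero and it is filtered out.
        if min(left_h, right_h) >= min_entropy:
            result[gram] = (f, left_h, right_h)
    return result


# Toy input: "abc" appears four times, each time with different neighbors.
candidates = candidate_new_words("xabcyzabcwqabcrmabcn")
for gram, (f, left_h, right_h) in candidates.items():
    print(gram, f, round(left_h, 3), round(right_h, 3))
```

In this toy input the fragments "ab" and "bc" are just as frequent as "abc", but each is always followed or preceded by the same character, so the entropy filter rejects them. In a pipeline like the one the abstract describes, the surviving candidates would presumably be added to the segmenter's dictionary before the extraction rules are applied.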
