Research on text structuralization in medical field
Xiangwu Ding, Xihua Zhang
2016 2nd International Conference on Cloud Computing and Internet of Things (CCIOT), 2016-10-01
DOI: 10.1109/CCIOT.2016.7868324 (https://doi.org/10.1109/CCIOT.2016.7868324)
Citations: 0
Abstract
Transforming unstructured medical text into structured data is the basis for processing and analyzing medical data. General-purpose word segmentation tools recognize medical terminology poorly, which greatly reduces segmentation accuracy and in turn degrades the result of text structuralization. To address these problems, this paper proposes a method for discovering new words based on word embedding. It uses word2vec, Google's open-source word vector tool, to train on the text and map words into an abstract n-dimensional vector space, from which the latent semantic relations between words in the corpus can be obtained. New words are then found by combining information entropy and word frequency. Finally, we design information extraction rules that use the new words to extract the key information and organize it into structured data. Experimental results on real medical data show that, compared with the traditional method, accuracy improves by 10% and processing time is reduced by 18%.
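The abstract does not say how information entropy and word frequency are combined, so the following is only a minimal sketch of one common approach, not the authors' implementation. It assumes that adjacent tokens from a general-purpose segmenter are merged into candidates, and that a candidate is kept when it is frequent and the entropy of its left and right neighbours is high. The function names, the thresholds `min_freq` and `min_entropy`, and the use of the smaller of the two boundary entropies are all illustrative assumptions. The word2vec step is omitted.

```python
import math
from collections import Counter, defaultdict


def entropy(counter):
    """Shannon entropy (in nats) of a Counter of neighbour counts."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counter.values())


def candidate_new_words(tokens, min_freq=2, min_entropy=0.5):
    """Score adjacent token pairs as new-word candidates.

    tokens: output of a general-purpose segmenter (hypothetical input).
    A pair is kept if it occurs at least min_freq times and both its left
    and right neighbours vary enough (boundary entropy >= min_entropy),
    i.e. the pair behaves like a free-standing unit.
    """
    freq = Counter()
    left = defaultdict(Counter)   # pair -> counts of the preceding token
    right = defaultdict(Counter)  # pair -> counts of the following token
    for i in range(len(tokens) - 1):
        pair = (tokens[i], tokens[i + 1])
        freq[pair] += 1
        if i > 0:
            left[pair][tokens[i - 1]] += 1
        if i + 2 < len(tokens):
            right[pair][tokens[i + 2]] += 1

    results = []
    for pair, n in freq.items():
        if n < min_freq:
            continue
        # Assumption: use the weaker of the two boundaries as the score.
        boundary = min(entropy(left[pair]), entropy(right[pair]))
        if boundary >= min_entropy:
            results.append(("".join(pair), n, boundary))
    # Most frequent first, then highest boundary entropy.
    return sorted(results, key=lambda r: (-r[1], -r[2]))
```

In a fuller pipeline, the candidates returned here could be added to the segmenter's dictionary before re-segmenting the corpus. How the embedding similarities from word2vec feed into candidate selection is not described in the abstract and is left out of this sketch.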