Using empirically constructed lexical resources for named entity recognition.

Biomedical informatics insights. Pub Date: 2013-06-24. Print Date: 2013-01-01. DOI: 10.4137/BII.S11664

Siddhartha Jonnalagadda, Trevor Cohen, Stephen Wu, Hongfang Liu, Graciela Gonzalez

Volume 6 Suppl 1, pages 17-27.

Citations: 16

Abstract

Because of privacy concerns and the expense of creating an annotated corpus, the existing small annotated corpora might not contain enough examples for a statistical learner to extract all named entities precisely. In this work, we evaluate the value of automatically generated features based on distributional semantics for machine-learning-based named entity recognition (NER). The features we generated and experimented with include n-nearest words, support vector machine (SVM) regions, and term clustering, all of which are distributional semantic features. Adding the n-nearest words feature to a baseline system produced a greater increase in F-score than adding a manually constructed lexicon did. Although relatively small annotated corpora are still needed for retraining, lexicons empirically derived from unannotated text can not only supplement manually created lexicons but also replace them. This holds when extracting concepts from both biomedical literature and clinical notes.
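
To make the idea of an n-nearest words feature concrete, here is a minimal sketch in Python. It is an illustration only and not the authors' implementation: the abstract does not say how the distributional model was built, so the sketch assumes plain co-occurrence counts within a symmetric window and cosine similarity. All function names, the window size, and the toy corpus are hypothetical.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` tokens of it."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[key] for key, count in u.items() if key in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def n_nearest_words(word, vectors, n=3):
    """Return the n words whose context vectors are most similar to `word`'s."""
    if word not in vectors:
        return []
    target = vectors[word]
    scored = [(cosine(target, vec), other)
              for other, vec in vectors.items() if other != word]
    # Highest similarity first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for _, other in scored[:n]]


def token_features(token, vectors, n=3):
    """Feature dict for one token, of the kind a sequence tagger could consume."""
    features = {"word": token.lower()}
    neighbours = n_nearest_words(token.lower(), vectors, n)
    for rank, neighbour in enumerate(neighbours):
        features[f"nearest_{rank}"] = neighbour
    return features


# Toy unannotated corpus (assumed data, for illustration only).
sentences = [
    ["patient", "received", "aspirin", "daily"],
    ["patient", "received", "ibuprofen", "daily"],
    ["patient", "denies", "fever", "today"],
]
vectors = cooccurrence_vectors(sentences)
print(token_features("Aspirin", vectors, n=1))
```

In this toy corpus, "aspirin" and "ibuprofen" occur in identical contexts, so each becomes the other's nearest word. A tagger that has seen only one of them labelled in a small annotated corpus can still pick up evidence about the other through the shared feature, which is the role a manually built lexicon would otherwise play.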

