Using empirically constructed lexical resources for named entity recognition.

Biomedical informatics insights Pub Date : 2013-06-24 Print Date: 2013-01-01 DOI:10.4137/BII.S11664
Siddhartha Jonnalagadda, Trevor Cohen, Stephen Wu, Hongfang Liu, Graciela Gonzalez
Volume: 6 Suppl 1; Pages: 17-27; Publication type: Journal Article; URL: https://doi.org/10.4137/BII.S11664
Citations: 16

Abstract

Because of privacy concerns and the expense involved in creating an annotated corpus, the existing small annotated corpora might not contain enough examples for a statistical learner to extract all named entities precisely. In this work, we evaluate what value may lie in automatically generated features based on distributional semantics when using machine-learning named entity recognition (NER). The features we generated and experimented with include n-nearest words, support vector machine (SVM)-regions, and term clustering, all of which are considered distributional semantic features. Adding the n-nearest words feature to a baseline system resulted in a greater increase in F-score than adding a manually constructed lexicon. Although the need for relatively small annotated corpora for retraining is not obviated, lexicons empirically derived from unannotated text can not only supplement manually created lexicons, but also replace them. This phenomenon is observed in extracting concepts from both biomedical literature and clinical notes.
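
The abstract does not say how the n-nearest words feature is computed, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes word vectors built from raw co-occurrence counts in unannotated text and ranks neighbours by cosine similarity. The function names, the window size, and the toy corpus are all hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict
from math import sqrt


def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` tokens of it."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[key] for key, count in a.items() if key in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def n_nearest_words(word, vectors, n=3):
    """Return up to n words whose context vectors are most similar to `word`."""
    if word not in vectors:  # unseen word: no distributional evidence
        return []
    target = vectors[word]
    scored = [(cosine(target, vec), other)
              for other, vec in vectors.items() if other != word]
    # Highest similarity first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for score, other in scored[:n] if score > 0]


def token_features(tokens, i, vectors, n=3):
    """Feature dict for token i: its surface form plus its n nearest words."""
    word = tokens[i].lower()
    features = {"word=" + word: 1.0}
    for neighbour in n_nearest_words(word, vectors, n):
        features["near=" + neighbour] = 1.0
    return features


# Hypothetical unannotated corpus (assumption: already tokenised and lowercased).
corpus = [
    "patient was given aspirin for pain".split(),
    "patient was given ibuprofen for pain".split(),
    "patient reported chest pain today".split(),
]
vectors = cooccurrence_vectors(corpus)
print(token_features(["aspirin"], 0, vectors, n=1))
```

In a setup like this, the resulting feature dictionaries would be passed to whatever sequence labeller the NER system uses. Words that share contexts in the unannotated text then share `near=` features even if only one of them appears in the small annotated corpus.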
