Encoder-Attention-Based Automatic Term Recognition (EA-ATR)

Sampritha H. Manjunath, John P. McCrae
Venue: International Conference on Language, Data, and Knowledge (LDK 2021)
DOI: 10.4230/OASIcs.LDK.2021.23
Citations: 2

Abstract

Automated Term Recognition (ATR) is the task of finding terminology in raw text. It involves designing techniques for mining candidate terms from the text, filtering the candidates by scores computed with methods such as frequency of occurrence, and then ranking the surviving terms. Current approaches often rely on statistics and regular expressions over part-of-speech tags to identify terms, but this is error-prone. We propose a deep learning technique to improve the identification of candidate term sequences. We improve term recognition by using embeddings based on Bidirectional Encoder Representations from Transformers (BERT) to decide which sequences of words are terms. The model is trained on Wikipedia titles: we take all Wikipedia titles as the positive set and random n-grams generated from raw text as a weak negative set. The positive and negative sets are trained with the Embed, Encode, Attend and Predict (EEAP) formulation, using BERT for the embeddings. The model is then evaluated against domain-specific corpora: GENIA (annotated biological terms) and Krapivin (scientific papers from the computer science domain).

2012 ACM Subject Classification: Information systems → Top-k retrieval in databases; Computing methodologies → Information extraction; Computing methodologies → Neural networks
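The weak-supervision setup described in the abstract (Wikipedia titles as the positive set, random n-grams from raw text as a weak negative set) can be sketched as follows. This is a minimal illustration, not the authors' released code: the tokenization, sampling parameters, and the collision filter are assumptions made here for clarity.

```python
import random

def build_training_sets(titles, raw_text, n_negatives, max_n=4, seed=0):
    """Label Wikipedia titles as positive term candidates and draw
    random n-grams from raw text as a weakly labelled negative set."""
    rng = random.Random(seed)
    positives = [(t.lower(), 1) for t in titles]
    positive_strings = {t for t, _ in positives}
    tokens = raw_text.lower().split()
    negatives = []
    while len(negatives) < n_negatives:
        n = rng.randint(1, max_n)
        start = rng.randrange(0, len(tokens) - n + 1)
        ngram = " ".join(tokens[start:start + n])
        # The negative set is only weakly labelled: a random n-gram may
        # still be a real term, so we at least filter known positives.
        if ngram not in positive_strings:
            negatives.append((ngram, 0))
    return positives + negatives
```

Because the negatives are only weakly labelled, some noise is expected; the paper relies on the classifier to tolerate it rather than on a clean negative set.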
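The Attend and Predict stages of the EEAP formulation can be illustrated with a small NumPy sketch. The BERT embedding and encoding steps are assumed to have already produced the `encoded` matrix, and the weight vectors below are stand-ins for trained parameters, chosen here only to show the shape of the computation.

```python
import numpy as np

def attention_pool(encoded, w_attn):
    """Attend: collapse a (seq_len, dim) matrix of encoded token
    vectors into one (dim,) vector via softmax attention weights."""
    scores = encoded @ w_attn              # (seq_len,) unnormalized scores
    scores = scores - scores.max()         # shift for numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()  # softmax weights
    return alpha @ encoded                 # attention-weighted sum, (dim,)

def predict_term(encoded, w_attn, w_out, b_out):
    """Predict: logistic score that the word sequence is a term."""
    pooled = attention_pool(encoded, w_attn)
    return 1.0 / (1.0 + np.exp(-(pooled @ w_out + b_out)))
```

In training, the score would be pushed toward 1 for Wikipedia-title positives and toward 0 for the sampled n-gram negatives.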