Contextualised Modelling for Effective Citation Function Classification

Xiaorui Jiang, Chaoxiang Cai, Wenwen Fan, Tong Liu, Jingqiang Chen
DOI: 10.1145/3582768.3582769
Published in: Proceedings of the 2022 6th International Conference on Natural Language Processing and Information Retrieval, 2022-12-16
Citations: 1

Abstract

Citation function classification is an important task in scientific text mining. The past two decades have witnessed many computerised algorithms working on citation function datasets tailored to various annotation schemes. Recently, deep learning has pushed the state of the art by a large margin. However, several pitfalls remain. Because annotation is difficult, datasets, and especially their minority classes, are often too small to train effective deep learning models. Less discussed is the fact that most state-of-the-art deep learning solutions in fact generate a feature representation for the citation sentence or context rather than modelling individual in-text citations. This is conceptually flawed, as multiple in-text citations with different functions commonly appear in the same citation sentence. In addition, existing deep learning studies have explored only a rather limited design space for encoding a citation and its surrounding context. This paper explores a wide range of modelling options based on SciBERT, the popular cross-disciplinary pre-trained scientific language model, and their performance on citation function classification, with the aim of determining the most effective way of modelling a citation and its context. To address the data size issue, we created a large-scale citation function dataset by mapping, merging and re-annotating six publicly available datasets from the computational linguistics domain, adapting Teufel et al.'s 12-class scheme. The best F1 scores we achieved were around 66.16%, 71.39% and 73.56% on an 11-class annotation scheme slightly adapted from Teufel et al.'s 12-class scheme, a reduced 7-class scheme obtained by merging comparison functions, and Jurgens et al.'s 6-class scheme, respectively. A useful observation is that no single model is superior for all functions; the trained model variants therefore support applications that emphasise a specific citation function or a specific group of functions.
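The abstract's central point is that the unit of classification should be the individual in-text citation, not the whole citation sentence, because one sentence can contain citations with different functions. A minimal sketch of one common way to realise this, delimiting the target citation with marker tokens before feeding the text to a SciBERT-style encoder, is shown below. The marker tokens and helper function are illustrative assumptions, not the paper's actual implementation:

```python
# Hypothetical sketch: single out one in-text citation inside a citation
# sentence by wrapping its character span in marker tokens, so that an
# encoder such as SciBERT can attend to that citation specifically.
# The [CIT]/[/CIT] marker names are assumptions for illustration.

def mark_target_citation(sentence: str, start: int, end: int,
                         open_tok: str = "[CIT]",
                         close_tok: str = "[/CIT]") -> str:
    """Wrap the character span [start, end) of the target citation in markers."""
    return sentence[:start] + open_tok + sentence[start:end] + close_tok + sentence[end:]

sentence = ("We follow (Smith, 2019) but, unlike (Jones, 2020), "
            "use contextual embeddings.")
# Target the second citation, whose function may differ from the first's.
start = sentence.index("(Jones, 2020)")
end = start + len("(Jones, 2020)")
print(mark_target_citation(sentence, start, end))
```

In a full pipeline the marker tokens would be registered as special tokens in the tokenizer's vocabulary so they are not split into subwords, and the marked text would then be encoded and classified.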