Providing Citations to Support Fact-Checking: Contextualizing Detection of Sentences Needing Citation on Small Wikipedias

Aida Halitaj, Arkaitz Zubiaga
Natural Language Processing Journal, volume 8, Article 100093. Published 2024-08-02. DOI: 10.1016/j.nlp.2024.100093. Article page: https://www.sciencedirect.com/science/article/pii/S2949719124000414
Citation count: 0

Abstract

Authoritative citations are critical to ensuring information integrity, especially in encyclopedias like Wikipedia. To date, research on automating citation-worthiness detection has largely focused on the best-resourced edition, English Wikipedia, neglecting applicability to smaller Wikipedias. In addition, previous research proposed models that analyze only the content of a sentence itself to determine its citation worthiness, overlooking the potential of additional context to improve the prediction. Addressing these gaps, our study proposes a transformer-based contextualized approach for smaller Wikipedias and presents a novel method for compiling high-quality datasets for the Albanian, Basque, and Catalan editions. We develop the Contextualized Citation Worthiness (CCW) model, which employs sentence representations enriched with adjacent sentences and topic categories. Empirical experiments on three newly created datasets demonstrate significant performance improvements for the contextualized CCW model, with absolute improvements of 6%, 3%, and 6% over the baseline for the Albanian, Basque, and Catalan datasets, respectively. We conduct an in-depth analysis of how, and to what extent, preceding and succeeding context and topic categories contribute to the accuracy of citation-worthiness predictions. Our findings suggest that incorporating such contextual information aids the automatic identification of sentences in need of citations, particularly when both the preceding and succeeding context are incorporated. This has implications for supporting Wikipedia projects across low-resource languages, promoting better article validation and fact-checking.
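
The abstract says that sentence representations are enriched with adjacent sentences and topic categories, but it does not give the exact input format. The following is a minimal sketch of one plausible way to assemble such an input before passing it to a transformer encoder. The function name `build_contextual_input`, the `[SEP]` separator, the one-sentence context window, and the sample sentences are all assumptions made for illustration and are not taken from the paper.

```python
from typing import List, Optional

# Assumed separator string; the paper's actual input format is not stated in the abstract.
SEP = " [SEP] "


def build_contextual_input(
    sentences: List[str],
    index: int,
    topics: Optional[List[str]] = None,
    use_preceding: bool = True,
    use_succeeding: bool = True,
) -> str:
    """Join a target sentence with its neighbours and topic categories (hypothetical helper)."""
    if not 0 <= index < len(sentences):
        raise IndexError("index is outside the sentence list")

    parts = []
    # Preceding context: the sentence immediately before the target, if any.
    if use_preceding and index > 0:
        parts.append(sentences[index - 1])
    # The target sentence whose citation worthiness would be predicted.
    parts.append(sentences[index])
    # Succeeding context: the sentence immediately after the target, if any.
    if use_succeeding and index + 1 < len(sentences):
        parts.append(sentences[index + 1])
    # Topic categories are appended as a single comma-separated segment.
    if topics:
        parts.append(", ".join(topics))
    return SEP.join(parts)


# Made-up example article, for illustration only.
article = [
    "The town lies on the northern coast.",
    "Its population doubled between 1990 and 2010.",
    "Fishing remains the main industry.",
]
with_context = build_contextual_input(article, 1, topics=["Geography"])
target_only = build_contextual_input(
    article, 1, use_preceding=False, use_succeeding=False
)
print(with_context)
print(target_only)
```

Switching `use_preceding`, `use_succeeding`, and `topics` on and off gives the kind of input variants that an analysis of each context source's contribution would compare, with `target_only` corresponding to a sentence-only input.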
