Evaluating BERT's Encoding of Intrinsic Semantic Features of OCR'd Digital Library Collections

2021 ACM/IEEE Joint Conference on Digital Libraries (JCDL). Pub Date: 2021-09-01. DOI: 10.1109/JCDL52503.2021.00045

Ming Jiang, Yuerong Hu, Glen Worthey, Ryan Dubnicek, T. Underwood, J. S. Downie


Citations: 2

Abstract

Uncertainty caused by optical character recognition (OCR) noise has been a primary barrier preventing digital libraries (DLs) from promoting their curated datasets for research purposes, particularly when the datasets are fed into advanced language models whose inner workings are less transparent. To shed light on this issue, this study evaluates the impact of OCR noise on BERT models' encoding of the intrinsic semantic features of OCR'd texts. Specifically, we encoded chapter-wise pairs of OCR'd texts and their cleaned counterparts, extracted from books in six domains, using both pre-trained and fine-tuned BERT models. Given the encoded text features, we then calculated the cosine similarity between any two chapters and used normalized discounted cumulative gain (NDCG) [1] to measure the BERT variants' ability to preserve narrative coherence and semantic relevance among texts. Our empirical results show that (1) BERT embeddings can encode and preserve texts' intrinsic semantic features (i.e., relevance and coherence); and (2) this capability is comparatively robust against OCR noise. These findings should help alleviate some DL users' concerns about applying contextualized word embeddings to encode chapter-level or even document-level OCR'd text, which in turn supports the scholarly use of DL collections. Our research also demonstrates how texts' intrinsic semantic features can be used to evaluate the impact of OCR noise on advanced language models, an underdeveloped but promising direction for future work.
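The evaluation idea described in the abstract (rank chapters by cosine similarity of their embeddings, then score the ranking with NDCG, once for clean text and once for OCR'd text) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the 3-dimensional vectors are toy stand-ins for BERT chapter embeddings, the binary relevance labels (1 = chapter from the same book as the query, 0 = otherwise) are hypothetical, and the `log2(rank + 1)` discount is one common DCG variant. The abstract does not specify the pooling strategy, the relevance definition, or the DCG formulation actually used.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def dcg(gains):
    """Discounted cumulative gain with a log2(rank + 1) discount (assumed variant)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg(ranked_gains):
    """DCG of the observed ranking divided by the DCG of the ideal ranking."""
    ideal = dcg(sorted(ranked_gains, reverse=True))
    return dcg(ranked_gains) / ideal if ideal > 0 else 0.0


def ndcg_for_query(query_vec, candidates):
    """Rank (vector, gain) candidates by cosine similarity to the query, then score."""
    ranked = sorted(candidates, key=lambda c: cosine(query_vec, c[0]), reverse=True)
    return ndcg([gain for _, gain in ranked])


# Toy "embeddings" of one query chapter and four candidate chapters.
# Gains are hypothetical: 1 = same book as the query, 0 = different book.
query = [1.0, 0.0, 0.0]
clean_candidates = [
    ([0.9, 0.1, 0.0], 1),
    ([0.0, 1.0, 0.0], 0),
    ([0.8, 0.3, 0.1], 1),
    ([0.1, 0.0, 1.0], 0),
]
# The same chapters with small perturbations standing in for OCR noise.
ocr_candidates = [
    ([0.85, 0.2, 0.05], 1),
    ([0.1, 0.9, 0.1], 0),
    ([0.7, 0.4, 0.2], 1),
    ([0.2, 0.1, 0.9], 0),
]

clean_score = ndcg_for_query(query, clean_candidates)
ocr_score = ndcg_for_query(query, ocr_candidates)
print(f"NDCG (clean): {clean_score:.3f}")
print(f"NDCG (OCR'd): {ocr_score:.3f}")
```

In this toy setup the perturbation does not change the order of the candidates, so both rankings receive the same NDCG. Comparing the two scores across many query chapters is one way to quantify how much OCR noise degrades a ranking; a real experiment would replace the toy vectors with embeddings produced by an actual BERT model.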


