arXiv - CS - Digital Libraries: Latest Papers, Page 2

Exploring the applicability of Large Language Models to citation context analysis

arXiv - CS - Digital Libraries Pub Date : 2024-09-04 DOI: arxiv-2409.02443

Kai Nishikawa, Hitoshi Koshiba

Abstract: Unlike traditional citation analysis, which assumes that all citations in a paper are equivalent, citation context analysis considers the contextual information of individual citations. However, citation context analysis requires creating large amounts of data through annotation, which hinders the widespread use of this methodology. This study explored the applicability of Large Language Models (LLMs), particularly ChatGPT, to citation context analysis by comparing LLM and human annotation results. The results show that LLM annotation is as good as or better than human annotation in terms of consistency but poor in terms of predictive performance. Thus, having LLMs immediately replace human annotators in citation context analysis is inappropriate. However, the annotation results obtained by LLMs can be used as reference information when narrowing the annotation results obtained by multiple human annotators to one, or LLMs can be used as one of the annotators when it is difficult to prepare sufficient human annotators. This study provides basic findings important for the future development of citation context analyses.

Citations: 0

Coverage and metadata availability of African publications in OpenAlex: A comparative analysis

arXiv - CS - Digital Libraries Pub Date : 2024-09-02 DOI: arxiv-2409.01120

Patricia Alonso-Alvarez, Nees Jan van Eck

Abstract: Unlike traditional proprietary data sources like Scopus and Web of Science (WoS), OpenAlex emphasizes its comprehensive coverage, particularly highlighting its inclusion of the humanities, non-English languages, and research from the Global South. Strengthening diversity and inclusivity in science is crucial for ethical and practical reasons. This paper analyses OpenAlex's coverage and metadata availability of African-based publications. For this purpose, we compare OpenAlex with Scopus, WoS, and African Journals Online (AJOL). We first compare the coverage of African research publications in OpenAlex against that of WoS, Scopus, and AJOL. We then assess and compare the available metadata for OpenAlex, Scopus, and WoS publications. Our analysis shows that OpenAlex offers the most extensive publication coverage. In terms of metadata, OpenAlex offers a high coverage of publication and author information. It performs worse regarding affiliations, references, and funder information. Importantly, our results also show that metadata availability in OpenAlex is better for publications that are also indexed in Scopus or WoS.

Citations: 0

Simbanex: Similarity-based Exploration of IEEE VIS Publications

arXiv - CS - Digital Libraries Pub Date : 2024-08-31 DOI: arxiv-2409.00478

Daniel Witschard, Ilir Jusufi, Andreas Kerren

Citations: 0

Post-OCR Text Correction for Bulgarian Historical Documents

arXiv - CS - Digital Libraries Pub Date : 2024-08-31 DOI: arxiv-2409.00527

Angel Beshirov, Milena Dobreva, Dimitar Dimitrov, Momchil Hardalov, Ivan Koychev, Preslav Nakov

Abstract: The digitization of historical documents is crucial for preserving the cultural heritage of society. An important step in this process is converting scanned images to text using Optical Character Recognition (OCR), which can enable further search, information extraction, etc. Unfortunately, this is a hard problem, as standard OCR tools are not tailored to deal with historical orthography or with challenging layouts. Thus, it is standard to apply an additional text correction step to the OCR output when dealing with such documents. In this work, we focus on Bulgarian, and we create the first benchmark dataset for evaluating OCR text correction for historical Bulgarian documents written in the first standardized Bulgarian orthography: the Drinov orthography from the 19th century. We further develop a method for automatically generating synthetic data in this orthography, as well as in the subsequent Ivanchev orthography, by leveraging vast amounts of contemporary Bulgarian literature texts. We then use state-of-the-art LLMs and an encoder-decoder framework, which we augment with diagonal attention loss and copy and coverage mechanisms, to improve the post-OCR text correction. The proposed method reduces the errors introduced during recognition and improves the quality of the documents by 25%, which is an increase of 16% compared to the state-of-the-art on the ICDAR 2019 Bulgarian dataset. We release our data and code at https://github.com/angelbeshirov/post-ocr-text-correction.

Citations: 0

CLOCR-C: Context Leveraging OCR Correction with Pre-trained Language Models

arXiv - CS - Digital Libraries Pub Date : 2024-08-30 DOI: arxiv-2408.17428

Jonathan Bourne

Abstract: The digitisation of historical print media archives is crucial for increasing accessibility to contemporary records. However, the process of Optical Character Recognition (OCR) used to convert physical records to digital text is prone to errors, particularly in the case of newspapers and periodicals due to their complex layouts. This paper introduces Context Leveraging OCR Correction (CLOCR-C), which utilises the infilling and context-adaptive abilities of transformer-based language models (LMs) to improve OCR quality. The study aims to determine whether LMs can perform post-OCR correction and improve downstream NLP tasks, and to assess the value of providing the socio-cultural context as part of the correction process. Experiments were conducted using seven LMs on three datasets: the 19th Century Serials Edition (NCSE) and two datasets from the Overproof collection. The results demonstrate that some LMs can significantly reduce error rates, with the top-performing model achieving over a 60% reduction in character error rate on the NCSE dataset. The OCR improvements extend to downstream tasks, such as Named Entity Recognition, with increased Cosine Named Entity Similarity. Furthermore, the study shows that providing socio-cultural context in the prompts improves performance, while misleading prompts lower performance. In addition to the findings, this study releases a dataset of 91 transcribed articles from the NCSE, containing a total of 40 thousand words, to support further research in this area. The findings suggest that CLOCR-C is a promising approach for enhancing the quality of existing digital archives by leveraging the socio-cultural information embedded in the LMs and the text requiring correction.
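
The abstract above reports its headline result as a relative reduction in character error rate (CER). For readers unfamiliar with the metric, the following is a minimal sketch of how CER and a relative reduction are commonly computed. It is a generic illustration, not the paper's evaluation code: the function names are hypothetical, and the definition used here (character-level edit distance divided by reference length) is an assumption about the metric, since the listing does not state the exact variant.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance; insert, delete, substitute each cost 1."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance normalised by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


def relative_reduction(cer_before: float, cer_after: float) -> float:
    """Fraction of the original error rate removed by correction."""
    if cer_before <= 0:
        raise ValueError("cer_before must be positive")
    return (cer_before - cer_after) / cer_before
```

Under this definition, a correction step that lowers CER from 0.10 to 0.04 would count as a 60% relative reduction, even though the absolute change is only six percentage points.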

Citations: 0

Evaluating the Accuracy of the Labeling System in Web of Science for the Sustainable Development Goals

arXiv - CS - Digital Libraries Pub Date : 2024-08-30 DOI: arxiv-2408.17084

Yu Zhao, Li Li, Zhesi Shen

Abstract: Monitoring and fostering research aligned with the Sustainable Development Goals (SDGs) is crucial for formulating evidence-based policies, identifying best practices, and promoting global collaboration. The key step is developing a labeling system to map research publications to their related SDGs. The SDG labeling system integrated in Web of Science (WoS), which assigns citation topics instead of individual publications to SDGs, has emerged as a promising tool. However, we still lack a comprehensive evaluation of the performance of the WoS labeling system. By comparing with the Bergon approach, we systematically assessed the relatedness between citation topics and SDGs. Our analysis identified 15% of topics showing low relatedness to their assigned SDGs at a 1% threshold. Notably, SDGs such as '11 Cities', '07 Energy', and '13 Climate' exhibited higher percentages of low-relatedness topics. In addition, we revealed that certain topics are significantly underrepresented in their relevant SDGs, particularly for '02 Hunger', '12 Consumption', and '15 Land'. This study underscores the critical need for continual refinement and validation of SDG labeling systems in WoS.

Citations: 0

μgat: Improving Single-Page Document Parsing by Providing Multi-Page Context

arXiv - CS - Digital Libraries Pub Date : 2024-08-28 DOI: arxiv-2408.15646

Fabio Quattrini, Carmine Zaccagnino, Silvia Cascianelli, Laura Righi, Rita Cucchiara

Abstract: Regesta are catalogs of summaries of other documents and, in some cases, are the only source of information about the content of such full-length documents. For this reason, they are of great interest to scholars in many social and humanities fields. In this work, we focus on Regesta Pontificum Romanorum, a large collection of papal registers. Regesta are visually rich documents, where the layout is as important as the text content to convey the contained information through the structure, and are inherently multi-page documents. Among Digital Humanities techniques that can help scholars efficiently exploit regesta and other documental sources in the form of scanned documents, Document Parsing has emerged as a task to process document images and convert them into machine-readable structured representations, usually markup language. However, current models focus on scientific and business documents, and most of them consider only single-page documents. To overcome this limitation, in this work, we propose μgat, an extension of the recently proposed document parsing Nougat architecture, which can handle elements spanning beyond the single-page limits. Specifically, we adapt Nougat to process a larger, multi-page context, consisting of the previous and the following page, while parsing the current page. Experimental results, both qualitative and quantitative, demonstrate the effectiveness of our proposed approach also in the case of the challenging Regesta Pontificum Romanorum.

Citations: 0

LyCon: Lyrics Reconstruction from the Bag-of-Words Using Large Language Models

arXiv - CS - Digital Libraries Pub Date : 2024-08-27 DOI: arxiv-2408.14750

Haven Kim, Kahyun Choi

Citations: 0

Transdisciplinary research: How much is academia heeding the call to work more closely with societal stakeholders such as industry, government, and nonprofits?

arXiv - CS - Digital Libraries Pub Date : 2024-08-26 DOI: arxiv-2408.14024

Philip James Purnell

Abstract: Transdisciplinary research, the co-creation of scientific knowledge by multiple stakeholders, is considered essential for addressing major societal problems. Research policy makers and academic leaders frequently call for closer collaboration between academia and societal stakeholders to address the grand challenges of our time. This bibliometric study evaluates progress in collaboration between academia and three societal stakeholders: industry, government, and nonprofit organisations. It analyses the level of co-publishing between academia and these societal stakeholders over the period 2013-2022. We found that research collaboration between academia and all stakeholder types studied grew in absolute terms. However, academia-industry collaboration declined 16% relative to overall academic output while academia-government and academia-nonprofit collaboration grew at roughly the same pace as academic output. Country and field of research breakdowns revealed wide variance. In light of previous work, we consider potential explanations for the gap between policymakers' aspirations and the real global trends. This study is a useful demonstration of large-scale, quantitative bibliometric techniques for research policymakers to track the impact of decisions related to funding, intellectual property law, and nonprofit support.

Citations: 0

Comparison of Sustainable Development Goals Labeling Systems based on Topic Coverage

arXiv - CS - Digital Libraries Pub Date : 2024-08-24 DOI: arxiv-2408.13455

Li Li, Yu Zhao, Zhesi Shen

Citations: 0