{"title":"Exploring the applicability of Large Language Models to citation context analysis","authors":"Kai Nishikawa, Hitoshi Koshiba","doi":"arxiv-2409.02443","DOIUrl":"https://doi.org/arxiv-2409.02443","url":null,"abstract":"Unlike traditional citation analysis -- which assumes that all citations in a paper are equivalent -- citation context analysis considers the contextual information of individual citations. However, citation context analysis requires creating large amounts of data through annotation, which hinders the widespread use of this methodology. This study explored the applicability of Large Language Models (LLMs) -- particularly ChatGPT -- to citation context analysis by comparing LLM and human annotation results. The results show that the LLM annotation is as good as or better than the human annotation in terms of consistency but poor in terms of predictive performance. Thus, having LLMs immediately replace human annotators in citation context analysis is inappropriate. However, the annotation results obtained by LLMs can be used as reference information when narrowing the annotation results obtained by multiple human annotators to one, or LLMs can be used as one of the annotators when it is difficult to prepare sufficient human annotators. This study provides basic findings important for the future development of citation context analyses.","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"65 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142219102","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Coverage and metadata availability of African publications in OpenAlex: A comparative analysis","authors":"Patricia Alonso-Alvarez, Nees Jan van Eck","doi":"arxiv-2409.01120","DOIUrl":"https://doi.org/arxiv-2409.01120","url":null,"abstract":"Unlike traditional proprietary data sources like Scopus and Web of Science (WoS), OpenAlex emphasizes its comprehensive coverage, particularly highlighting its inclusion of the humanities, non-English languages, and research from the Global South. Strengthening diversity and inclusivity in science is crucial for ethical and practical reasons. This paper analyses OpenAlex's coverage and metadata availability of African-based publications. For this purpose, we compare OpenAlex with Scopus, WoS, and African Journals Online (AJOL). We first compare the coverage of African research publications in OpenAlex against that of WoS, Scopus, and AJOL. We then assess and compare the available metadata for OpenAlex, Scopus, and WoS publications. Our analysis shows that OpenAlex offers the most extensive publication coverage. In terms of metadata, OpenAlex offers a high coverage of publication and author information. It performs worse regarding affiliations, references, and funder information. Importantly, our results also show that metadata availability in OpenAlex is better for publications that are also indexed in Scopus or WoS.","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"68 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142219099","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Simbanex: Similarity-based Exploration of IEEE VIS Publications","authors":"Daniel Witschard, Ilir Jusufi, Andreas Kerren","doi":"arxiv-2409.00478","DOIUrl":"https://doi.org/arxiv-2409.00478","url":null,"abstract":"Embeddings are powerful tools for transforming complex and unstructured data into numeric formats suitable for computational analysis tasks. In this work, we use multiple embeddings for similarity calculations to be applied in bibliometrics and scientometrics. We build a multivariate network (MVN) from a large set of scientific publications and explore an aspect-driven analysis approach to reveal similarity patterns in the given publication data. By dividing our MVN into separately embeddable aspects, we are able to obtain a flexible vector representation which we use as input to a novel method of similarity-based clustering. Based on these preprocessing steps, we developed a visual analytics application, called Simbanex, that has been designed for the interactive visual exploration of similarity patterns within the underlying publications.","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"307 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142219100","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Post-OCR Text Correction for Bulgarian Historical Documents","authors":"Angel Beshirov, Milena Dobreva, Dimitar Dimitrov, Momchil Hardalov, Ivan Koychev, Preslav Nakov","doi":"arxiv-2409.00527","DOIUrl":"https://doi.org/arxiv-2409.00527","url":null,"abstract":"The digitization of historical documents is crucial for preserving the cultural heritage of society. An important step in this process is converting scanned images to text using Optical Character Recognition (OCR), which can enable further search, information extraction, etc. Unfortunately, this is a hard problem as standard OCR tools are not tailored to deal with historical orthography as well as with challenging layouts. Thus, it is standard to apply an additional text correction step on the OCR output when dealing with such documents. In this work, we focus on Bulgarian, and we create the first benchmark dataset for evaluating the OCR text correction for historical Bulgarian documents written in the first standardized Bulgarian orthography: the Drinov orthography from the 19th century. We further develop a method for automatically generating synthetic data in this orthography, as well as in the subsequent Ivanchev orthography, by leveraging vast amounts of contemporary Bulgarian literary texts. We then use state-of-the-art LLMs and an encoder-decoder framework, which we augment with diagonal attention loss and copy and coverage mechanisms, to improve the post-OCR text correction. The proposed method reduces the errors introduced during recognition and improves the quality of the documents by 25%, which is an increase of 16% compared to the state-of-the-art on the ICDAR 2019 Bulgarian dataset. We release our data and code at https://github.com/angelbeshirov/post-ocr-text-correction.","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"4 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142219101","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CLOCR-C: Context Leveraging OCR Correction with Pre-trained Language Models","authors":"Jonathan Bourne","doi":"arxiv-2408.17428","DOIUrl":"https://doi.org/arxiv-2408.17428","url":null,"abstract":"The digitisation of historical print media archives is crucial for increasing accessibility to contemporary records. However, the process of Optical Character Recognition (OCR) used to convert physical records to digital text is prone to errors, particularly in the case of newspapers and periodicals due to their complex layouts. This paper introduces Context Leveraging OCR Correction (CLOCR-C), which utilises the infilling and context-adaptive abilities of transformer-based language models (LMs) to improve OCR quality. The study aims to determine whether LMs can perform post-OCR correction and improve downstream NLP tasks, and to assess the value of providing the socio-cultural context as part of the correction process. Experiments were conducted using seven LMs on three datasets: the 19th Century Serials Edition (NCSE) and two datasets from the Overproof collection. The results demonstrate that some LMs can significantly reduce error rates, with the top-performing model achieving over a 60% reduction in character error rate on the NCSE dataset. The OCR improvements extend to downstream tasks, such as Named Entity Recognition, with increased Cosine Named Entity Similarity. Furthermore, the study shows that providing socio-cultural context in the prompts improves performance, while misleading prompts lower performance. In addition to the findings, this study releases a dataset of 91 transcribed articles from the NCSE, containing a total of 40 thousand words, to support further research in this area. The findings suggest that CLOCR-C is a promising approach for enhancing the quality of existing digital archives by leveraging the socio-cultural information embedded in the LMs and the text requiring correction.","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"38 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142219106","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Evaluating the Accuracy of the Labeling System in Web of Science for the Sustainable Development Goals","authors":"Yu Zhao, Li Li, Zhesi Shen","doi":"arxiv-2408.17084","DOIUrl":"https://doi.org/arxiv-2408.17084","url":null,"abstract":"Monitoring and fostering research aligned with the Sustainable Development Goals (SDGs) is crucial for formulating evidence-based policies, identifying best practices, and promoting global collaboration. The key step is developing a labeling system to map research publications to their related SDGs. The SDGs labeling system integrated in Web of Science (WoS), which assigns citation topics instead of individual publications to SDGs, has emerged as a promising tool. However, we still lack a comprehensive evaluation of the performance of the WoS labeling system. By comparing with the Bergon approach, we systematically assessed the relatedness between citation topics and SDGs. Our analysis identified 15% of topics showing low relatedness to their assigned SDGs at a 1% threshold. Notably, SDGs such as '11 Cities', '07 Energy', and '13 Climate' exhibited higher percentages of topics with low relatedness. In addition, we revealed that certain topics are significantly underrepresented in their relevant SDGs, particularly for '02 Hunger', '12 Consumption', and '15 Land'. This study underscores the critical need for continual refinement and validation of SDGs labeling systems in WoS.","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"5 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142219103","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"μgat: Improving Single-Page Document Parsing by Providing Multi-Page Context","authors":"Fabio Quattrini, Carmine Zaccagnino, Silvia Cascianelli, Laura Righi, Rita Cucchiara","doi":"arxiv-2408.15646","DOIUrl":"https://doi.org/arxiv-2408.15646","url":null,"abstract":"Regesta are catalogs of summaries of other documents and, in some cases, are the only source of information about the content of such full-length documents. For this reason, they are of great interest to scholars in many social and humanities fields. In this work, we focus on Regesta Pontificum Romanorum, a large collection of papal registers. Regesta are visually rich documents, where the layout is as important as the text content to convey the contained information through the structure, and are inherently multi-page documents. Among Digital Humanities techniques that can help scholars efficiently exploit regesta and other documentary sources in the form of scanned documents, Document Parsing has emerged as a task to process document images and convert them into machine-readable structured representations, usually markup language. However, current models focus on scientific and business documents, and most of them consider only single-page documents. To overcome this limitation, in this work, we propose μgat, an extension of the recently proposed Nougat document parsing architecture, which can handle elements spanning beyond single-page limits. Specifically, we adapt Nougat to process a larger, multi-page context, consisting of the previous and the following page, while parsing the current page. Experimental results, both qualitative and quantitative, demonstrate the effectiveness of our proposed approach also in the case of the challenging Regesta Pontificum Romanorum.","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"22 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142219108","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"LyCon: Lyrics Reconstruction from the Bag-of-Words Using Large Language Models","authors":"Haven Kim, Kahyun Choi","doi":"arxiv-2408.14750","DOIUrl":"https://doi.org/arxiv-2408.14750","url":null,"abstract":"This paper addresses the unique challenge of conducting research in lyric studies, where direct use of lyrics is often restricted due to copyright concerns. Unlike typical data, internet-sourced lyrics are frequently protected under copyright law, necessitating alternative approaches. Our study introduces a novel method for generating copyright-free lyrics from publicly available Bag-of-Words (BoW) datasets, which contain the vocabulary of lyrics but not the lyrics themselves. Utilizing metadata associated with BoW datasets and large language models, we successfully reconstructed lyrics. We have compiled and made available a dataset of reconstructed lyrics, LyCon, aligned with metadata from renowned sources including the Million Song Dataset, Deezer Mood Detection Dataset, and AllMusic Genre Dataset, available for public access. We believe that the integration of metadata such as mood annotations or genres enables a variety of academic experiments on lyrics, such as conditional lyric generation.","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"72 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142219105","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Transdisciplinary research: How much is academia heeding the call to work more closely with societal stakeholders such as industry, government, and nonprofits?","authors":"Philip James Purnell","doi":"arxiv-2408.14024","DOIUrl":"https://doi.org/arxiv-2408.14024","url":null,"abstract":"Transdisciplinary research, the co-creation of scientific knowledge by multiple stakeholders, is considered essential for addressing major societal problems. Research policy makers and academic leaders frequently call for closer collaboration between academia and societal stakeholders to address the grand challenges of our time. This bibliometric study evaluates progress in collaboration between academia and three societal stakeholders: industry, government, and nonprofit organisations. It analyses the level of co-publishing between academia and these societal stakeholders over the period 2013-2022. We found that research collaboration between academia and all stakeholder types studied grew in absolute terms. However, academia-industry collaboration declined 16% relative to overall academic output while academia-government and academia-nonprofit collaboration grew at roughly the same pace as academic output. Country and field of research breakdowns revealed wide variance. In light of previous work, we consider potential explanations for the gap between policymakers' aspirations and the real global trends. This study is a useful demonstration of large scale, quantitative bibliometric techniques for research policymakers to track the impact of decisions related to funding, intellectual property law, and nonprofit support.","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"59 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142219107","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Comparison of Sustainable Development Goals Labeling Systems based on Topic Coverage","authors":"Li Li, Yu Zhao, Zhesi Shen","doi":"arxiv-2408.13455","DOIUrl":"https://doi.org/arxiv-2408.13455","url":null,"abstract":"With the growing importance of sustainable development goals (SDGs), various labeling systems have emerged for effective monitoring and evaluation. This study assesses six labeling systems across 1.85 million documents at both paper level and topic level. Our findings indicate that the SDGO and SDSN systems are more aggressive, while systems such as Auckland, Aurora, SIRIS, and Elsevier exhibit significant topic consistency, with similarity scores exceeding 0.75 for most SDGs. However, similarities at the paper level generally fall short, particularly for specific SDGs like SDG 10. We highlight the crucial role of contextual information in keyword-based labeling systems, noting that overlooking context can introduce bias in the retrieval of papers (e.g., variations in \"migration\" between biomedical and geographical contexts). These results reveal substantial discrepancies among SDG labeling systems, emphasizing the need for improved methodologies to enhance the accuracy and relevance of SDG evaluations.","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"167 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142219109","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}