{"title":"Shacl4Bib: custom validation of library data","authors":"Péter Király","doi":"arxiv-2405.09177","DOIUrl":"https://doi.org/arxiv-2405.09177","url":null,"abstract":"The Shapes Constraint Language (SHACL) is a formal language for validating RDF graphs against a set of conditions. Following this idea and implementing a subset of the language, the Metadata Quality Assessment Framework provides Shacl4Bib: a mechanism to define SHACL-like rules for data sources in non-RDF based formats, such as XML, CSV and JSON. QA catalogue extends this concept further to MARC21, UNIMARC and PICA data. The criteria can be defined either with YAML or JSON configuration files or with Java code. Libraries can validate their data against criteria expressed in a unified language, that improves the clarity and the reusability of custom validation processes.","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"68 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141060436","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Distinguishing articles in questionable and non-questionable journals using quantitative indicators associated with quality","authors":"Dimity Stephen","doi":"arxiv-2405.06308","DOIUrl":"https://doi.org/arxiv-2405.06308","url":null,"abstract":"This study investigates the viability of distinguishing articles in questionable journals (QJs) from those in non-QJs on the basis of quantitative indicators typically associated with quality. Subsequently, I examine what can be deduced about the quality of articles in QJs based on the differences observed. I contrast the length of abstracts and full-texts, prevalence of spelling errors, text readability, number of references and citations, the size and internationality of the author team, the documentation of ethics and informed consent statements, and the presence of erroneous decisions based on statistical errors in 1,714 articles from 31 QJs, 1,691 articles from 16 journals indexed in Web of Science (WoS), and 1,900 articles from 45 mid-tier journals, all in the field of psychology. The results suggest that QJ articles do diverge from the disciplinary standards set by peer-reviewed journals in psychology on quantitative indicators of quality that tend to reflect the effect of peer review and editorial processes. However, mid-tier and WoS journals are also affected by potential quality concerns, such as under-reporting of ethics and informed consent processes and the presence of errors in interpreting statistics. Further research is required to develop a comprehensive understanding of the quality of articles in QJs.","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"131 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140932664","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Can citations tell us about a paper's reproducibility? A case study of machine learning papers","authors":"Rochana R. Obadage, Sarah M. Rajtmajer, Jian Wu","doi":"arxiv-2405.03977","DOIUrl":"https://doi.org/arxiv-2405.03977","url":null,"abstract":"The iterative character of work in machine learning (ML) and artificial intelligence (AI) and reliance on comparisons against benchmark datasets emphasize the importance of reproducibility in that literature. Yet, resource constraints and inadequate documentation can make running replications particularly challenging. Our work explores the potential of using downstream citation contexts as a signal of reproducibility. We introduce a sentiment analysis framework applied to citation contexts from papers involved in Machine Learning Reproducibility Challenges in order to interpret the positive or negative outcomes of reproduction attempts. Our contributions include training classifiers for reproducibility-related contexts and sentiment analysis, and exploring correlations between citation context sentiment and reproducibility scores. Study data, software, and an artifact appendix are publicly available at https://github.com/lamps-lab/ccair-ai-reproducibility .","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"9 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140932382","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"NACSOS-nexus: NLP Assisted Classification, Synthesis and Online Screening with New and EXtended Usage Scenarios","authors":"Tim Repke, Max Callaghan","doi":"arxiv-2405.04621","DOIUrl":"https://doi.org/arxiv-2405.04621","url":null,"abstract":"NACSOS is a web-based platform for curating data used in systematic maps. It contains several (experimental) features that aid the evidence synthesis process from finding and ingesting primary data (mainly scientific publications), basic search and exploration thereof, but mainly the handling of managing the manual and automated annotations. The platform supports prioritised screening algorithms and is the first to fully implement statistical stopping criteria. Annotations by multiple coders can be resolved and customisable quality metrics are computed on-the-fly. In its current state, the annotations are performed on document level. The ecosystem around NACSOS offers packages for accessing the underlying database and practical utility functions that have proven useful in a multitude of projects. Further, it provides the backbone of living maps, review ecosystems, and our public literature hub for sharing high-quality curated corpora.","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"4 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140932311","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Research information in the light of artificial intelligence: quality and data ecologies","authors":"Otmane Azeroual, Tibor Koltay","doi":"arxiv-2405.12997","DOIUrl":"https://doi.org/arxiv-2405.12997","url":null,"abstract":"This paper presents multi- and interdisciplinary approaches for finding the appropriate AI technologies for research information. Professional research information management (RIM) is becoming increasingly important as an expressly data-driven tool for researchers. It is not only the basis of scientific knowledge processes, but also related to other data. A concept and a process model of the elementary phases from the start of the project to the ongoing operation of the AI methods in the RIM is presented, portraying the implementation of an AI project, meant to enable universities and research institutions to support their researchers in dealing with incorrect and incomplete research information, while it is being stored in their RIMs. Our aim is to show how research information harmonizes with the challenges of data literacy and data quality issues, related to AI, also wanting to underline that any project can be successful if the research institutions and various departments of universities, involved work together and appropriate support is offered to improve research information and data management.","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"45 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141149819","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On the performativity of SDG classifications in large bibliometric databases","authors":"Matteo Ottaviani, Stephan Stahlschmidt","doi":"arxiv-2405.03007","DOIUrl":"https://doi.org/arxiv-2405.03007","url":null,"abstract":"Large bibliometric databases, such as Web of Science, Scopus, and OpenAlex, facilitate bibliometric analyses, but are performative, affecting the visibility of scientific outputs and the impact measurement of participating entities. Recently, these databases have taken up the UN's Sustainable Development Goals (SDGs) in their respective classifications, which have been criticised for their diverging nature. This work proposes using the feature of large language models (LLMs) to learn about the \"data bias\" injected by diverse SDG classifications into bibliometric data by exploring five SDGs. We build a LLM that is fine-tuned in parallel by the diverse SDG classifications inscribed into the databases' SDG classifications. Our results show high sensitivity in model architecture, classified publications, fine-tuning process, and natural language generation. The wide arbitrariness at different levels raises concerns about using LLM in research practice.","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"18 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140884245","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Assembling ensembling: An adventure in approaches across disciplines","authors":"Amanda Bleichrodt, Lydia Bourouiba, Gerardo Chowell, Eric T. Lofgren, J. Michael Reed, Sadie J. Ryan, Nina H. Fefferman","doi":"arxiv-2405.02599","DOIUrl":"https://doi.org/arxiv-2405.02599","url":null,"abstract":"When we think of model ensembling or ensemble modeling, there are many possibilities that come to mind in different disciplines. For example, one might think of a set of descriptions of a phenomenon in the world, perhaps a time series or a snapshot of multivariate space, and perhaps that set is comprised of data-independent descriptions, or perhaps it is quite intentionally fit *to* data, or even a suite of data sets with a common theme or intention. The very meaning of 'ensemble' - a collection together - conjures different ideas across and even within disciplines approaching phenomena. In this paper, we present a typology of the scope of these potential perspectives. It is not our goal to present a review of terms and concepts, nor is it to convince all disciplines to adopt a common suite of terms, which we view as futile. Rather, our goal is to disambiguate terms, concepts, and processes associated with 'ensembles' and 'ensembling' in order to facilitate communication, awareness, and possible adoption of tools across disciplines.","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"15 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140884180","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Workflow for GLAM Metadata Crosswalk","authors":"Arianna Moretti, Ivan Heibi, Silvio Peroni","doi":"arxiv-2405.02113","DOIUrl":"https://doi.org/arxiv-2405.02113","url":null,"abstract":"The acquisition of physical artifacts not only involves transferring existing information into the digital ecosystem but also generates information as a process itself, underscoring the importance of meticulous management of FAIR data and metadata. In addition, the diversity of objects within the cultural heritage domain is reflected in a multitude of descriptive models. The digitization process expands the opportunities for exchange and joint utilization, granted that the descriptive schemas are made interoperable in advance. To achieve this goal, we propose a replicable workflow for metadata schema crosswalks that facilitates the preservation and accessibility of cultural heritage in the digital ecosystem. This work presents a methodology for metadata generation and management in the case study of the digital twin of the temporary exhibition \"The Other Renaissance - Ulisse Aldrovandi and the Wonders of the World\". The workflow delineates a systematic, step-by-step transformation of tabular data into RDF format, to enhance Linked Open Data. The methodology adopts the RDF Mapping Language (RML) technology for converting data to RDF with a human contribution involvement. This last aspect entails an interaction between digital humanists and domain experts through surveys leading to the abstraction and reformulation of domain-specific knowledge, to be exploited in the process of formalizing and converting information.","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"17 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140883899","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Callico: a Versatile Open-Source Document Image Annotation Platform","authors":"Christopher Kermorvant, Eva Bardou, Manon Blanco, Bastien Abadie","doi":"arxiv-2405.01071","DOIUrl":"https://doi.org/arxiv-2405.01071","url":null,"abstract":"This paper presents Callico, a web-based open source platform designed to simplify the annotation process in document recognition projects. The move towards data-centric AI in machine learning and deep learning underscores the importance of high-quality data, and the need for specialised tools that increase the efficiency and effectiveness of generating such data. For document image annotation, Callico offers dual-display annotation for digitised documents, enabling simultaneous visualisation and annotation of scanned images and text. This capability is critical for OCR and HTR model training, document layout analysis, named entity recognition, form-based key value annotation or hierarchical structure annotation with element grouping. The platform supports collaborative annotation with versatile features backed by a commitment to open source development, high-quality code standards and easy deployment via Docker. Illustrative use cases - including the transcription of the Belfort municipal registers, the indexing of French World War II prisoners for the ICRC, and the extraction of personal information from the Socface project's census lists - demonstrate Callico's applicability and utility.","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"31 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140830790","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Clustering Running Titles to Understand the Printing of Early Modern Books","authors":"Nikolai Vogler, Kartik Goyal, Samuel V. Lemley, D. J. Schuldt, Christopher N. Warren, Max G'Sell, Taylor Berg-Kirkpatrick","doi":"arxiv-2405.00752","DOIUrl":"https://doi.org/arxiv-2405.00752","url":null,"abstract":"We propose a novel computational approach to automatically analyze the physical process behind printing of early modern letterpress books via clustering the running titles found at the top of their pages. Specifically, we design and compare custom neural and feature-based kernels for computing pairwise visual similarity of a scanned document's running titles and cluster the titles in order to track any deviations from the expected pattern of a book's printing. Unlike body text which must be reset for every page, the running titles are one of the static type elements in a skeleton forme i.e. the frame used to print each side of a sheet of paper, and were often re-used during a book's printing. To evaluate the effectiveness of our approach, we manually annotate the running title clusters on about 1600 pages across 8 early modern books of varying size and formats. Our method can detect potential deviation from the expected patterns of such skeleton formes, which helps bibliographers understand the phenomena associated with a text's transmission, such as censorship. We also validate our results against a manual bibliographic analysis of a counterfeit early edition of Thomas Hobbes' Leviathan (1651).","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"12 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140830789","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}