Research Citations Building Trust in Wikipedia
Michael Taylor, Carlos Areia, Kath Burton, Charles Watkinson
arXiv:2409.11948 (arXiv - CS - Digital Libraries), 2024-09-18. https://arxiv.org/abs/2409.11948
Abstract: The use of Wikipedia citations in scholarly research has been the topic of much inquiry over the past decade. A cross-publisher study (Taylor & Francis and University of Michigan Press), convened by Digital Science, was established in late 2022 to explore author sentiment towards Wikipedia as a trusted source of information. A short survey was designed to poll published authors about their views and uses of Wikipedia, and to explore how the increased addition of research citations in Wikipedia might help combat misinformation in the context of increasing public engagement with, and access to, validated research sources. With 21,854 surveys sent, targeting 40,402 papers mentioned in Wikipedia, a total of 750 complete surveys from 60 countries were included in this analysis. In general, responses revealed a positive sentiment towards research citation in Wikipedia and towards researcher engagement practices. However, our sub-analysis revealed statistically significant differences when comparing articles vs. books and across disciplines, but not open vs. closed access. This study will open the door to further research and deepen our understanding of authors' perceived trustworthiness of the representation of their research in Wikipedia.
{"title":"Publishing Instincts: An Exploration-Exploitation Framework for Studying Academic Publishing Behavior and \"Home Venues\"","authors":"Teddy Lazebnik, Shir Aviv-Reuven, Ariel Rosenfeld","doi":"arxiv-2409.12158","DOIUrl":"https://doi.org/arxiv-2409.12158","url":null,"abstract":"Scholarly communication is vital to scientific advancement, enabling the\u0000exchange of ideas and knowledge. When selecting publication venues, scholars\u0000consider various factors, such as journal relevance, reputation, outreach, and\u0000editorial standards and practices. However, some of these factors are\u0000inconspicuous or inconsistent across venues and individual publications. This\u0000study proposes that scholars' decision-making process can be conceptualized and\u0000explored through the biologically inspired exploration-exploitation (EE)\u0000framework, which posits that scholars balance between familiar and\u0000under-explored publication venues. Building on the EE framework, we introduce a\u0000grounded definition for \"Home Venues\" (HVs) - an informal concept used to\u0000describe the set of venues where a scholar consistently publishes - and\u0000investigate their emergence and key characteristics. Our analysis reveals that\u0000the publication patterns of roughly three-quarters of computer science scholars\u0000align with the expectations of the EE framework. For these scholars, HVs\u0000typically emerge and stabilize after approximately 15-20 publications.\u0000Additionally, scholars with higher h-indexes or a greater number of\u0000publications, tend to have higher-ranking journals as their HVs.","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"51 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142251645","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Evaluating the Linguistic Coverage of OpenAlex: An Assessment of Metadata Accuracy and Completeness
Lucía Céspedes, Diego Kozlowski, Carolina Pradier, Maxime Holmberg Sainte-Marie, Natsumi Solange Shokida, Pierre Benz, Constance Poitras, Anton Boudreau Ninkov, Saeideh Ebrahimy, Philips Ayeni, Sarra Filali, Bing Li, Vincent Larivière
arXiv:2409.10633 (arXiv - CS - Digital Libraries), 2024-09-16. https://arxiv.org/abs/2409.10633
Abstract: Clarivate's Web of Science (WoS) and Elsevier's Scopus have for decades been the main sources of bibliometric information. Although highly curated, these closed, proprietary databases are largely biased towards English-language publications, underestimating the use of other languages in research dissemination. Launched in 2022, OpenAlex promised comprehensive, inclusive, and open-source research information. While it is already in use by scholars and research institutions, the quality of its metadata is currently being assessed. This paper contributes to that literature by assessing the completeness and accuracy of its language metadata through a comparison with WoS, as well as an in-depth manual validation of a sample of 6,836 articles. Results show that OpenAlex exhibits a far more balanced linguistic coverage than WoS. However, its language metadata is not always accurate, which leads OpenAlex to overestimate the place of English while underestimating that of other languages. If used critically, OpenAlex can provide comprehensive and representative analyses of the languages used for scholarly publishing. However, more work is needed at the infrastructural level to ensure the quality of language metadata.
{"title":"Towards understanding evolution of science through language model series","authors":"Junjie Dong, Zhuoqi Lyu, Qing Ke","doi":"arxiv-2409.09636","DOIUrl":"https://doi.org/arxiv-2409.09636","url":null,"abstract":"We introduce AnnualBERT, a series of language models designed specifically to\u0000capture the temporal evolution of scientific text. Deviating from the\u0000prevailing paradigms of subword tokenizations and \"one model to rule them all\",\u0000AnnualBERT adopts whole words as tokens and is composed of a base RoBERTa model\u0000pretrained from scratch on the full-text of 1.7 million arXiv papers published\u0000until 2008 and a collection of progressively trained models on arXiv papers at\u0000an annual basis. We demonstrate the effectiveness of AnnualBERT models by\u0000showing that they not only have comparable performances in standard tasks but\u0000also achieve state-of-the-art performances on domain-specific NLP tasks as well\u0000as link prediction tasks in the arXiv citation network. We then utilize probing\u0000tasks to quantify the models' behavior in terms of representation learning and\u0000forgetting as time progresses. Our approach enables the pretrained models to\u0000not only improve performances on scientific text processing tasks but also to\u0000provide insights into the development of scientific discourse over time. The\u0000series of the models is available at https://huggingface.co/jd445/AnnualBERTs.","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"4 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142251649","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Ensuring Adherence to Standards in Experiment-Related Metadata Entered Via Spreadsheets
Martin J. O'Connor, Josef Hardi, Marcos Martínez-Romero, Sowmya Somasundaram, Brendan Honick, Stephen A. Fisher, Ajay Pillai, Mark A. Musen
arXiv:2409.08897 (arXiv - CS - Digital Libraries), 2024-09-13. https://arxiv.org/abs/2409.08897
Abstract: Scientists increasingly recognize the importance of providing rich, standards-adherent metadata to describe their experimental results. Despite the availability of sophisticated tools to assist in the process of data annotation, investigators generally seem to prefer spreadsheets when supplying metadata, despite the limitations of spreadsheets in ensuring metadata consistency and compliance with formal specifications. In this paper, we describe an end-to-end approach that supports spreadsheet-based entry of metadata while ensuring rigorous adherence to community-based metadata standards and providing quality control. Our methods employ several key components: customizable templates that capture metadata standards and that can inform the spreadsheets investigators use to author metadata; controlled terminologies and ontologies for defining metadata values that can be accessed directly from a spreadsheet; and an interactive Web-based tool that allows users to rapidly identify and fix errors in their spreadsheet-based metadata. We demonstrate how this approach is being deployed in a biomedical consortium known as HuBMAP to define and collect metadata about a wide range of biological assays.
{"title":"Intelligent Innovation Dataset on Scientific Research Outcomes and Patents","authors":"Xinran Wu, Hui Zou, Yidan Xing, Jingjing Qu, Qiongxiu Li, Renxia Xue, Xiaoming Fu","doi":"arxiv-2409.06936","DOIUrl":"https://doi.org/arxiv-2409.06936","url":null,"abstract":"Various stakeholders, such as researchers, government agencies, businesses,\u0000and laboratories require reliable scientific research outcomes and patent data\u0000to support their work. These data are crucial for advancing scientific\u0000research, conducting business evaluations, and policy analysis. However,\u0000collecting such data is often a time-consuming and laborious task.\u0000Consequently, many users turn to using openly accessible data for their\u0000research. However, these open data releases may suffer from lack of\u0000relationship between different data sources or limited temporal coverage. In\u0000this context, we present a new Intelligent Innovation Dataset (IIDS dataset),\u0000which comprises six inter-related datasets spanning nearly 120 years,\u0000encompassing paper information, paper citation relationships, patent details,\u0000patent legal statuses, funding information and funding relationship. The\u0000extensive contextual and extensive temporal coverage of the IIDS dataset will\u0000provide researchers with comprehensive data support, enabling them to delve\u0000into in-depth scientific research and conduct thorough data analysis.","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"53 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142219098","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Evaluation of GPT-4V for Transcribing the Urban Renewal Hand-Written Collection","authors":"Myeong Lee, Julia H. P. Hsu","doi":"arxiv-2409.09090","DOIUrl":"https://doi.org/arxiv-2409.09090","url":null,"abstract":"Between 1960 and 1980, urban renewal transformed many cities, creating vast\u0000handwritten records. These documents posed a significant challenge for\u0000researchers due to their volume and handwritten nature. The launch of GPT-4V in\u0000November 2023 offered a breakthrough, enabling large-scale, efficient\u0000transcription and analysis of these historical urban renewal documents.","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"56 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142251652","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

The existence of stealth corrections in scientific literature -- a threat to scientific integrity
Rene Aquarius, Floris Schoeters, Nick Wise, Alex Glynn, Guillaume Cabanac
arXiv:2409.06852 (arXiv - CS - Digital Libraries), 2024-09-10. https://arxiv.org/abs/2409.06852
Abstract:
Introduction: Thorough maintenance of the scientific record is needed to ensure the trustworthiness of its content. This can be undermined by a stealth correction: at least one post-publication change made to a scientific article without a correction note or any other indicator that the publication was temporarily or permanently altered. In this paper we provide several examples of stealth corrections in order to demonstrate that they exist within the scientific literature. As far as we are aware, no documentation of such stealth corrections has previously been reported in the scientific literature.
Methods: We identified stealth corrections ourselves, or found already-reported ones on the public database pubpeer.com or through the social media accounts of known science sleuths.
Results: In total we report 131 articles that were affected by stealth corrections and were published between 2005 and 2024. These stealth corrections were found across multiple publishers and scientific fields.
Conclusion and recommendations: Stealth corrections exist in the scientific literature. This needs to end immediately, as it threatens scientific integrity. We recommend the following: 1) tracking of all changes to the published record by all publishers in an open, uniform and transparent manner, preferably through online submission systems that log every change publicly, making stealth corrections impossible; 2) clear definitions and guidelines on all types of corrections; 3) sustained vigilance of the scientific community to publicly register stealth corrections.

Fine-tuning and Prompt Engineering with Cognitive Knowledge Graphs for Scholarly Knowledge Organization
Gollam Rabby, Sören Auer, Jennifer D'Souza, Allard Oelen
arXiv:2409.06433 (arXiv - CS - Digital Libraries), 2024-09-10. https://arxiv.org/abs/2409.06433
Abstract: The increasing number of published scholarly articles, exceeding 2.5 million yearly, raises the challenge for researchers of following scientific progress. Integrating the contributions from scholarly articles into a novel type of cognitive knowledge graph (CKG) will be a crucial element for accessing and organizing scholarly knowledge, surpassing the insights provided by titles and abstracts. This research focuses on effectively conveying structured scholarly knowledge by utilizing large language models (LLMs) to categorize scholarly articles and describe their contributions in a structured and comparable manner. While previous studies explored language models within specific research domains, the extensive domain-independent knowledge captured by LLMs offers a substantial opportunity for generating structured contribution descriptions as CKGs. Additionally, LLMs offer customizable pathways through prompt engineering or fine-tuning, thus facilitating the use of smaller LLMs known for their efficiency, cost-effectiveness, and lower environmental impact. Our methodology involves harnessing LLM knowledge and complementing it with domain-expert-verified scholarly data sourced from a CKG. This strategic fusion significantly enhances LLM performance, especially in tasks like scholarly article categorization and predicate recommendation. Our method involves fine-tuning LLMs with CKG knowledge and, additionally, injecting knowledge from a CKG with a novel prompting technique, significantly increasing the accuracy of scholarly knowledge extraction. We integrated our approach into the Open Research Knowledge Graph (ORKG), thus enabling precise access to organized scholarly knowledge and crucially benefiting domain-independent scholarly knowledge exchange and dissemination among policymakers, industrial practitioners, and the general public.

The emergence of Large Language Models (LLM) as a tool in literature reviews: an LLM automated systematic review
Dmitry Scherbakov, Nina Hubig, Vinita Jansari, Alexander Bakumenko, Leslie A. Lenert
arXiv:2409.04600 (arXiv - CS - Digital Libraries), 2024-09-06. https://arxiv.org/abs/2409.04600
Abstract:
Objective: This study aims to summarize the usage of Large Language Models (LLMs) in the process of creating a scientific review. We look at the range of stages in a review that can be automated and assess the current state-of-the-art research projects in the field.
Materials and Methods: The search was conducted in June 2024 in the PubMed, Scopus, Dimensions, and Google Scholar databases by human reviewers. The screening and extraction process took place in Covidence with the help of an LLM add-on that uses OpenAI's gpt-4o model. ChatGPT was used to clean the extracted data and to generate code for the figures in this manuscript; ChatGPT and Scite.ai were used in drafting all components of the manuscript except the methods and discussion sections.
Results: 3,788 articles were retrieved, and 172 studies were deemed eligible for the final review. ChatGPT and GPT-based LLMs emerged as the most dominant architecture for review automation (n=126, 73.2%). A significant number of review-automation projects were found, but only a limited number of papers (n=26, 15.1%) were actual reviews that used an LLM during their creation. Most citations focused on the automation of a particular stage of the review, such as searching for publications (n=60, 34.9%) and data extraction (n=54, 31.4%). When comparing the pooled performance of GPT-based and BERT-based models, the former were better at data extraction, with mean precision of 83.0% (SD=10.4) and recall of 86.0% (SD=9.8), while being slightly less accurate at the title and abstract screening stage (mean accuracy = 77.3%, SD=13.0).
Discussion/Conclusion: Our LLM-assisted systematic review revealed a significant number of research projects related to review automation using LLMs. The results look promising, and we anticipate that LLMs will change the way scientific reviews are conducted in the near future.