Addressing the Problem of Hard-to-Reach Unpublished Data from Theses in University Repositories.

Ground water. Pub Date: 2025-07-16. DOI: 10.1111/gwat.70007
Héctor L Venegas-Quiñones, Pablo A Garcia-Chevesich, Madeleine Guillen, Francisco Alejo, John E McCray
Citations: 0

Abstract


Researchers frequently encounter challenges in accessing valuable data encapsulated within university theses, which are predominantly archived in PDF format and remain unpublished in repositories. These documents often encompass original research, including vital environmental and hydrological data, yet they pose difficulties for searching or analysis due to inconsistent formatting and inefficient repository search tools such as keyword searches, which lead to an overwhelming list of documents. Our research team, engaged in developing a groundwater database for the Arequipa region of Peru, encountered this issue directly, with numerous relevant theses dispersed across local university repositories. The manual review process proved excessively time-consuming, necessitating the development of an innovative, automated solution. Our multi-step methodology commenced with optical character recognition (OCR) and Python scripts for keyword scoring, followed by the employment of Large Language Models (LLMs), notably Google's Gemini and the locally hosted Ollama, to semantically analyze content. This facilitated the identification and extraction of pertinent data (e.g., water quality parameters, well locations) and its organization into usable formats such as Excel spreadsheets; subsequent manual checks confirmed a high level of accuracy. The final system enables users to query an extensive number of documents swiftly and contextually, effectively overcoming traditional keyword search limitations. The tool is presently being disseminated among local researchers and institutions, offering a robust solution for accessing and managing regional groundwater data. This methodology possesses the potential for global scaling and adaptation, thereby enhancing access to gray literature and expediting scientific discovery across various disciplines.
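
The keyword-scoring step mentioned in the abstract could look roughly like the sketch below. The abstract does not give the authors' keyword list, weights, or scoring formula, so everything here is an assumption for illustration: the names `KEYWORD_WEIGHTS`, `keyword_score`, and `rank_documents` are hypothetical, and the input is assumed to be plain text already produced by OCR.

```python
import re

# Hypothetical keywords and weights; the abstract does not list the real ones.
KEYWORD_WEIGHTS = {
    "groundwater": 3.0,
    "aquifer": 2.0,
    "water quality": 2.0,
    "well": 1.0,
}


def keyword_score(text, weights=KEYWORD_WEIGHTS):
    """Return the weighted keyword hits per 1,000 words of text."""
    lowered = text.lower()
    n_words = len(re.findall(r"\w+", lowered))
    if n_words == 0:
        return 0.0
    total = 0.0
    for keyword, weight in weights.items():
        # Word boundaries keep "well" from matching inside "dwelling".
        pattern = r"\b" + re.escape(keyword) + r"\b"
        total += weight * len(re.findall(pattern, lowered))
    return 1000.0 * total / n_words


def rank_documents(docs, weights=KEYWORD_WEIGHTS):
    """Sort (name, score) pairs from most to least relevant."""
    scored = [(name, keyword_score(text, weights)) for name, text in docs.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    sample = {
        "thesis_a": "Groundwater well in the aquifer",
        "thesis_b": "A study of urban dwelling patterns",
    }
    for name, score in rank_documents(sample):
        print(f"{name}: {score:.1f}")
```

Normalizing by document length keeps long theses from outranking short ones purely by size. A ranking like this would only narrow the set of documents passed on to the LLM stage, not replace it.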
