Héctor L Venegas-Quiñones, Pablo A Garcia-Chevesich, Madeleine Guillen, Francisco Alejo, John E McCray
Ground water. Journal Article. Published 2025-07-16. DOI: 10.1111/gwat.70007 (https://doi.org/10.1111/gwat.70007)
Addressing the Problem of Hard-to-Reach Unpublished Data from Theses in University Repositories.
Researchers often struggle to access valuable data contained in university theses, most of which are archived as PDFs in repositories and never published. These documents frequently report original research, including vital environmental and hydrological data, yet they are difficult to search or analyze because of inconsistent formatting and inefficient repository search tools, such as keyword searches that return an overwhelming list of documents. Our research team, which is developing a groundwater database for the Arequipa region of Peru, encountered this issue directly: numerous relevant theses were dispersed across local university repositories. Manual review proved excessively time-consuming, so we developed an automated solution. Our multi-step methodology began with optical character recognition (OCR) and Python scripts for keyword scoring, followed by Large Language Models (LLMs), notably Google's Gemini and locally hosted models run through Ollama, to analyze content semantically. This allowed us to identify and extract pertinent data (e.g., water quality parameters, well locations) and organize it into usable formats such as Excel spreadsheets; subsequent manual checks confirmed a high level of accuracy. The final system lets users query a large number of documents quickly and in context, overcoming the limitations of traditional keyword search. The tool is currently being disseminated among local researchers and institutions as a means of accessing and managing regional groundwater data. The methodology could be scaled and adapted globally, improving access to gray literature and expediting scientific discovery across disciplines.
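As a rough illustration of the keyword-scoring step mentioned in the abstract, the sketch below ranks OCR-extracted thesis text by weighted keyword density. It is not the authors' script: the keyword list, the weights, the per-1,000-words normalization, and the function names `normalize`, `keyword_score`, and `rank_documents` are all assumptions made for this example, and the Spanish terms are hypothetical choices for theses from Peruvian repositories.

```python
import re
import unicodedata

# Hypothetical keyword weights; the abstract does not list the actual terms used.
KEYWORD_WEIGHTS = {
    "agua subterranea": 3.0,
    "groundwater": 3.0,
    "acuifero": 2.0,
    "pozo": 2.0,
    "calidad del agua": 1.5,
}


def normalize(text: str) -> str:
    """Lowercase, strip accents, and collapse whitespace so OCR variants match."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"\s+", " ", no_accents).strip()


def keyword_score(text: str, weights: dict[str, float] = KEYWORD_WEIGHTS) -> float:
    """Return weighted whole-word keyword hits per 1,000 words (assumed metric)."""
    clean = normalize(text)
    n_words = len(clean.split())
    if n_words == 0:
        return 0.0
    total = 0.0
    for keyword, weight in weights.items():
        # \b anchors restrict matches to whole words or whole phrases.
        hits = len(re.findall(r"\b" + re.escape(keyword) + r"\b", clean))
        total += weight * hits
    return 1000.0 * total / n_words


def rank_documents(docs: dict[str, str], top_n: int = 10) -> list[tuple[str, float]]:
    """Sort documents by descending keyword score and keep the top_n."""
    scored = [(name, keyword_score(text)) for name, text in docs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]
```

Under these assumptions, a call such as `rank_documents({"tesis_a.pdf": ocr_text_a, "tesis_b.pdf": ocr_text_b})` would produce a shortlist that could then be passed to an LLM for semantic analysis. Whole-word matching misses plurals and inflected forms, so a real keyword list would need to include those variants or apply stemming.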