Data Lake Strategy for Data Science Workflows
Authors: Elio Villaseñor García, Abel Alejandro Coronado Iruegas, Alejandro Esteban Pimentel Alarcón, Ranyart Rodrigo Suárez Ponce De León, Alejandra Figueroa Martínez, Amado Esquer Martínez, Víctor Silva Cuevas, Irving Gibrán Cabrera Zamora, Edgar Oswaldo Díaz
Published in: 2022 11th International Conference On Software Process Improvement (CIMPS), 2022-10-19
DOI: 10.1109/CIMPS57786.2022.10035694 (https://doi.org/10.1109/CIMPS57786.2022.10035694)
Citation count: 0
Abstract
This paper details the research and technological strategy used to implement a Data Lake and Sandboxes for the Data Science Laboratory at the National Institute of Statistics and Geography (INEGI), Mexico. The project seeks to integrate digital information held in different repositories and in internal and external data sources maintained by the various entities that generate statistical and geographic information. This information, which arrives in various formats, is combined in a unified storage environment (temporary or permanent) that supports advanced processing with analytics and data science techniques.