{"title":"Reproducible Data Science with Python: An Open Learning Resource","authors":"V. Danchev","doi":"10.21105/jose.00156","DOIUrl":null,"url":null,"abstract":"Summary: This paper describes a computational learning resource on Reproducible Data Science with Python. The resource provides an accessible, hands-on introduction to data science techniques, skills, and workflows necessary to perform open, reproducible, and ethical data analysis. By using research problems of real-world relevance (such as vaccine hesitancy and the impact of COVID-19 lockdown measures on human mobility) and real-world social data (including anonymised mobility data from digital sources and recent COVID-19 survey data), the resource encourages students to use open-source tools and coding to learn from diverse and large social data sources. The learning resource aims to minimise barriers to entry for students from social sciences, public health, and related fields. With no software installation or setup requirements, students can start coding from their web browser using free and open-source software (FOSS), including the Python programming language, Jupyter Notebook, and Markdown. 
Through real-world data applications, students are introduced to the open-source Python ecosystem of libraries for data science, including pandas (McKinney, 2010), seaborn (Waskom, 2021), scikit-learn (Pedregosa et al., 2011), statsmodels (Seabold & Perktold, 2010), and NetworkX (Hagberg et al., 2008), and learn about open and reproducible workflows, data wrangling, data exploration and visualization, pattern discovery (e.g., clustering), prediction and machine learning, causal inference, network analysis, and data ethics.","PeriodicalId":75094,"journal":{"name":"The Journal of open source education","volume":" ","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2022-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"The Journal of open source education","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.21105/jose.00156","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
Summary: This paper describes a computational learning resource on Reproducible Data Science with Python. The resource provides an accessible, hands-on introduction to data science techniques, skills, and workflows necessary to perform open, reproducible, and ethical data analysis. By using research problems of real-world relevance (such as vaccine hesitancy and the impact of COVID-19 lockdown measures on human mobility) and real-world social data (including anonymised mobility data from digital sources and recent COVID-19 survey data), the resource encourages students to use open-source tools and coding to learn from diverse and large social data sources. The learning resource aims to minimise barriers to entry for students from social sciences, public health, and related fields. With no software installation or setup requirements, students can start coding from their web browser using free and open-source software (FOSS), including the Python programming language, Jupyter Notebook, and Markdown. Through real-world data applications, students are introduced to the open-source Python ecosystem of libraries for data science, including pandas (McKinney, 2010), seaborn (Waskom, 2021), scikit-learn (Pedregosa et al., 2011), statsmodels (Seabold & Perktold, 2010), and NetworkX (Hagberg et al., 2008), and learn about open and reproducible workflows, data wrangling, data exploration and visualization, pattern discovery (e.g., clustering), prediction and machine learning, causal inference, network analysis, and data ethics.