The how of data

IF 0.8 Q2 EDUCATION & EDUCATIONAL RESEARCH

Teaching Statistics Pub Date : 2022-12-27 DOI:10.1111/test.12329

H. MacGillivray

Teaching Statistics, volume 45, issue 1, pages 1-3.

Citation count: 0

Abstract

For many decades, professional statisticians and statistics educators have emphasized the central importance of identifying, taking account of, and reporting the 5 W's of data: What, Why, When, Where, and by Whom. If data are to be collected or accessed, we can add How: how can we obtain the data we need or want? The word “How”, used broadly, can also encompass much of the 5 W's, as the What and Why are needed to understand How the necessary or desired data can be obtained, or were obtained. That these are all integral to statistics and statistical investigations has also been emphasized, but it can never be highlighted enough that they should be at the heart of teaching statistics, no matter to whom or at what level. It can be a delight for teachers to discover this; I will always remember the excitement of senior school teachers learning this 30 years ago in hands-on professional development workshops: “You mean this is all part of statistics, not just preliminaries to statistics? Wow!” Unfortunately, learning from discipline and/or teaching frontlines does not necessarily penetrate the citadel of educational authority. The question of the Who, the What, the How, and the How much of teaching statistics in education faculties, whether for future teachers or future research (where the multiple t-test tyranny appears to continue unchecked), is open for a different discussion. As the eras of big data and data science gradually grew and then exploded, the 5 W's and the How of data in teaching have “of course” become even more important and have received renewed attention, as many authors have commented, including in the 2021 special issue of Teaching Statistics. But as Shatz [6] reminds us in this issue, we should avoid saying “of course” and be ever mindful of the perpetual need to both explain and illuminate what statistics is, including that the 5 W's and the How of data play central and critically important roles in real data science.
In this issue, Lasater et al. [2] highlight that “two critical learning elements now are working with complex publically-available datasets and choice and use of appropriate visualization in investigating multivariable data.” As they put it in [2], “These are the focus of the lab activity described here, set in an important social context.” Expansion to complex, large, publicly available datasets and technologically intensive procedures does not mean relegation of other types of datasets or data collections. It just means the big tent of statistics and statistics teaching got even bigger. Collecting data, observing data, experimental design, and surveys still have major roles to play across all of statistics and its applications, and in teaching. But no matter what the type or size of dataset, and no matter what the teaching context, without knowing, taking account of, and reporting on the 5 W's and the How of the data, analysis and interpretation may be compromised. Three articles in this issue provide excellent illustrations of this in different teaching and/or statistical contexts. All three focus on aspects of measurement and design, and all three demonstrate the critical importance of full knowledge of the source, nature, and context of data. Whichever directions statistics, data science, and their teaching take, instructors will continue to seek, as they always have, interesting real datasets and rich contexts to introduce, lead into, or illustrate statistical concepts, models, visualizations, technologies, methods, or analyses. Because of the nature of statistics, a variety of datasets for student experiential learning is invaluable. Since combining subsets of larger and/or multivariable datasets with smaller, more specific datasets provides good pedagogical balance, instructors are always appreciative of resources of real datasets in real contexts with a specified number of variables of a specified type.
In “Bare bones, or a rich feast?” [1], Sue Finch and Ian Gordon discuss the source information provided for datasets in the R “datasets” package, finding that for “69% there were obvious questions about units, factor levels, and/or design or measurement”, and investigate in depth four datasets that are potentially useful for teaching linear models with one or two categorical explanatory variables. They find that impoverished data landscapes, sometimes even with potentially misleading or wrong contexts, can lead to “Sanitized versions of the reality behind the data fail(ing) to reflect the complexity and messiness that arise in practice” and “missed opportunities in teaching and learning”, as well as credibility issues in analyses and interpretations. The authors conclude their investigations with some guidelines on the curation and documentation of datasets for teaching resources, with particular emphasis on measurement and design.
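One way to make the 5 W's and the How concrete when curating a teaching dataset is to record them as explicit metadata and flag whatever is missing. The following is a minimal sketch in Python, not a method taken from the editorial or from [1]: the class `DatasetProvenance`, the function `missing_provenance`, and the example record are all hypothetical, and the example values are invented for illustration.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class DatasetProvenance:
    """Hypothetical record of the 5 W's and the How for one dataset."""
    what: Optional[str] = None   # what was measured, including units
    why: Optional[str] = None    # purpose of the data collection
    when: Optional[str] = None   # when the data were collected
    where: Optional[str] = None  # where the data were collected
    who: Optional[str] = None    # by whom the data were collected
    how: Optional[str] = None    # design and measurement procedure


def missing_provenance(record: DatasetProvenance) -> List[str]:
    """Return the names of provenance fields that are absent or blank."""
    return [
        f.name
        for f in fields(record)
        if not (getattr(record, f.name) or "").strip()
    ]


# Invented example: a classroom dataset with incomplete documentation.
example = DatasetProvenance(
    what="Plant height in centimetres, one row per plant",
    why="Compare growth under two watering schedules",
    where="School greenhouse",
    who="   ",  # blank entries count as missing
    how="Plants randomly allocated to schedules; height measured weekly",
)

print(missing_provenance(example))  # ['when', 'who']
```

A check of this kind only shows that something has been written for each item; whether the units, factor levels, and design are described adequately still needs a reader's judgment.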




Source journal

Teaching Statistics (EDUCATION & EDUCATIONAL RESEARCH)

CiteScore

2.10

Self-citation rate

25.00%

Articles published