The how of data
H. MacGillivray
Teaching Statistics, vol. 45, no. 1, pp. 1-3. Published 2022-12-27. DOI: 10.1111/test.12329
Abstract
For many decades, professional statisticians and statistics educators have emphasized the central importance of identifying, taking account of, and reporting the 5 W's of data: What, Why, When, Where, and by Whom. If data are to be collected or accessed, we can add How: how can we obtain the data we need or want? The word “How”, used broadly, can also encompass much of the 5 W's, as the What and Why are needed to understand How the necessary or desired data can be obtained, or were obtained. That these are all integral to statistics and statistics investigations has also been emphasized, but it can never be highlighted enough that they should be at the heart of teaching statistics, no matter to whom or at what level. It can be a delight for teachers to discover this; I will always remember the excitement of senior school teachers learning this 30 years ago in hands-on professional development workshops: “You mean this is all part of statistics, not just preliminaries to statistics? Wow!” Unfortunately, learning from discipline and/or teaching frontlines does not necessarily penetrate the citadel of educational authority. The question of the Who, the What, the How, and the How much of teaching statistics in education faculties, whether for future teachers or future research (where the multiple t-test tyranny appears to continue unchecked), is open for a different discussion. As the eras of big data and data science gradually grew and then exploded, the 5 W's and the How of data in teaching have “of course” become even more important and have received renewed attention, as noted by many authors, including in the 2021 special issue of Teaching Statistics. But as Shatz [6] reminds us in this issue, we should avoid saying “of course” and be ever mindful of the perpetual need to both explain and illuminate what statistics is, including that the central roles of the 5 W's and the How of data are of critical importance in real data science.
In this issue, Lasater et al. [2] highlight that “two critical learning elements now are working with complex publically-available datasets and choice and use of appropriate visualization in investigating multivariable data.” In [2], “These are the focus of the lab activity described here, set in an important social context.” Expansion to complex, large, publicly available datasets and technologically intensive procedures does not mean relegation of other types of datasets or data collections. It just means the big tent of statistics and statistics teaching got even bigger. Collecting data, observing data, experimental design, and surveys still have major roles to play across all of statistics and its applications, and in teaching. But no matter what type or size of dataset, and no matter what the teaching context, without knowing, taking account of, and reporting on the 5 W's and the How of the data, analysis and interpretation may be compromised. Three articles in this issue provide excellent illustrations of this in different teaching and/or statistical contexts. All three focus on aspects of measurement and design, and all three demonstrate the critical importance of full knowledge of the source, nature, and context of data. Whatever directions statistics and data science and their teaching take, instructors will continue to seek, as they always have, interesting real datasets and rich contexts to introduce, lead into, or illustrate statistical concepts, models, visualizations, technologies, methods, or analyses. Because of the nature of statistics, a variety of datasets for student experiential learning is invaluable. Since combining the use of subsets of larger and/or multivariable datasets with that of smaller, more specific datasets provides good pedagogical balance, instructors are always appreciative of resources of real datasets in real contexts with a specified number of variables of a specified type.
In “Bare bones, or a rich feast?” [1], Sue Finch and Ian Gordon discuss the source information provided for datasets in the R “datasets” package, finding that for “69% there were obvious questions about units, factor levels, and/or design or measurement”, and conduct an extensive investigation into four datasets potentially useful for teaching linear models with one or two categorical explanatory variables. They find that impoverished data landscapes, sometimes even with potentially misleading or wrong contexts, can lead to “Sanitized versions of the reality behind the data fail(ing) to reflect the complexity and messiness that arise in practice” and “missed opportunities in teaching and learning”, as well as credibility issues in analyses and interpretations. The authors conclude their investigations with some guidelines on the curation and documentation of datasets for teaching resources, with particular emphasis on measurement and design.
DOI: 10.1111/test.12329