Chapter 6 Educational Resources
After analyzing the retrospective study, we provided two resources intended for use in statistical instruction to reflect our insights in favor of the tidyverse syntax. We altered current STA 101 labs to further adhere to the GAISE manual and provided two code samples with explanations for infer
, an R package created to perform inference tasks using a tidy framework.
6.1 Infer Vignettes
Currently, the tidyverse contains packages to clean, mutate, model and visualize data, but not to perform inference tasks. Due to the popularity and demand for the tidyverse framework, a group of professors created a relatively new package titled infer
, which is fashioned as the tidyverse solution to inference. Like packages such as dplyr
and ggplot2
, infer
relies on a few commands to execute a variety of tasks and benefits from the piping structure used throughout the tidyverse.
Unfortunately, infer
does not presently include detailed examples of the package’s usage. Therefore, we decided to write two vignettes—longer code samples embedded within data analysis stories—to provide infer
users with vivid examples and explanations for proper usage of the package.
By inference, the two primary endeavors are performing hypothesis tests and creating confidence intervals—hence a vignette for each. The vignettes also utilize separate aspects of the same dataset, to provide users the comfort of a consistent data source while exposing them to tasks using both numerical and categorical variables.
The complete vignettes are available in the Appendix.
6.2 Lab Enhancements
Based on the results gleaned from analyzing the student project data from introductory statistics courses at Duke University, we decided to fully apply tidyverse syntax and methods to introductory statistics labs. OpenIntro Statistics is an open-source textbook approved for use at the undergraduate level by the American Institute of Mathematics, and the labs accompanying OpenIntro textbooks are available online and are used by students at many institutions, including at Duke University. These labs were updated in 2016 to incorporate tidyverse syntax for data visualization and wrangling, though statistical inference still relied on Base R syntax. As part of this project, we have updated the code introduced in the labs to fully leverage the Tidyverse ecosystem, including the infer
package.
Since the most recent update of the labs to incorporate tidyverse syntax, recommendations around introducing data visualization with ggplot2
have changed. Previous practice used the qplot()
(quick plot) function, which has a simpler API than the ggplot()
function, but is more cumbersome to produce complex multivariate visualizations with. With this update, we have completely abandoned the use of qplot()
and replaced it with ggplot()
, resulting in changes in associated code. Two main reasons for this change are the existence of plethora of resources for debugging ggplot()
as well as ease of expansion to complex visualizations.
Another major update was in the labs that focus on statistical inference. Previous versions of the labs used a custom function called inference()
from the oilabs
package, the R package used to supplement OpenIntro. This function, designed with the best intentions in mind for highlighting the unified nature of statistical inference across various hypothesis tests and confidence intervals introduced in introductory statistics curricula, over time morphed into a function that is too extensive for efficient debugging and too customized for use beyond the introductory statistics classroom. The infer package, released in 2018, was heavily inspired by the inference()
function, but significantly improved the API for tidy statistical inference and tied it closely to how both we and the GAISE recommend introducing statistical inference in introductory statistics curricula. The goal of this package “is to perform inference using an expressive statistical grammar that coheres with the tidy design framework,” as per its R documentation.
For instance, here is code for generating a two-sided hypothesis test using the two methods, with the inference()
function first.
inference(y = y_variable, x = x_variable, data = data, statistic = "mean", type = "ht", null = 0, alternative = "twosided", method = “theoretical")
obs_diff <- data %>%
specify(dependent ~ independent) %>%
calculate(stat = "diff in means", order = c("yes", "no"))
null <- data %>%
specify(dependent ~ independent) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "diff in means", order = c("yes", "no"))
null %>%
get_p_value(obs_stat = obs_diff, direction = "two_sided")
And here is an example of code creating a 95 percent confidence interval using inference()
and infer
, respectively.
inference(y = response, data = data, statistic = "proportion", type = "ci", method = "theoretical", success = "yes")
data %>%
specify(formula = text_ind ~ NULL, success = "yes") %>%
generate(reps = 1000, type = "bootstrap") %>%
calculate(stat = "prop") %>%
get_ci(level = .95)
The labs focusing on statistical inference have thus been updated with the current infer syntax in order to provide students with a workflow they can extend outside of their classroom setting alongside the tidyverse, as well as one with available online resources.
A few of the labs also featured datasets that have either been out of use for the past few years and will continue to be considered outmoded, were convoluted and required supplementary information, or intentionally or unintentionally promoted gender stereotypes. These datasets, and the corresponding labs, were updated with more robust and interesting material, as summarized in the table below. The datasets were analyzed prior to insertion to affirm that they would have the bandwidth to seamlessly fit the particular focus of the labs.
Previous Dataset | Reason for Change | New Dataset |
---|---|---|
Body Dimensions | Focuses on weight comparison across binary genders | Fast Food Nutritional Facts |
Atheism and Religion | Unverified Dataset | Youth Risk Behavior Surveillance System |
Baseball | Required further dataset explanation | Human Freedom Index |
North Carolina Births | Dated + Not relevant for most undergrads | Youth Risk Behavior Surveillance System |