Chapter 4 Methods

Since 98 percent of the projects were submitted as either an R script or an R Markdown document, the student project code was analyzed directly from the downloaded submission documents for each group. Each project was scored on 16 variables, receiving a 1 if the attribute was present and a 0 if it was missing. The remaining three covariates identified the student projects by grade, index, and class.

Due to privacy concerns, each student project was assigned an index, and the names of the students in each group were removed from their submission document. In the dataset utilized in this thesis, projects can be identified only by their assigned index. A separate dataset serves as a link between the individual projects and their titles.

The first few rows of the dataset compiled in this project are available below.

head(project)

# A tibble: 6 x 22
  index grade sem   r_rmd tidyverse create_new_var change_var sub_analysis
  <chr> <dbl> <chr> <chr> <chr>     <chr>          <chr>      <chr>       
1 1      87.1 Fall… .r    base R    no             no         yes         
2 2      89.2 Fall… .r    base R    yes            no         yes         
3 3      80.2 Fall… .r    base R    no             no         yes         
4 4      87.2 Fall… .r    base R    yes            no         yes         
5 5      80.4 Fall… .r    base R    no             no         yes         
6 7      90.2 Fall… .r    base R    no             no         no          
# … with 14 more variables: sub_data <chr>, viz_mult_make <chr>,
#   viz_mult_interpret <chr>, eda_theme <chr>, rel_data <chr>,
#   slr_fit <chr>, mlr_fit <chr>, mlr_check_cond <chr>, prediction <chr>,
#   ht <chr>, ht_check_cond <chr>, creative <int>, theme <int>,
#   multiviz <int>

The student projects were not compiled into PDF or HTML files to confirm that the code worked. Determining which versions of the R packages the students had utilized was a near-impossible task, and some of the commands they used are now defunct in the most recent versions of those packages. This analysis therefore operates under the assumption that the code in each project produced the desired results and did not require further debugging.

The contents of the student project code were still analyzed for clarity, as well as for creativity, depth, and multivariate visualizations, through the 16 indicator variables. These specific metrics were created because they are prominently emphasized throughout the GAISE. Code snippets will be provided in the following chapter to display examples that scored a 1 for distinct covariates.

index: Project ID
grade: Score on final project
sem: Semester course taken
r_rmd: Was the submission an R script (with prose of the project turned in as a Word document) or was the submission an R Markdown file?
tidyverse: Project used "tidyverse" or "base R" syntax
create_new_var: Students created a new variable based on existing variables, "yes" or "no"
change_var: Students changed existing variables, "yes" or "no"
sub_analysis: Students performed a subgroup analysis, "yes" or "no"
sub_data: Students used data subsets for the entire project, "yes" or "no"
viz_mult_make: Students employed visualizations with at least three variables, "yes" or "no"
viz_mult_interpret: Students properly interpreted their 3+ variable visualization, "yes" or "no"
eda_theme: Students used a consistent theme throughout their project, "yes" or "no"
rel_data: Students supplemented their project theme with relevant data, "yes" or "no"
slr_fit: Students fitted a simple linear regression, "yes" or "no"
mlr_fit: Students fitted a multiple linear regression, "yes" or "no"
mlr_check_cond: Students properly checked the conditions for their multiple linear regression, "yes" or "no"
prediction: Students used their multiple linear regression to predict a movie’s audience score, "yes" or "no"
ht: Students performed a hypothesis test, "yes" or "no"
ht_check_cond: Students correctly checked the conditions for their hypothesis test, "yes" or "no"

4.1 Creativity

The creativity metric seeks to encapsulate anything students coded that was not directly specified in the instructions but served a purpose in their projects. The metric’s possible scores range from 0 to 4, as a project received a single point for each of the following:

Creation of new variable(s) based on existing variables
Transformation of existing variables
Existence of a subgroup analysis
The use of a subset of the dataset for all steps of the project
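
The scoring rule above can be sketched in base R as follows. This is a minimal illustration under stated assumptions rather than the code used to build the thesis dataset: the three rows are made-up values, and only the column names (create_new_var, change_var, sub_analysis, sub_data, creative) are taken from the variable list earlier in this chapter.

```r
# Hypothetical toy rows; only the column names come from the dataset description.
project <- data.frame(
  index          = c("1", "2", "3"),
  create_new_var = c("no", "yes", "yes"),
  change_var     = c("no", "no", "yes"),
  sub_analysis   = c("yes", "yes", "yes"),
  sub_data       = c("no", "no", "yes"),
  stringsAsFactors = FALSE
)

creativity_cols <- c("create_new_var", "change_var", "sub_analysis", "sub_data")

# One point per satisfied criterion: count the "yes" entries in each row.
project$creative <- as.integer(rowSums(project[creativity_cols] == "yes"))

project[c("index", "creative")]
```

Summing the indicators keeps the composite score transparent, since any value from 0 to 4 can be traced back to the individual criteria.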

Student groups whose projects utilized tidyverse syntax were still given a score of 1 if they satisfied these conditions using base R. While rare, two groups in labs taught in the tidyverse created or transformed covariates using base R syntax, which was likely due to alternative resources, such as Stack Overflow, that prioritized base R solutions. Now, though, as the tidyverse’s popularity continues to grow, more online resources incorporate and promote tidyverse solutions.

4.1.1 Creation of New Variable(s)

The creation of new variable(s) is defined as any data manipulation during the EDA process in which a student group composes a previously non-existent covariate. As an example, one group created a new variable tracking whether a movie had won any of the following awards: best picture, best actor, best actress, or best director, and that project had this variable coded as “yes.” In order to score a 1, the student project also had to utilize the new variable within some aspect of its analysis. This condition filtered out groups that created unnecessary covariates. However, a score of 1 was still valid if the group did not use the variable in the inference or regression sections but did explore the covariate in their EDA.
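
As a rough illustration of the kind of code that would earn this point, the following base R sketch derives an award indicator from four existing columns. The column names (best_pic_win, best_actor_win, best_actress_win, best_dir_win), the new name any_award, and the data values are assumptions made for this example and are not taken from any student submission.

```r
# Toy stand-in for the movies data; column names and values are assumed.
movies <- data.frame(
  title            = c("A", "B", "C"),
  best_pic_win     = c("no", "yes", "no"),
  best_actor_win   = c("no", "no", "yes"),
  best_actress_win = c("no", "no", "no"),
  best_dir_win     = c("no", "no", "no"),
  stringsAsFactors = FALSE
)

award_cols <- c("best_pic_win", "best_actor_win",
                "best_actress_win", "best_dir_win")

# New variable: "yes" if the movie won at least one of the four awards.
won_any <- unname(rowSums(movies[award_cols] == "yes") > 0)
movies$any_award <- ifelse(won_any, "yes", "no")

# Using the new variable in the analysis is what earns the point.
table(movies$any_award)
```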

4.1.2 Transformation of Existing Variables

Although related to the above covariate, the transformation of existing variables did not qualify as creating new variables, or vice versa. In this situation, a project would score a 1 if the student group mutated a variable already existing within the dataset, generally to highlight certain cases. For instance, a few project groups decided to recode mpaa_rating as either “R” or, if the movie was not rated R, “Other.” Similar to the requirements for scoring a 1 for the creation of new variable(s) covariate, the mutation had to be employed to some end: groups had to provide at least a cursory analysis of the newly mutated variable to score a 1.

This covariate is distinct from subsetting the dataset or conducting a subgroup analysis. Filtering the dataset to only the entries that cover a portion of the levels of a specific variable would qualify as part of either a subgroup analysis or a data subset, but not as a transformation. Also, converting a variable to a factor when it could have been read in as a factor while loading the dataset did not qualify as a mutation of an existing variable for this study.
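
The difference between a transformation and a filter can be sketched in base R. The data values below are invented for illustration, and the column name audience_score is an assumption; only mpaa_rating and its “R”/“Other” recoding come from the text above.

```r
# Invented toy data; audience_score is an assumed column name.
movies <- data.frame(
  mpaa_rating    = c("R", "PG-13", "G", "R", "PG"),
  audience_score = c(70, 55, 80, 62, 45),
  stringsAsFactors = FALSE
)

# Transformation: collapse the rating into "R" versus "Other".
movies$mpaa_rating <- ifelse(movies$mpaa_rating == "R", "R", "Other")

# A cursory analysis of the recoded variable, as required for a score of 1.
mean_by_rating <- tapply(movies$audience_score, movies$mpaa_rating, mean)

# By contrast, filtering only keeps certain rows and leaves the variable
# itself unchanged, so it would count toward a subgroup analysis or a
# data subset rather than toward this covariate.
r_only <- movies[movies$mpaa_rating == "R", ]
```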

4.1.3 Existence of Subgroup Analysis

The presence of a subgroup analysis was measured as part of creativity. Projects that received a 1 analyzed portions of the data during their EDA process. Groups could use an assortment of commands to earn a score of 1, such as a boxplot or five-number summary of a specific variable within the movies dataset, or a subset paired with a corresponding numerical or graphical analysis. As an example, a project receiving a 1 for this category may have analyzed how the audience ratings for R-rated movies compared to those of PG-13 movies in its bivariate analysis.
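
A subgroup comparison of this kind might look like the following base R sketch. The data are invented, and the column names mpaa_rating and audience_score are assumed for illustration.

```r
# Invented toy data with assumed column names.
movies <- data.frame(
  mpaa_rating    = c("R", "PG-13", "R", "PG-13", "G", "R"),
  audience_score = c(60, 70, 80, 50, 90, 70),
  stringsAsFactors = FALSE
)

# Keep only the two subgroups being compared.
sub <- movies[movies$mpaa_rating %in% c("R", "PG-13"), ]

# Five-number summary (min, lower hinge, median, upper hinge, max)
# of audience scores within each rating.
five_num <- tapply(sub$audience_score, sub$mpaa_rating, fivenum)

five_num[["R"]]
five_num[["PG-13"]]
```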

4.1.4 Use of a Data Subset for Project’s Entirety

Although the data subset covariate may seem similar to the one above, it rewarded a different aspect of the final group projects. Here, student groups did not simply use the provided movies dataset for their EDA, inference, and regression; they intentionally focused on a few characteristics of the movies dataset. Student projects were not required to employ the same subsetted data throughout the entire analysis, but they did have to analyze related aspects of the movies dataset to qualify for a 1. For example, one student group scrutinized solely PG-13 rated movies for their final project, while another used the PG-13 rated movies subset for the EDA, PG-13 movies released after 2000 for the inference, and then the same PG-13 rated movies subset from the EDA for their regression analysis.
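
The second example above, in which related subsets feed different stages of the project, could be set up along these lines. This is a sketch on invented data; the column name thtr_rel_year for the release year is an assumption, as are the object names pg13 and pg13_recent.

```r
# Invented toy data; thtr_rel_year is an assumed name for the release year.
movies <- data.frame(
  title          = c("A", "B", "C", "D"),
  mpaa_rating    = c("PG-13", "R", "PG-13", "PG-13"),
  thtr_rel_year  = c(1995, 2005, 2003, 2010),
  audience_score = c(64, 71, 58, 80),
  stringsAsFactors = FALSE
)

# Subset used for the EDA and the regression: PG-13 movies only.
pg13 <- subset(movies, mpaa_rating == "PG-13")

# Related, narrower subset used for the inference: PG-13 movies after 2000.
pg13_recent <- subset(pg13, thtr_rel_year > 2000)
```

Deriving the narrower subset from pg13 rather than from movies makes the relationship between the two subsets explicit.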

4.2 Depth

The depth metric measures the depth of the analysis, both in terms of the statistical methods utilized and in terms of story-telling. Since the GAISE advises instructors to focus on students’ comprehension of important basic concepts rather than covering a multitude of topics with little focus, the depth metric qualifies the student groups’ understanding of the subjects covered in the project. The metric ranges from 0 to 2 and is scored with 1 point for each of the following:

Presence of consistent theme throughout the project
Use of relevant data

The depth metric was created to qualify findings from the creativity score by confirming that the code behind the more creative student projects was also of at least the same quality. Although creativity is imperative in these final projects, student groups also cannot skip parts of the data analysis cycle.

4.2.1 Consistent Theme

Story-telling is an important aspect of data science, and it is designed to be just as important in the STA 101 final group projects. A strong final project requires a story: a leading question, initial findings, subsequent analyses, and a conclusion, all formed around a specific theme. Although this covariate’s scoring was subjective, the requirements for final projects to score a 1 were similar to those defining the creativity metric. To receive a 1, student groups clearly linked the steps in their analysis, often choosing to focus on a few aspects within the entire movies dataset. For instance, analyzing the impact of movie ratings on audience scores would qualify as a sufficient theme, but merely inspecting an assortment of different predictors with minimal reasoning would not register as a 1.

4.2.2 Presence of Relevant Data

Another subjective variable, the presence of relevant data was formed to complement the consistent theme covariate. To receive a 1 for this variable, student groups were required to sufficiently use R to create insights surrounding their chosen theme(s). The covariate addresses the issue that projects may have interesting themes but lack the analysis and coding quality to support them. As an example, a group project that would have scored a 1 for this category could have displayed the correlation coefficient between two numerical variables, rather than plotting them together and failing to acknowledge the correlation in the project submission. If the majority of the code could be employed to support the final project, the project group received a 1.

4.3 Multivariate Visualization

The multivariate visualization metric accounts for both the presence of, and the insights derived from, visualizations with at least three variables. Because the movies dataset contains many binary variables that student groups often analyzed, visualizations with at least three variables and their subsequent interpretations can supplement important insights uncovered in the final projects. Also, the GAISE highlights the significance of teaching students how to interpret multivariate visualizations: “When students leave an introductory course, they will likely encounter situations within their own fields of study in which multiple variables relate to one another in intricate ways. We should prepare our students for challenging questions that require investigating and exploring relationships among more than two variables” (Carver et al., 2016). The metric ranges from 0 to 2 and is scored with 1 point for each of the following:

Presence of a visualization with 3+ variables
Interpretation of the multivariate visualization

Although the two variables that constitute the multivariate visualization metric are related, as a project could not score a 1 for the interpretation if it did not contain a multivariate visualization, the presence of the visualization did not imply that there was a useful interpretation in the project write-up.
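
The dependency between the two indicators can be checked mechanically before the metric is summed. The following base R sketch uses invented rows; the column names viz_mult_make, viz_mult_interpret, and multiviz come from the variable list earlier in this chapter, while the check itself is an illustration and not a step documented in the original scoring process.

```r
# Invented toy rows; column names follow the dataset description.
project <- data.frame(
  index              = c("1", "2", "3"),
  viz_mult_make      = c("no", "yes", "yes"),
  viz_mult_interpret = c("no", "no", "yes"),
  stringsAsFactors = FALSE
)

made        <- project$viz_mult_make == "yes"
interpreted <- project$viz_mult_interpret == "yes"

# An interpretation without a visualization would be a scoring error.
inconsistent <- interpreted & !made
stopifnot(!any(inconsistent))

# One point for the visualization and one for its interpretation.
project$multiviz <- as.integer(made + interpreted)
```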

4.3.1 Presence of a Visualization with 3+ Variables

The presence of a visualization with at least three variables is an objective variable that is simple to score when dissecting final project submissions. Groups nearly always utilized colors to display a third variable along with two numerical ones on the x- and y-axes. To receive a score of 1, projects were required to produce a graphical output with at least three aesthetics. For instance, some student groups created scatterplots of the critic and audience scores for all the movies in the given dataset, with different colors for movies that won best picture at the Oscars that year.
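
A scatterplot of this kind can be sketched with base R graphics, which avoids assuming any particular package version. The data values, the column names (critics_score, audience_score, best_pic_win), and the color choices are all assumptions made for illustration; student groups working in the tidyverse would more likely have mapped the third variable to a color aesthetic in ggplot2.

```r
# Invented toy data with assumed column names.
movies <- data.frame(
  critics_score  = c(90, 40, 75, 60),
  audience_score = c(85, 50, 70, 65),
  best_pic_win   = c("yes", "no", "no", "no"),
  stringsAsFactors = FALSE
)

# Third variable shown through color: best picture winners stand out.
point_col <- ifelse(movies$best_pic_win == "yes", "goldenrod", "grey40")

plot(movies$critics_score, movies$audience_score,
     col = point_col, pch = 19,
     xlab = "Critics score", ylab = "Audience score")
legend("topleft",
       legend = c("Won best picture", "Did not win"),
       col = c("goldenrod", "grey40"), pch = 19)
```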

4.3.2 Interpretation of Multivariate Visualization

The interpretation of a multivariate visualization is not a completely objective variable. A project containing an incorrect or insufficient interpretation of its multivariate visualization did not receive a score of 1. To be sufficient, an interpretation did not need to address every aspect of the visualization, but it did need to discuss a key insight that helped bolster the overall final project. Any such useful explanation earned a score of 1.