Chapter 3 Data

3.1 Final Project

The final project simulates a complete data analysis using relevant techniques covered throughout the STA 101 curriculum. It is the second of two projects that groups complete, but it differs drastically from the first. Whereas the first project focuses on statistical inference and the dataset description, the second is more comprehensive in its requirements, with specifications for exploratory data analysis (EDA) and regression sections, as well as for more innovative ways of looking at the data to address the final project’s directives. Working in an R script or R Markdown document, student groups simulate a task at a new movie studio, where their boss hypothetically assigns two goals: to learn about the attributes that make a movie popular, and to learn something new about movies.

The final assignment contains five components: an introduction, a univariate analysis, a bivariate analysis, a multiple regression for predicting audience scores, and a conclusion. The regression section is relatively constant across groups, since every group works toward determining an optimal regression model. In the univariate and bivariate analyses, however, student groups have the flexibility to explore a variety of facets of the data, and this is often what separates projects from one another.

The student project dataset has remained nearly the same for each class, and it tracks a random sample of American movies released since 1970. It contains between 25 and 32 variables summarizing each movie’s general characteristics, such as runtime, genre, and production studio; award trackers, such as indicator variables for best picture, best actor, best actress, and best director; and data from an online film review website (Rotten Tomatoes) and an online movie database (IMDB). Although variables such as the producing studio, the month and day of the week of both the theatre and DVD releases, the IMDB rating (out of 10), and the audience rating on Rotten Tomatoes were not included in the original 2013 dataset, student groups were still provided with a sufficient number of potential variables to analyze. The dataset’s most recent codebook is available in the Appendix as part of the Spring 2016 project assignment.

3.2 Student Groups

Students remained in the groups they were assigned at the beginning of the semester, which were formed based on the results of both a pre-test and a survey. Both the pre-test and the survey were geared toward understanding each individual’s statistical background and literacy upon entering the course. Students worked together through the lab assignments, which consisted of exercises in R, and were encouraged to study together for exams as well.

Student-based learning has been supported by studies showing that students absorb more when placed in small groups than when they work individually. This may be due to increased collaboration, since debate among team members usually reinforces their understanding of the concepts covered in the course. Especially because of the large class sizes of STA 101, which enroll at least 70 students, professors use team-based learning techniques to help students learn outside of a large lecture setting.

3.3 Dataset Background

The dataset has been manually compiled from 205 STA 101 final group projects spanning the 2013-2016 academic years. Thirteen data entries were discarded because the complete project submissions could not be recovered. The projects’ contents are located on the Duke Sakai sites of seven different classes, six of which were taught by Dr. Çetinkaya-Rundel. However, the professor has little direct effect on how students learn R over the course of the semester, since that learning occurs within the labs run by teaching assistants. Although there may be slight instructor effects on coding abilities, the teaching assistants are the primary educators of R, and they change each semester. Some sections may therefore have a stronger grasp of coding for the final project, but the difference is likely to be minimal, since teaching assistants are apprised of the upcoming week’s content via weekly meetings with the professor. Regardless, some discrepancies may be due to teaching assistants, and these are noted in the Results section.

The dataset constructed in this paper focuses primarily on actions taken by groups within the univariate and bivariate portions of the projects. Accordingly, the dataset contains variables summarizing relative creativity measures, as well as the projects’ depth and level of multivariate visualizations. Variable explanations are given in the following chapter.