2016 Spring Semester STA 101 Final Project Assignment Document
Listed below is the final project assignment document, which includes a codebook for the movies dataset, given to students who enrolled in the course in the Spring 2016 iteration:
You and your teammates work for Paramount Pictures.
Your boss has just acquired data about how much audiences and critics like movies as well as numerous other variables about the movies.
She is interested in learning what attributes make a movie popular. She is also interested in learning something new about movies. She wants your team to figure it all out.
As part of this project you will complete exploratory data analysis (EDA), inference, modeling, and prediction. You have been introduced to some of these concepts already, and you will learn about the others later in the course.
The data can be loaded directly in RStudio using the following command:
load(url("https://stat.duke.edu/~mc301/data/movies.Rdata"))
Data
The data set is comprised of 651 randomly sampled movies produced and released before 2016.
Some of these variables are only there for informational purposes and do not make any sense to include in a statistical analysis. It is up to you to decide which variables are meaningful and which should be omitted. For example information in the the actor1
through actor5
variables was used to determine whether the movie casts an actor or actress who won a best actor or actress Oscar.
You might also choose to omit certain observations or restructure some of the variables to make them suitable for answering your research questions.
When you are fitting a model you should also be careful about collinearity, as some of these variables may be dependent on each other.
Codebook
title
: Title of movie
title_type
: Type of movie (Documentary, Feature Film, TV Movie)
genre
: Genre of movie (Action & Adventure, Comedy, Documentary, Drama, 1. Horror, Mystery & Suspense, Other)
runtime
: Runtime of movie (in minutes)
mpaa_rating
: MPAA rating of the movie (G, PG, PG-13, R, Unrated)
studio
: Studio that produced the movie
thtr_rel_year
: Year the movie is released in theaters
thtr_rel_month
: Month the movie is released in theaters
thtr_rel_day
: Day of the month the movie is released in theaters
dvd_rel_year
: Year the movie is released on DVD
dvd_rel_month
: Month the movie is released on DVD
dvd_rel_day
: Day of the month the movie is released on DVD
imdb_rating
: Rating on IMDB
imdb_num_votes
: Number of votes on IMDB
critics_rating
: Categorical variable for critics rating on Rotten Tomatoes 1. (Certified Fresh, Fresh, Rotten)
critics_score
: Critics score on Rotten Tomatoes
audience_rating
: Categorical variable for audience rating on Rotten Tomatoes 1. (Spilled, Upright)
audience_score
: Audience score on Rotten Tomatoes (response variable)
best_pic_nom
: Whether or not the movie was nominated for a best picture 1. Oscar (no, yes)
best_pic_win
: Whether or not the movie won a best picture Oscar (no, yes)
best_actor_win
: Whether or not one of the main actors in the movie ever won an Oscar (no, yes) – note that this is not necessarily whether the actor won an Oscar for their role in the given movie
best_actress win
: Whether or not one of the main actresses in the movie ever won an Oscar (no, yes) – not that this is not necessarily whether the actresses won an Oscar for their role in the given movie
best_dir_win
: Whether or not the director of the movie ever won an Oscar (no, yes) – not that this is not necessarily whether the director won an Oscar for the given movie
top200_box
: Whether or not the movie is in the Top 200 Box Office list on BoxOfficeMojo (no, yes)
director
: Director of the movie
actor1
: First main actor/actress in the abridged cast of the movie
actor2
: Second main actor/actress in the abridged cast of the movie
actor3
: Third main actor/actress in the abridged cast of the movie
actor4
: Fourth main actor/actress in the abridged cast of the movie
actor5
: Fifth main actor/actress in the abridged cast of the movie
imdb_url
: Link to IMDB page for the movie
rt_url
: Link to Rotten Tomatoes page for the movie
Stages of the project
You will complete this project in two stages:
- Stage 1: Proposal (25 points)
- Stage 2: Poster and presentation (75 points)
The remainder of this document outlines the requirements and expectations for both stages of the project. You should read the entire document before getting started. The requirements and expectations for Stage 1 will only make sense in context of those for Stage 2.
Stage 1: Proposal (25 points)
Content
Your proposal should contain the following:
Data: (2 points) Describe how the observations in the sample are collected, and the implications of this data collection method on the scope of inference (generalizability / causality).
Research questions: (6 points) Come up with at least three research questions that you want to answer using these data. You should phrase your research questions in a way that matches up with the scope of inference your dataset allows for. Make sure that at least two of these questions involve at least three variables. You are welcomed to create new variables based on existing ones. Note that you will have the option to update / revise / change these questions for your poster at the end of the semester.
EDA: (9 points) Perform exploratory data analysis that adresses each of the three research questions you outlined above. Your EDA should contain numerical summaries and visualizations. Each R output and plot should be accompanied by a brief interpretation.
Timeline: (4 points) Sketch out a timeline for the work you will do to complete this project. Be as detailed and precise as possible. And be realistic – discuss course schedules, travel plans, etc.
Teamwork: (4 points) Describe in detail how you will divvy up the work between team members and what aspects of the project you will complete together as a team. Note that during the poster session each member needs to be able to answer questions about all aspects of the work, regardless of whether they took the lead on that section or not.
Grading
Your proposal will be graded out of 25 points (as outlined above), and will make up 25% of your overall project score.
The following will result in deductions:
- Late: -1 points for each day late
- Reproducibility issues, requiring to make changes to the R Markdown file to knit the document: -3 points
- Each page over limit: -2 points per page (view print preview to confirm length)
Stage 2: Poster and presentation (75 points)
Content
Introduction: Outline your main research question(s).
EDA: Do some exploratory data analysis to tell an “interesting” story about movies. Instead of limiting yourself to relationships between just two variables, broaden the scope of your analysis and employ creative approaches that evaluate relationships between two variables while controlling for another.
Inference: Use one of your research questions (or come up with a new one depending on feedback from the proposal) that can be answered with a hypothesis test or a confidence interval, e.g. “Is there a difference in mean audience scores between genres?” or “What is the average difference in audience scores between movies that do and do not feature without oscar winner actors?” This question could be used to shed some light on your choice of the “best” linear model.
Carry out the appropriate inference task to answer your question.
Modeling: Develop a multiple linear regression model to predict a numerical variable in the dataset.
Prediction: Pick a movie from 2015 (a new movie that is not in the sample) and do a prediction for this movie using your the model you developed (and the predict
function in R). Also quantify the uncertainty around this prediction using an appropriate interval.
Conclusion: A brief summary of your findings from the previous sections without repeating your statements from earlier as well as a discussion of what you have learned about the data and your research question(s). You should also discuss any shortcomings of your current study (either due to data collection or methodology) and include ideas for possible future research.
Poster format & length
Poster: We suggest using a tri-fold poster. You can organize it however you like. You do not need to get your poster professionally printed. You can see sample posters from previous years at your professor’s office.
R Markdown: All code used to generate the statistics and plots on your poster should be organized and submitted in an R Markdown document.
Download the template for the project:
download.file("http://stat.duke.edu/courses/Spring16/sta101.001/rmd/sta101_project.Rmd", destfile = "sta101_project.Rmd")
There is no length limit for this document.
Grading
Your poster (and accompanying code) and presentation will be graded out of 75 points, and will make up 75% of your overall project score.
Grading of the project will take into account:
- Correctness: Are the procedures and explanations correct?
- Presentation: What was the quality of the presentation and poster?
- Content/Critical thought: Did your think carefully about the problem?
- Tidyness: Is your code organized well?
Your team scores will be based on the following components:
- 25 points - poster
- 20 points - presentation
- 20 points - code
- 10 points - classmates’ evaluation
Submission
Online on Sakai under Assignments. These will be time stamped, and late penalty will be applied based on the time stamp. Only one submission per team required.
- R Markdown file (.Rmd)
- HTML output (.html)
We will download your R Markdown file and run your code to confirm reproducibility of your work. Grading will be based on the document we compile, so make sure that your R Markdown file contains everything necessary to compile your entire work.
Teamwork and grading
Team scores for both the proposal and the poster will be adjusted based on team peer evaluation data to determine each student’s indivudual grade. You will be asked to fill out a survey where you rate the contribution of each team member. Filling out the survey is a prerequisite for receiving a project score.
All team members must be present at the poster session. Failure to do so will result in a 0 on the project for the absent team member.
Note that each student must complete the project and score at least 30% of total possible points on the project in order to pass this class.
Honor code
You may not discuss this project in any way with anyone outside your team, besides the professor and TAs. Failure to abide by this policy will result in a 0 for all teams involved.
Tips
This project is an opportunity to apply what you have learned about descriptive statistics, graphical methods, correlation and regression, and hypothesis testing and confidence intervals.
The goal is not to do an exhaustive data analysis i.e., do not calculate every statistic and procedure you have learned for every variable, but rather to show that you are proficient at using R at a basic level and that you are proficient at interpreting and presenting the results.
You might consider critiquing your own method, such as issues pertaining to the reliability of the data and the appropriateness of the statistical analysis you used within the context of this specific data set.