S670 Problem set 4 Solved

Get the data
Download the following two files from https://datasets.imdbws.com/: title.ratings.tsv.gz and title.basics.tsv.gz. (You may need to download software to unzip the files, e.g. 7-Zip.) The unzipped files are in tab-separated value form. (Warning: the latter file is about half a gigabyte unzipped.)

Unzip the files and read them into R. Using read_tsv or read_delim is recommended, e.g.: read_tsv("title.ratings.tsv", na = "\\N", quote = '')

(Those are supposed to be straight quotes. You’ll get a few tens of thousands of warnings, but these pertain to the variable endYear, which is irrelevant for our purposes.) Merge the two data sets by the unique identifier tconst. We only want movies, so only keep data for which titleType is “movie.” After doing this, I ended up with about 240,000 movies.
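The read, merge, and filter steps above can be sketched as follows. This version uses base R (read.delim and merge) rather than read_tsv so that it runs without extra packages; the function name load_movies is a hypothetical helper, not part of the assignment, and the sketch assumes the column names in the downloaded files are unchanged.

```r
# Hypothetical helper: read both IMDb files, merge on tconst, keep movies.
load_movies <- function(ratings_path, basics_path) {
  read_imdb <- function(path) {
    read.delim(path,
               na.strings = "\\N",       # IMDb writes missing values as \N
               quote = "",               # titles contain unmatched quote marks
               comment.char = "",
               stringsAsFactors = FALSE)
  }
  ratings <- read_imdb(ratings_path)
  basics  <- read_imdb(basics_path)

  # Inner join: keeps only titles that appear in both files.
  merged <- merge(basics, ratings, by = "tconst")

  # Keep movies only; rows with a missing titleType are dropped.
  merged[!is.na(merged$titleType) & merged$titleType == "movie", ]
}

# Example call (paths are assumptions about where the files were unzipped):
# movies <- load_movies("title.ratings.tsv", "title.basics.tsv")
# nrow(movies)
```

Because the merge is an inner join, titles without any ratings are dropped at this stage.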

Questions
1.    Fit a model to predict a movie’s IMDb rating (variable averageRating) by year (startYear) and length (runtimeMinutes). You will have to make a number of modeling choices:

(a)    Do you need any transformations?

(b)    Should you fit a linear model or something curved?

(c)    Is an additive model adequate?

(d)    Do you need to filter out or downweight tail values to prevent the fit from being dominated by outliers?

(e)    Should you weight by number of votes?

Some of these choices are clear-cut, while others will be a matter of preference. You must justify all your choices. You’ll be graded on the justification, not the choice (unless the choice is really bad).

Note that computational concerns will also drive your modeling choices. For example, you will not be able to put all the data into a loess unless you have a supercomputer.
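As one illustration of how the choices above interact with the computational limit, the sketch below fits a loess to a random subsample instead of the full data. Everything in it is an assumption made for illustration rather than a recommended answer: the helper name fit_rating_loess, the log transformation of runtime, the sample size, the span, the interaction term, the robust family = "symmetric" fit, and the optional weighting by numVotes all stand in for decisions you would need to justify yourself.

```r
# Hypothetical helper; all defaults are illustrative assumptions.
fit_rating_loess <- function(movies, n_sample = 20000, span = 0.5,
                             weight_by_votes = FALSE, seed = 1) {
  vars <- c("averageRating", "startYear", "runtimeMinutes", "numVotes")

  # Drop rows with missing values or non-positive runtimes (log undefined).
  keep <- complete.cases(movies[, vars]) & movies$runtimeMinutes > 0
  d <- movies[keep, ]

  # One possible transformation: runtimes are strongly right-skewed.
  d$logRuntime <- log10(d$runtimeMinutes)

  # Subsample so that loess stays computationally feasible.
  set.seed(seed)
  if (nrow(d) > n_sample) d <- d[sample(nrow(d), n_sample), ]

  # Equal weights by default; optionally weight by number of votes.
  d$w <- if (weight_by_votes) d$numVotes else rep(1, nrow(d))

  # The interaction allows a non-additive surface; family = "symmetric"
  # downweights outlying ratings instead of filtering them out.
  loess(averageRating ~ startYear * logRuntime, data = d, weights = w,
        span = span, degree = 1, family = "symmetric")
}

# Example call, assuming `movies` is the merged movie data frame:
# fit <- fit_rating_loess(movies)
# predict(fit, data.frame(startYear = 1980, logRuntime = log10(100)))
```

Refitting with a different seed or sample size is a quick way to check whether the fitted surface is stable under subsampling.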
