$25
Today we’ll be using the Weekly dataset from the ISLR package. This data is similar to the Smarket data from class. The dataset contains 1089 weekly returns from the beginning of 1990 to the end of 2010. Make sure that you have the ISLR package installed and loaded by running (without the code commented out) the following:
# install.packages("ISLR") library(ISLR)
## Warning: package 'ISLR' was built under R version 4.0.2
We’d like to see if we can accurately predict the direction of a week’s return based on the returns over the last five weeks. Today gives the percentage return for the week considered and Year provides the year that the observation was recorded. Lag1 - Lag5 give the percentage return for 1 - 5 weeks previous and Direction is a factor variable indicating the direction (‘UP’ or ‘DOWN’) of the return for the week considered.
Part 1: Visualizing the relationship between this week’s returns and the previous week’s returns.
1. Explore the relationship between a week’s return and the previous week’s return. You should plot more graphs for yourself, but include in the lab write-up a scatterplot of the returns for the weeks considered (Today) vs the return from two weeks previous (Lag2), and side-by-side boxplots for the lag one week previous (Lag1) divided by the direction of this week’s Reuther (Direction).
Returns
Two Weeks Ago
Returns
Direction
Part 2: Building a classifier
Recall the KNN procedure. We classify a new point with the following steps: – Calculate the Euclidean distance between the new point and all other points.
– Create the set Nnew containing the K closest points (or, nearest neighbors) to the new point.
– Determine the number of ‘UPs’ and ‘DOWNs’ in Nnew and classify the new point according to the most frequent.
2. We’d like to perform KNN on the Weekly data, as we did with the Smarket data in class. In class we wrote the following function which takes as input a new point (Lag1new,Lag2new) and provides the KNN decision using as defaults K = 5, Lag1 data given in Smarket$Lag1, and Lag2 data given in Smarket$Lag2. Update the function to calculate the KNN decision for weekly market direction using the Weekly dataset with Lag1 - Lag5 as predictors. Your function should have only three input values: (1) a new point which should be a vector of length 5, (2) a value for K, and (3) the Lag data which should be a data frame with five columns (and n rows).
KNN.decision <- function(Lag1.new, Lag2.new, K = 5, Lag1 = Smarket$Lag1, Lag2 = Smarket$ n <- length(Lag1)
stopifnot(length(Lag2) == n, length(Lag1.new) == 1, length(Lag2.new) == 1, K <= n)
dists <- sqrt((Lag1-Lag1.new)^2 + (Lag2-Lag2.new)^2) neighbors <- order(dists)[1:K] neighb.dir <- Smarket$Direction[neighbors] choice <- names(which.max(table(neighb.dir))) return(choice)
}
Lag2) {
3. Now train your model using data from 1990 - 2008 and use the data from 2009-2010 as test data. To do this, divide the data into two data frames, test and train. Then write a loop that iterates over the test points in the test dataset calculating a prediction for each based on the training data with K = 5. Save these predictions in a vector. Finally, calculate your test error, which you should store as a variable named test.error. The test error calculates the proportion of your predictions which are incorrect (don’t match the actual directions).
4. Do the same thing as in question 3, but instead use K = 3. Which has a lower test error?
Part 3: Cross-validation
Ideally we’d like to use our model to predict future returns, but how do we know which value of K to choose? We could choose the best value of K by training with data from 1990 - 2008, testing with the 2009 - 2010 data, and selecting the model with the lowest test error as in the previous section. However, in order to build the best model, we’d like to use ALL the data we have to train the model. In this case, we could use all of the Weekly data and choose the best model by comparing the training error, but unfortunately this isn’t usually a good predictor of the test error.
In this section, we instead consider a class of methods that estimate the test error rate by holding out a
(random) subset of the data to use as a test set, which is called k-fold cross validation. (Note this lower case k is different than the upper case K in KNN. They have nothing to do with each other, it just happens that the standard is to use the same letter in both.) This approach involves randomly dividing the set of observations into k groups, or folds, of equal size. The first fold is treated as a test set, and the model is fit on the remaining k − 1 folds. The error rate, ERR1, is then computed on the observations in the held-out fold. This procedure is repeated k times; each time, a different group of observations is treated as a test set. This process results in k estimates of the test error: ERR1, ERR2, ..., ERRk. The k-fold CV estimate of the test error is computed by averaging these values,
k
1 X
CV(k) = ERRk.
k
i=1
We’ll run a 9-fold cross-validation in the following. Note that we have 1089 rows in the dataset, so each fold will have exactly 121 members.
5. Create a vector fold which has n elements, where n is the number of rows in Weekly. We’d like for the fold vector to take values in 1-9 which assign each corresponding row of the Weekly dataset to a fold. Do this in two steps: (1) create a vector using rep() with the values 1-9 each repeated 121 times (note 1089 = 121 · 9), and (2) use sample() to randomly reorder the vector you created in (1).
6. Iterate over the 9 folds, treating a different fold as the test set and all others the training set in each iteration. Using a KNN classifier with K = 5 calculate the test error for each fold. Then calculate the cross-validation approximation to the test error which is the average of ERR1, ERR2, ..., ERR9.
7. Repeat step (6) for K = 1, K = 3, and K = 7. For which set is the cross-validation approximation to the test error the lowest?