Project 2: Walmart Stores Forecasting
You are provided with historical sales data for 45 Walmart stores located in different regions. Each store contains many departments. The goal is to predict the future weekly sales for each department in each store based on the historical data.
Source
You can find the data (only train.csv), relevant information, and some sample code on Kaggle (https://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting). Note that ONLY the training data is used in this project and our evaluation procedure is different from the one on Kaggle.
Name your main file mymain.R. If you have multiple R files, upload them as a zip file. Our evaluation code looks like the following:
library(tidyverse)
source("mymain.R")

# read in train / test dataframes
train <- readr::read_csv('train.csv')
test <- readr::read_csv('test.csv', col_types = list(
  Weekly_Pred1 = col_double(),
  Weekly_Pred2 = col_double(),
  Weekly_Pred3 = col_double()
))

# save weighted mean absolute error (WMAE)
num_folds <- 10
wae <- tibble(
  model_one = rep(0, num_folds),
  model_two = rep(0, num_folds),
  model_three = rep(0, num_folds)
)

# time-series CV
for (t in 1:num_folds) {
  # *** THIS IS YOUR PREDICTION FUNCTION ***
  mypredict()

  # Load fold file
  # You should add this to your training data in the next call
  # to mypredict()
  fold_file <- paste0('fold_', t, '.csv')
  new_test <- readr::read_csv(fold_file)

  # extract predictions matching up to the current fold
  scoring_tbl <- new_test %>%
    left_join(test, by = c('Date', 'Store', 'Dept'))

  # compute WMAE
  actuals <- scoring_tbl$Weekly_Sales
  preds <- select(scoring_tbl, contains('Weekly_Pred'))
  weights <- if_else(scoring_tbl$IsHoliday.x, 5, 1)
  wae[t, ] <- colSums(weights * abs(actuals - preds)) / sum(weights)
}

# save results to a file for grading
readr::write_csv(wae, 'Error.csv')
· train.csv: 5 columns ("Store", "Dept", "Date", "Weekly_Sales", "IsHoliday"), same as the train.csv file on Kaggle but ranging from 2010-02 to 2011-02.
· test.csv: 7 columns ("Store", "Dept", "Date", "IsHoliday", "Weekly_Pred1", "Weekly_Pred2", "Weekly_Pred3"), built from the same Kaggle training data but ranging from 2011-03 to 2012-10, with Weekly_Sales removed and the last three (prediction) columns initialized to zero.
· fold_1.csv, ..., fold_10.csv: 5 columns ("Store", "Dept", "Date", "Weekly_Sales", "IsHoliday"), same format as the train.csv file on Kaggle; each fold covers one two-month period, from 2011-03 through 2012-10.
· In your mypredict() function, save the results from your three prediction models in the corresponding rows of the data set "test".
· The evaluation metric is the same as the one described on Kaggle; see the formula below.
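For reference, that metric is the weighted mean absolute error (WMAE), which the evaluation code above computes:

$$\mathrm{WMAE} = \frac{1}{\sum_i w_i} \sum_i w_i \, \lvert y_i - \hat{y}_i \rvert, \qquad
w_i = \begin{cases} 5 & \text{if week } i \text{ is a holiday week} \\ 1 & \text{otherwise,} \end{cases}$$

where y_i is the actual weekly sales and ŷ_i is the corresponding prediction.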
The evaluation process for Python code is similar.
You are required to build three prediction models. Always include a simple model, i.e., one that does not require much training; for example, predict the sales for the next month using some average of the sales from the previous month or months.
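As a purely illustrative sketch of such a baseline (the function name naive_forecast and the four-week window are arbitrary choices, not part of the assignment), one could average each department's most recent observed weeks:

library(dplyr)

# Illustrative baseline: predict each (Store, Dept)'s future weekly sales
# as the mean of its most recent 4 observed weeks in `train`.
naive_forecast <- function(train, n_weeks = 4) {
  train %>%
    group_by(Store, Dept) %>%
    arrange(Date, .by_group = TRUE) %>%
    slice_tail(n = n_weeks) %>%
    summarise(naive_pred = mean(Weekly_Sales), .groups = "drop")
}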
Frequently Asked Questions
· Will you give us the training and test datasets?
The train.csv and test.csv for our evaluation are generated from the training data on Kaggle using the following code. (You can download "train.csv.zip" from Kaggle or from the Resources page.)
library(lubridate)
library(tidyverse)

# read raw data and extract date column
train_raw <- readr::read_csv(unz('train.csv.zip', 'train.csv'))
train_dates <- train_raw$Date

# training data from 2010-02 to 2011-02, i.e., 13 months
start_date <- ymd("2010-02-01")
end_date <- start_date %m+% months(13)

# split dataset into training / testing
train_ids <- which(train_dates >= start_date & train_dates < end_date)
train <- train_raw[train_ids, ]
test <- train_raw[-train_ids, ]

# write the training data to a file
readr::write_csv(train, 'train.csv')

# create test.csv:
# remove Weekly_Sales and add the three model prediction columns
test %>%
  select(-Weekly_Sales) %>%
  mutate(Weekly_Pred1 = 0, Weekly_Pred2 = 0, Weekly_Pred3 = 0) %>%
  readr::write_csv('test.csv')

# create 10-fold time-series CV
num_folds <- 10
test_dates <- train_dates[-train_ids]

# month 1 --> 2011-03, and month 20 --> 2012-10.
# Fold 1 : month 1 & month 2, Fold 2 : month 3 & month 4, ...
for (i in 1:num_folds) {
  # filter fold for dates
  start_date <- ymd("2011-03-01") %m+% months(2 * (i - 1))
  end_date <- ymd("2011-05-01") %m+% months(2 * (i - 1))
  test_fold <- test %>%
    filter(Date >= start_date & Date < end_date)

  # write fold to a file
  readr::write_csv(test_fold, paste0('fold_', i, '.csv'))
}
· What do we need to do in "mypredict()", a function that takes no input and produces no output?
In R, variables like train, test, and t are global variables from the perspective of "mypredict", so your prediction function can access them and even change their values using the "<<-" assignment operator, as in this toy example:
f <- function() {
  print(x^2)    # reads the global x
  x <<- 2 * x   # modifies the global x
}

x <- 3
f()   # prints 9
x     # x is now 6
Sourcing mymain.R essentially loads the function "mypredict" into the workspace. When mypredict() is called for each t, it needs to do the following:
o If t > 1, append new_test (the data from the previous fold) to your training data;
o Update your model with the new training data, or only update it periodically once enough new data has accumulated (up to you);
o Apply your current models to fill in the last three columns of "test" for the t-th two-month period (see the sketch below).
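A minimal skeleton along these lines might look as follows. It is only a sketch: the model fits are left as placeholders, and it assumes the global objects train, test, new_test, and t created by the evaluation script above.

library(dplyr)
library(lubridate)

mypredict <- function() {
  # after the first fold, the previous fold's data (new_test) becomes
  # additional training data
  if (t > 1) {
    train <<- bind_rows(train, new_test)
  }

  # the two-month window covered by fold t (month 1 = 2011-03)
  start_date <- ymd("2011-03-01") %m+% months(2 * (t - 1))
  end_date <- start_date %m+% months(2)
  idx <- which(test$Date >= start_date & test$Date < end_date)

  # fit / update your three models on `train`, then fill in the
  # prediction columns of `test` for this fold (placeholders below)
  test$Weekly_Pred1[idx] <<- 0  # replace with predictions from model 1
  test$Weekly_Pred2[idx] <<- 0  # replace with predictions from model 2
  test$Weekly_Pred3[idx] <<- 0  # replace with predictions from model 3
}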
· Will you give us some materials on dealing with time series data sets?
Check Walmart_Sample_Code.html on the Resources page.
The R package forecast is designed for time series data; it is related to the stl() decomposition used in Walmart_Sample_Code.html.
On the other hand, if we create features describing the history at time t (e.g., let x_t be a two-dimensional feature vector containing the sales from the previous two weeks), then we can use linear regression models; a sketch follows below.
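A rough sketch of that idea (the column names lag1 and lag2 and the plain lm() fit are only illustrative):

library(dplyr)

# build lagged-sales features per (Store, Dept) series:
# lag1 = sales one week earlier, lag2 = sales two weeks earlier
train_lagged <- train %>%
  group_by(Store, Dept) %>%
  arrange(Date, .by_group = TRUE) %>%
  mutate(lag1 = lag(Weekly_Sales, 1),
         lag2 = lag(Weekly_Sales, 2)) %>%
  ungroup() %>%
  filter(!is.na(lag1), !is.na(lag2))

# ordinary linear regression on the lag features
fit <- lm(Weekly_Sales ~ lag1 + lag2 + IsHoliday, data = train_lagged)

Note that for forecasts more than a couple of weeks ahead the lag features are not yet observed, so they would have to be filled in recursively with earlier predictions or replaced by longer lags.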
· Some departments, such as dept 99 in some stores, do not have any values in the first year. How should we predict without data, and how will you evaluate our predictions on values that are missing from the original training data?
You can go through the discussion forum on Kaggle to check how others handle prediction with missing history. The simplest solution is to predict zero, or some kind of average (e.g., the store average); a sketch is given below. Also check Walmart_Sample_Code.html.
Evaluation with missing data: if an observation is missing in, say, 2011-03, we will simply skip that observation in the evaluation.
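For instance, here is a sketch of that fallback logic (the names dept_avg, store_avg, and pred are only illustrative):

library(dplyr)

# historical average per (Store, Dept), falling back to the store-level
# average when a department has no history, and to zero otherwise
dept_avg <- train %>%
  group_by(Store, Dept) %>%
  summarise(dept_mean = mean(Weekly_Sales), .groups = "drop")
store_avg <- train %>%
  group_by(Store) %>%
  summarise(store_mean = mean(Weekly_Sales), .groups = "drop")

preds <- test %>%
  left_join(dept_avg, by = c("Store", "Dept")) %>%
  left_join(store_avg, by = "Store") %>%
  mutate(pred = coalesce(dept_mean, store_mean, 0))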