In this homework, you will be doing action recognition using a Recurrent Neural Network (RNN), in particular a Long Short-Term Memory (LSTM) network. You will be given a dataset called UCF101, which consists of 101 different actions/classes; for each action, there will be 145 samples. We tagged each sample as either training or testing. Each sample is originally a short video, but we sampled 25 frames from each video to reduce the amount of data. Consequently, a training sample is a tuple of images that forms a 3D volume, with one dimension encoding the temporal correlation between frames, together with a label indicating what action it is.
To tackle this problem, we aim to build a neural network that captures not only the spatial information of each frame but also the temporal information between frames. Fortunately, you don't have to do this on your own: an RNN, a type of neural network designed to deal with time-series data, is right here for you to use. In particular, you will be using an LSTM for this task.
Instead of training an end-to-end neural network from scratch, which is computationally prohibitive, we divide the task into two steps: feature extraction and modelling. Below are the things you need to implement for this homework:
{35 pts} Feature extraction. Use any of the pre-trained models
(https://pytorch.org/docs/stable/torchvision/models.html) to extract features from each frame. Specifically, we recommend not using the activations of the last layer, as features tend to become task-specific towards the end of the network. Hints:
A good starting point is a pre-trained VGG16 network (torchvision.models.vgg16); we suggest using its first fully connected layer (4096-dim) as the feature of each video frame, as in the sketch after these hints. This will result in a 4096x25 matrix for each video.
Normalize your images using torchvision.transforms:

from torchvision import transforms

normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])
prep = transforms.Compose([transforms.ToTensor(), normalize])
img_tensor = prep(img)  # img is a PIL image
The mean and std above are specific to ImageNet data.
More details on image preprocessing in PyTorch can be found at http://pytorch.org/tutorials/beginner/data_loading_tutorial.html
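As a concrete illustration, here is a minimal sketch of the suggested feature extraction, assuming torchvision's stock VGG16 definition (the helper name extract_fc1 is ours, not part of any API):

import torch
from torchvision import models

# Load a pre-trained VGG16; eval() disables dropout so features are deterministic.
vgg16 = models.vgg16(pretrained=True)
vgg16.eval()

def extract_fc1(batch):
    # batch: (N, 3, 224, 224) tensor of normalized frames
    with torch.no_grad():
        x = vgg16.features(batch)      # conv features, (N, 512, 7, 7)
        x = x.view(x.size(0), -1)      # flatten to (N, 25088)
        return vgg16.classifier[0](x)  # first fully connected layer -> (N, 4096)

Applying this to each of the 25 frames of a video and stacking the results gives the 4096x25 matrix mentioned above.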
{35 pts} Modelling. With the extracted features, build an LSTM network which takes a dx25 sample as input (where d is the dimension of the extracted feature for each frame) and outputs the action label of that sample. A sketch of one possible network appears after this list.
{20 pts} Evaluation. After training your network, you need to evaluate your model on the testing data by computing the prediction accuracy (5 points). The baseline test accuracy for this data is 75%, and 10 of the 20 points are for achieving test accuracy greater than the baseline. Moreover, you need to compare (5 points) the result of your network with that of a support vector machine (SVM), stacking the dx25 feature matrix into a long vector and training an SVM (see the SVM sketch after this list).
{10 pts} Report. Details regarding the report can be found in the submission section below.
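To make the modelling step concrete, here is a minimal sketch of one possible LSTM classifier, assuming 4096-dim frame features and the first 25 classes; the hidden size of 512 and the name ActionLSTM are our choices, not requirements. Note that each dx25 feature matrix is transposed so the LSTM sees 25 time steps of d-dim inputs:

import torch
import torch.nn as nn

class ActionLSTM(nn.Module):
    def __init__(self, feat_dim=4096, hidden_dim=512, num_classes=25):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        # x: (batch, 25, feat_dim) -- one time step per frame
        _, (h_n, _) = self.lstm(x)  # h_n: (1, batch, hidden_dim), final hidden state
        return self.fc(h_n[-1])     # class logits: (batch, num_classes)

model = ActionLSTM()
logits = model(torch.randn(8, 25, 4096))  # e.g. a batch of 8 videos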
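For the SVM comparison, a minimal sketch using scikit-learn, where train_feats/test_feats are assumed to be arrays of shape (n, d, 25) holding the extracted features and y_train/y_test the labels (these names are ours, for illustration only):

from sklearn.svm import LinearSVC

# Stack each (d, 25) feature matrix into a single d*25 vector.
X_train = train_feats.reshape(len(train_feats), -1)
X_test = test_feats.reshape(len(test_feats), -1)

clf = LinearSVC()  # a linear kernel keeps the ~100k-dim input tractable
clf.fit(X_train, y_train)
print("SVM test accuracy:", clf.score(X_test, y_test))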
Notice that the size of the raw images is 256x340, whereas your pre-trained model might take nxn images as input. To solve this problem, instead of resizing the images, which unfavorably changes the aspect ratio, we take a better approach: crop five nxn images, one at the image center and four at the corners, compute the d-dim features for each of them, and average these five d-dim features to get the final feature representation of the raw image. For example, VGG takes 224x224 images as input, so we take the five 224x224 crops of the image, compute 4096-dim VGG features for each of them, and then take the mean of these five 4096-dim vectors as the representation of the image.
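Here is a minimal sketch of this five-crop averaging using torchvision's FiveCrop transform; it reuses the normalize transform from the hints above and the extract_fc1 helper from the earlier sketch, both of which are assumptions of this illustration:

import torch
from torchvision import transforms
import torchvision.transforms.functional as TF

normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])

# FiveCrop returns the four corner crops plus the center crop.
five_crop = transforms.Compose([
    transforms.FiveCrop(224),
    transforms.Lambda(lambda crops: torch.stack(
        [normalize(TF.to_tensor(c)) for c in crops])),
])

crops = five_crop(img)          # (5, 3, 224, 224) for one raw 256x340 frame
feats = extract_fc1(crops)      # (5, 4096)
frame_feat = feats.mean(dim=0)  # final 4096-dim representation of the frame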
To save you computation time, you only need to do the classification task for the first 25 classes of the whole dataset. The same applies to those who have access to GPUs. Bonus 10 points for running and reporting on the entire 101 classes.
Dataset
Download the dataset at UCF101 (http://vision.cs.stonybrook.edu/~yangwang/public/UCF101_images.tar) (image data for each video). The annos folder, which contains the video labels and the label-to-class-name mapping, is included in the assignment folder uploaded.
The UCF101 dataset contains 101 actions and 13,320 videos in total.
annos/actions.txt
    lists all the actions (ApplyEyeMakeup, ..., YoYo)
annos/videos_labels_subsets.txt
    lists all the videos (v_000001, ..., v_013320), their labels (1, ..., 101), and their subsets (1 for train, 2 for test)
images/
    each folder represents a video
    the video/folder name to class mapping can be found using annos/videos_labels_subsets.txt; e.g., v_000001 belongs to class 1, i.e., ApplyEyeMakeup
    each video folder contains 25 frames
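To get started with the annotations, here is a small parsing sketch; it assumes videos_labels_subsets.txt has three whitespace-separated columns (video id, label, subset), so adjust the split if the actual format differs:

train_set, test_set = [], []
with open("annos/videos_labels_subsets.txt") as f:
    for line in f:
        video_id, label, subset = line.split()
        entry = (video_id, int(label))
        if int(subset) == 1:       # 1 = train
            train_set.append(entry)
        else:                      # 2 = test
            test_set.append(entry)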