1. Dataset Links
(a) Train Images: Link
(b) Train Captions: Link
(c) Public Test Images: Link
(d) Public Test Captions: Link
(e) Private Test Images: Coming Soon
2. Non-Competitive Part
You are given a dataset of images with 5 possible captions for each image. The images are named image id.jpg, where id is the image index. The captions.tsv file contains the captions of all the images; each tab-separated line holds an image id followed by its 5 corresponding captions. You will use the Encoder-Decoder architecture to model this problem: an encoder encodes the input into a vector representation, and a decoder generates the output sequence auto-regressively (one word at a time) from that representation. In this part you have to implement an encoder for the images and a decoder for the text captions, along with beam search to find the most likely sequence. You can use this as a starting point.
(a) Encoder: Design a CNN-based encoder that handles variable-sized images (a minimal sketch appears after this list).
(b) Decoder: Design an RNN/LSTM-based decoder that generates a caption given the encoded image input (see the decoder sketch after this list).
(c) Training Setup: Use cross-entropy as the loss function and teacher forcing for training the decoder. Don't forget to use START and END tokens to allow variable-length caption outputs from the decoder (a sketch of one training step follows the list).
(d) Inference at test time: Instead of taking the token with the maximum probability at each step (Greedy Decoding), you will use Beam Search to generate the tokens (words) in your captions. Beam Search is a heuristic search method that keeps the k best partial sequences at each step and generally finds better sequences than simple Greedy Decoding; a sketch follows the list. Read Section 4 of this pdf document. We also recommend that you read Sections 1.2, 1.3, and 1.4 for a better understanding.
(e) Evaluation: You have to generate the top-5 captions for each image using Beam Search as described above. We will use BLEU (BiLingual Evaluation Understudy) scores to evaluate your model's performance; you can read more about it in Section 5.3 of this pdf document. An evaluation sketch follows below.
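The sketch below illustrates part (a), assuming PyTorch (the assignment does not prescribe a framework). The class name CNNEncoder, the layer sizes, and embed_dim are illustrative choices, not requirements. The key idea is adaptive average pooling, which maps any spatial size to a fixed grid so the encoder accepts variable-sized images.

import torch
import torch.nn as nn

class CNNEncoder(nn.Module):
    """Encodes a variable-sized image into a fixed-length vector."""
    def __init__(self, embed_dim=256):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(128, 256, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Adaptive pooling maps any H' x W' feature map to 1 x 1, which is
        # what lets the encoder handle variable-sized inputs.
        self.pool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(256, embed_dim)

    def forward(self, images):               # images: (B, 3, H, W), any H, W
        x = self.features(images)            # (B, 256, H', W')
        x = self.pool(x).flatten(1)          # (B, 256)
        return self.fc(x)                    # (B, embed_dim)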
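A matching decoder sketch for part (b), again assuming PyTorch. Initializing the LSTM state from the image vector is one common design choice (feeding the image vector as the first input step is another); vocab_size, hidden_dim, and the class name LSTMDecoder are assumptions.

import torch.nn as nn

class LSTMDecoder(nn.Module):
    """Produces caption logits from the encoded image, one token per step."""
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)
        # Project the image vector into the initial LSTM state.
        self.init_h = nn.Linear(embed_dim, hidden_dim)
        self.init_c = nn.Linear(embed_dim, hidden_dim)

    def forward(self, image_vec, captions):
        # image_vec: (B, embed_dim); captions: (B, T) token ids starting with START
        h0 = self.init_h(image_vec).unsqueeze(0)   # (1, B, hidden_dim)
        c0 = self.init_c(image_vec).unsqueeze(0)
        emb = self.embed(captions)                 # (B, T, embed_dim)
        out, _ = self.lstm(emb, (h0, c0))          # (B, T, hidden_dim)
        return self.fc(out)                        # (B, T, vocab_size) logits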
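For part (c), one teacher-forced training step might look as follows. The caption shifted right serves as decoder input and the caption shifted left as the target, and PAD positions are excluded from the cross-entropy via ignore_index. The pad_id argument and the batching assumption (images resized or padded to a common size within a batch) are mine, not the assignment's.

import torch.nn.functional as F

def train_step(encoder, decoder, optimizer, images, captions, pad_id):
    # captions: (B, T) = [START, w1, ..., wn, END, PAD, ...]
    optimizer.zero_grad()
    image_vec = encoder(images)                           # (B, embed_dim)
    inputs, targets = captions[:, :-1], captions[:, 1:]   # teacher forcing: shift by one
    logits = decoder(image_vec, inputs)                   # (B, T-1, vocab_size)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1), ignore_index=pad_id)
    loss.backward()
    optimizer.step()
    return loss.item()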
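A beam-search sketch for part (d), written against the LSTMDecoder above for a single image. Each beam stores its token sequence, cumulative log-probability, and LSTM state; at every step all unfinished beams are expanded and only the beam_width best survive. Length normalization, batching, and device handling are omitted for clarity.

import torch

@torch.no_grad()
def beam_search(decoder, image_vec, start_id, end_id, beam_width=5, max_len=30):
    # image_vec: (1, embed_dim) for a single image
    h = decoder.init_h(image_vec).unsqueeze(0)
    c = decoder.init_c(image_vec).unsqueeze(0)
    beams = [([start_id], 0.0, (h, c))]        # (tokens, log-prob so far, state)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score, state in beams:
            if tokens[-1] == end_id:           # beam already produced END
                finished.append((tokens, score))
                continue
            emb = decoder.embed(torch.tensor([[tokens[-1]]]))        # (1, 1, E)
            out, new_state = decoder.lstm(emb, state)
            log_probs = torch.log_softmax(decoder.fc(out[:, -1]), dim=-1)
            top_lp, top_ix = log_probs[0].topk(beam_width)
            for lp, ix in zip(top_lp.tolist(), top_ix.tolist()):
                candidates.append((tokens + [ix], score + lp, new_state))
        if not candidates:                     # every beam has finished
            break
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
    finished.extend((t, s) for t, s, _ in beams)
    return sorted(finished, key=lambda f: f[1], reverse=True)[:beam_width]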
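Finally, for part (e), a small evaluation sketch. NLTK's corpus_bleu is one way to compute BLEU (the assignment does not prescribe a library); here each image contributes its 5 tokenized reference captions, the single best beam serves as the hypothesis, and smoothing avoids zero scores on short captions.

from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

def evaluate_bleu(references, hypotheses):
    # references: one list of 5 tokenized reference captions per image
    # hypotheses: one tokenized generated caption per image (e.g. the best beam)
    smooth = SmoothingFunction().method1
    return corpus_bleu(references, hypotheses, smoothing_function=smooth)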