The purpose of this assignment is to make sure every student has a fundamental understanding of the topic of their tutorial and has the same foundation for the next tutorials. For each tutorial, you will receive a different survey paper which gives an overview of the state of the art on (a subset of) the topic of your tutorial. You can find the survey paper on Wattle, under Tutorial Material, in the folder “Assignment 1”.
The assignment consists of two parts. Part 1 contains general questions, which are the same for every tutorial. Part 2 contains paper-specific questions. You should be able to answer all questions after carefully reading the survey paper. We estimate that the assignment should take you about 7.5-10 hours to complete.
Full reference of the survey paper: Zhang, D., Cao, R., & Wu, S. (2019). Information Fusion in Visual Question Answering: A Survey. Information Fusion, 52, 268-280.
https://doi.org/10.1016/j.inffus.2019.03.005
Note: Short and precise answers are preferred. Answer in your own words. Please keep each answer to around 250 words unless the question states a different limit.
Part 1: General Questions (7.5 marks)
1. Which branch in the survey paper do you find most interesting, and why? (1 mark)
2. Write a summary of the branch you picked in your own words. (maximum 500 words, 2 marks)
3. Which three papers would you read next if you were to do a research project on that branch? Please explain why you would pick these papers and give their full references. (1.5 marks)
4. Find and list at least two research groups that conduct state-of-the-art research on this topic. Please justify your answer. (1 mark)
5. Name two open research problems in the field of this survey paper and explain why they are hard and interesting. (max 500 words, 2 marks)
Part 2: Paper-specific Questions (7.5 marks)
1. What are the significant steps for an end-to-end VQA model? For each step, what are the possible techniques? (1 mark)
2. Why is the attention mechanism effective for VQA models? (1.5 marks)
3. Is the following statement on the attention mechanism correct? Please justify your answer. (1.5 marks)
“Outputs of attention layers are individual visual and linguistic feature vectors (i.e. Vi and Vq), which cannot be directly used to generate answers. The fusion of the two feature channels is always needed to get a joint representation, no matter whether the two features are attended or not. Therefore, we should regard the attention mechanism as acting more like feature extraction than feature fusion in a VQA model.”
4. Among all the fusion methods, which one do you think outperforms the others? Please justify your answer. (1.5 marks)
5. Why is information fusion important to VQA? Is information fusion essential for other visual-linguistic problems? You may answer from either the survey’s perspective or your own opinion. (1.5 marks)
6. After reading through the whole survey paper, can you design a combination of techniques that may result in the best performance? You may refer to Fig. 3 in Section 4 for the end-to-end framework. (0.5 marks)