3 Dataset (10 Points)
The training dataset provided contains two training video sequences, (A) TUD-Campus and (B) TUD-Stadtmitte, and a single testing sequence, (C) TUD-Crossing. All videos are provided at a fixed resolution of 640x480 with fixed perspectives. Bounding box annotations are provided for sequences A and B. Students are required to parse a ground-truth gt.txt file in order to extract bounding box information for a specific sequence on a frame-by-frame basis. To accelerate label extraction, a Colab notebook is provided. The .txt files contain the frame number, bounding box id, bounding box coordinates, and additional values, including confidence and 3D coordinates, that should not be considered.
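As a reference point, a minimal parsing sketch is given below. It assumes the comma-separated column order described above (frame, id, box left, box top, box width, box height, then confidence and 3D coordinates, which are ignored); the example file path is illustrative.

```python
import csv
from collections import defaultdict

def load_ground_truth(gt_path):
    """Parse a gt.txt into {frame_number: [(box_id, x, y, w, h), ...]}.

    Assumes the column order described above: frame, id, bb_left, bb_top,
    bb_width, bb_height, followed by confidence and 3D coordinates,
    which are ignored here.
    """
    boxes_per_frame = defaultdict(list)
    with open(gt_path, newline="") as f:
        for row in csv.reader(f):
            frame, box_id = int(row[0]), int(row[1])
            x, y, w, h = (float(v) for v in row[2:6])
            boxes_per_frame[frame].append((box_id, x, y, w, h))
    return boxes_per_frame

# Example usage (the path is illustrative):
# gt = load_ground_truth("TUD-Stadtmitte/gt/gt.txt")
# print(len(gt[1]), "annotated people in frame 1")
```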
Teams may expand their dataset as they see fit. An additional set of videos is made available with bounding box annotations; however, these sequences are not captured from a fixed viewpoint and should not be used for tracking. Additional external data may be incorporated at the team's discretion.
Build a classification dataset by extracting patches from the tracking datasets provided. Extract person patches using the bounding box annotations, and select non-person patches with minimal intersection with the people bounding boxes. Ensure that the same person, regardless of frame, does not appear in both the training and validation sets. Report class distributions (e.g., how many people vs. non-people patches there are) as well as dataset statistics (e.g., aspect ratios, image sizes). (10 points)
Colab Resource: https://colab.research.google.com/drive/1Hk4xDWz xxSmyOZw-GYhCVQmkL8R XLL6#scrollTo=O4ES7d3v0vn3
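One possible way to build the patch dataset is sketched below: person patches are cropped from the annotated boxes, and negative patches are sampled at random locations whose IoU with every person box stays below a threshold. The patch size, IoU threshold, number of negatives per frame, and the use of OpenCV frames are all illustrative assumptions; person_boxes uses the (box_id, x, y, w, h) tuples from the parsing sketch above.

```python
import random
import cv2

def box_iou(a, b):
    """IoU of two boxes given as (x, y, w, h)."""
    ax1, ay1, ax2, ay2 = a[0], a[1], a[0] + a[2], a[1] + a[3]
    bx1, by1, bx2, by2 = b[0], b[1], b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def extract_patches(frame, person_boxes, n_negatives=5, max_iou=0.1,
                    size=(64, 128), max_tries=200):
    """Return (person_patches, negative_patches) cropped from one frame.

    person_boxes holds (box_id, x, y, w, h) tuples; size is (width, height).
    """
    img_h, img_w = frame.shape[:2]
    persons, negatives = [], []
    # Person patches come straight from the annotated boxes, resized to a common size.
    for (_, x, y, w, h) in person_boxes:
        x0, y0 = max(0, int(x)), max(0, int(y))
        crop = frame[y0:int(y + h), x0:int(x + w)]
        if crop.size:
            persons.append(cv2.resize(crop, size))
    # Negative patches are random crops with low overlap with every person box.
    w, h = size
    for _ in range(max_tries):
        if len(negatives) >= n_negatives:
            break
        x = random.randint(0, img_w - w)
        y = random.randint(0, img_h - h)
        if all(box_iou((x, y, w, h), b[1:]) < max_iou for b in person_boxes):
            negatives.append(frame[y:y + h, x:x + w])
    return persons, negatives
```

Keeping track of the person id alongside each patch makes it straightforward to enforce the person-level train/validation split required above.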
4 Classification (35 Points)
The final goal of the classifier will be to classify person and non-person patches in test images. The system will first learn to differentiate these two classes during training/validation (see the K-fold cross-validation description in Section 8.2). To this end, your team is required to extract features from within each image. You are free to use any of the computer vision features discussed in class or others you have found online (e.g., SIFT, SURF, HOG, Bag of Words, etc.). Keep in mind that multiple feature extractors can be combined. That being said, the use of deep learning, including CNNs, is strictly prohibited for this task. Once features are extracted, train a Support Vector Machine (SVM) classifier using your computed features and the ground-truth labels. Optimize the hyperparameters (e.g., number of features, thresholds, SVM kernel, etc.) for both the feature extraction stage and the SVM classifier. Repeat using a different (non-deep-learning) classifier of your choice.
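As one possible non-deep-learning baseline, the sketch below pairs HOG features with an SVM and tunes the kernel and regularization via grid search. The HOG parameters, the grid values, and the variables X and y are illustrative assumptions, not required choices.

```python
import numpy as np
from skimage.feature import hog
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def hog_features(patches):
    """Compute HOG descriptors for a list of equally sized grayscale patches."""
    return np.array([hog(p, orientations=9, pixels_per_cell=(8, 8),
                         cells_per_block=(2, 2)) for p in patches])

# X: HOG feature matrix, y: 1 for person, 0 for non-person (assumed precomputed)
# X, y = hog_features(all_patches), labels
pipeline = make_pipeline(StandardScaler(), SVC())
param_grid = {
    "svc__kernel": ["linear", "rbf"],
    "svc__C": [0.1, 1, 10],
    "svc__gamma": ["scale", 0.01],
}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
# search.fit(X, y)
# print(search.best_params_, search.best_score_)
```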
4.1 Classifier evaluation
Train and evaluate your classifiers using K-fold cross-validation (with K=5); see Sec. 8.2. Report all metrics as specified; a minimal evaluation sketch follows the list below.
1. Average classification accuracy [1] across validations, with the standard deviation. (2.5 points)
2. Average precision [4] and recall [5] across validations. Are these values consistent with the accuracy? Are they more representative of the dataset? In what situations would you expect precision and recall to be a better reflection of model performance than accuracy? (5 points)
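A minimal sketch of this evaluation is shown below. It assumes per-sample group labels (e.g., person/track ids) so that the same person never appears in both training and validation folds, and it reuses the feature matrix X and labels y from the classification sketch above.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, cross_validate

def kfold_report(model, X, y, groups, k=5):
    """Report mean and standard deviation of accuracy, precision, and recall
    over k group-aware folds (groups keep each person in a single fold)."""
    scores = cross_validate(model, X, y, groups=groups,
                            cv=GroupKFold(n_splits=k),
                            scoring=("accuracy", "precision", "recall"))
    for name in ("accuracy", "precision", "recall"):
        vals = scores[f"test_{name}"]
        print(f"{name}: {vals.mean():.3f} +/- {vals.std():.3f}")

# Example (model and data assumed from the previous sketch):
# kfold_report(search.best_estimator_, X, y, groups)
```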
4.2 Classification Grading
For the report, describe your approach to the problem and elaborate on the design decisions. For example, discuss which features were extracted, how were your hyperparameters selected, etc. Include the metrics listed in Section 4.1. How was the data divided, and how did you perform cross-validation? Discuss the methods in detail; your goal is to convince the reader that your approach is performing the way you claim it does and that it will generalize to similar data. The grading for the submission is divided as follows:
1. Explanation of feature extraction method including any pre-processing done on the images (ex. resizing). (5 points)
2. Explanation of how the feature extraction parameters were selected. (5 points)
3. Description of cross-validation method. (5 points)
4. Evaluation of performance and interpretation of results for SVM classifier. (from Section 4.1). (7.5 points)
5. Evaluation of performance and interpretation of results for a different (non-deep learning) classifier of your choice. (from Section 4.1). (7.5 points)
6. Display 5 examples per class with their ground truth class labels and predicted class label from your training images in the report. (5 points)
5 Detection and Localization (25 Points)
The classification dataset was created using image patches from the localization set. The goal for this section of the project is to detect and generate bounding boxes for all instances of foreground objects of interest, namely people, in a frame. This is accomplished through the use of a localizer. Implement a non-deep-learning localizer. The method of localization is left up to the team; however, you may use the classifier you trained in the previous section.
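One non-deep-learning option is a multi-scale sliding window scored by the classifier from Section 4, followed by non-maximum suppression. The sketch below assumes the hog_features and box_iou helpers from the earlier sketches; the window size, step, scales, and thresholds are illustrative.

```python
import cv2

def sliding_window_detect(frame_gray, classifier, win=(64, 128), step=16,
                          scales=(1.0, 1.5, 2.0), score_thresh=0.0):
    """Score windows with the trained classifier; return boxes as (x, y, w, h, score).

    `classifier` is assumed to expose decision_function on HOG features,
    e.g. the SVM pipeline from the classification sketch.
    """
    boxes = []
    for s in scales:
        resized = cv2.resize(frame_gray, None, fx=1.0 / s, fy=1.0 / s)
        H, W = resized.shape[:2]
        for y in range(0, H - win[1] + 1, step):
            for x in range(0, W - win[0] + 1, step):
                patch = resized[y:y + win[1], x:x + win[0]]
                feat = hog_features([patch])           # helper from the earlier sketch
                score = classifier.decision_function(feat)[0]
                if score > score_thresh:
                    # Map the window back to original-image coordinates.
                    boxes.append((x * s, y * s, win[0] * s, win[1] * s, score))
    return non_max_suppression(boxes)

def non_max_suppression(boxes, iou_thresh=0.3):
    """Greedy NMS using the box_iou helper from the dataset sketch."""
    boxes = sorted(boxes, key=lambda b: b[-1], reverse=True)
    kept = []
    for b in boxes:
        if all(box_iou(b[:4], k[:4]) < iou_thresh for k in kept):
            kept.append(b)
    return kept
```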
5.1 Detector and Localizer Evaluation
Evaluate your detector and localizer using K-fold cross-validation. Compute the IoU [3] for the predicted vs. true bounding boxes (for multiple boxes in one image, match the boxes so as to maximize the mean IoU). Report the distribution of IoU coefficients over your validation sets. (5 points)
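A sketch of the matching step is given below. It maximizes the total (and hence mean) IoU with the Hungarian algorithm and, as an assumption, counts unmatched predicted or true boxes as IoU 0; it reuses the box_iou helper from the dataset sketch.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_mean_iou(pred_boxes, true_boxes):
    """Match predicted to true boxes (x, y, w, h) so that the mean IoU is maximized."""
    if not pred_boxes or not true_boxes:
        return 0.0
    # Pairwise IoU matrix between predictions (rows) and ground truth (columns).
    iou = np.array([[box_iou(p, t) for t in true_boxes] for p in pred_boxes])
    rows, cols = linear_sum_assignment(-iou)   # negate to maximize total IoU
    matched = iou[rows, cols]
    # Assumption: unmatched boxes (extra predictions or missed people) count as IoU 0.
    n = max(len(pred_boxes), len(true_boxes))
    return matched.sum() / n
```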
5.2 Detection and Localization Grading
Describe your approach to the problem and the method from the input images to the set of output bounding boxes. Include the metrics listed in Section 5.1, and interpret the results. The grading for the report is divided as follows:
1. Description of the contents of the dataset (e.g., number of samples and bounding box size for each label, contents, etc.) (5 points)
2. Description of detection and localization method. (10 points)
3. Evaluation of detection and localization performance and interpretation of results. (Section 5.1). (10 points)
6 Tracking (30 Points)
Once people are localized, it is possible to track them frame by frame. Your team will implement an object tracker that updates a localization estimate using optical flow. Repeat the process with any other tracking method of your choice.
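As a sketch of the optical-flow variant, the function below tracks corner points inside a box with pyramidal Lucas-Kanade flow and shifts the box by the median point displacement. The feature-detection parameters and the median-shift update rule are assumptions; other flow methods (e.g., dense Farnebäck flow) could be used in the same way.

```python
import numpy as np
import cv2

def track_box_lk(prev_gray, next_gray, box):
    """Shift one (x, y, w, h) box by the median Lucas-Kanade flow of points inside it."""
    x, y, w, h = (int(round(v)) for v in box)
    roi = prev_gray[y:y + h, x:x + w]
    pts = cv2.goodFeaturesToTrack(roi, maxCorners=50, qualityLevel=0.01, minDistance=3)
    if pts is None:
        return box  # no trackable texture inside the box; keep the previous estimate
    # Move the detected points from ROI coordinates into full-frame coordinates.
    p0 = (pts.reshape(-1, 2) + np.array([x, y], dtype=np.float32)).reshape(-1, 1, 2)
    p1, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray,
                                             p0.astype(np.float32), None)
    good = status.reshape(-1) == 1
    if not good.any():
        return box
    shift = np.median(p1[good] - p0[good], axis=0).reshape(-1)
    return (box[0] + float(shift[0]), box[1] + float(shift[1]), box[2], box[3])
```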
6.1 Tracker Evaluation
Select one of the training videos. Localize at least three instances of a person in the initial frames of the video sequence. Track the instances for as many frames as possible or until they have left the frame. Feel free to localize new instances appearing in later frames.
Evaluate your tracker. Compute the mean IoU [3] for the predicted vs. true bounding boxes over the video sequence. Report the distribution of IoU over your validation sets. (5 points)
6.2 Tracking Grading
1. Description of tracking methods (10 points)
2. Evaluation of tracking performance and interpretation of results (Section 6.1). (5 points)
3. Run the algorithm on the test video and annotate bounding boxes for both implementations of the tracker. Submit the videos as additional files separate from the report. (15 points)
7 Bonus Points (10 Points)
In addition to the non-deep-learning framework developed, a bonus of 10 points will be given for deep learning implementations that complete either the classification or the localization task. The details of the implementation are left to the groups, but the goals are the same as in Sections 4.1 and 5.1. The submission should include the following:
1. Schematic of architecture. (3 points)
2. Description of training. (2 points)
3. Evaluation of performance (as described in the relevant task's section). (3 points)
4. Performance comparison to the non-deep-learning method chosen. (2 points)
Figure 2: Schematic illustration of measuring the intersection errors for calculating the intersection over union similarity coefficient (IoU).
8 Appendix
8.1 Definitions
1. Accuracy: The number of correct predictions divided by the total predictions.
2. Cross-Validation: Method of evaluating how well a model will generalize; evaluates how sensitive a model is to the training data. See Section 8.2.
3. IoU: Intersection over union of the predicted and true labels, $\mathrm{IoU}(A, B) = \frac{|A \cap B|}{|A \cup B|}$. See Fig. 2.
4. Precision: True positives divided by the sum of true positives and false positives, $\text{Precision} = \frac{TP}{TP + FP}$. "Of the predictions made for class C, what fraction was correct?"
5. Recall: True positives divided by the sum of true positives and false negatives, $\text{Recall} = \frac{TP}{TP + FN}$. "Of the samples of class C, how many were correctly predicted?"
8.2 Cross-Validation
Cross-validation is the partitioning of the available dataset into two sets, a training set and a validation set. A model is trained on the training set and evaluated on the validation set. The process is repeated on several different partitions of the data, and the model's performance across the partitions indicates its ability to generalize to new datasets.
A widely used method for cross-validation is k-fold cross-validation. First, the available dataset is randomly partitioned into k equal subsets. One subset is selected for validation, and the remaining k-1 subsets are used to train the model. Each subset serves as the validation set exactly once. The performance on the k validation subsets forms a distribution that is used to evaluate the model's ability to generalize.
References
[1] L. Leal-Taixé et al. "MOTChallenge 2015: Towards a Benchmark for Multi-Target Tracking". In: arXiv:1504.01942 [cs] (Apr. 2015). arXiv: 1504.01942. URL: http://arxiv.org/abs/1504.01942.