MSBD50020 Assignment 2 - Knowledge Discovery and Data Mining


2. Attachments should be named in the format A2_itsc_stuid.zip, which includes:
• A2_itsc_stuid_report.pdf/.docx: Please put all your reports in this file. (The report should be an original .pdf or .docx, NOT compressed.)
• A2_itsc_stuid_code.zip: This zip file contains all your source code for the assignment.
• A2_itsc_stuid_Q1_code: a folder that should contain all your source code for Q1.
• A2_itsc_stuid_Q2_code: same as above, for Q2.
4. For the programming language, Python is preferred in principle.
5. Your grade will be based on correctness, efficiency, and clarity.
6. Please check carefully before submitting to avoid multiple submissions.
8. The email for Q&A: hlicg@connect.ust.hk.
(Please read the guidelines carefully)
1 Comparison of Classifiers (60 marks)
In Assignment 2, we apply different classifiers to the same classification task and compare their performance.
1.1 Data Description
We use the letter dataset from Statlog. The statistics of the dataset are shown in Table 1. The dataset has 26 classes.
        Size    Features
Train   15000   16
Test    5000    16
Table 1: Data Statistics
1.2 Comparison of Classifiers
You are required to implement the following classifiers and compare the performance achieved by different classifiers.
• Decision Tree
You should build decision trees on the dataset using both the entropy and Gini criteria. For each criterion, set the maximum depth to 5, 10, 15, 20, and 25 separately. You need to compare the performance (accuracy, precision, recall, F1 score, and training time) and give a brief discussion. (30 marks)
• KNN, Random Forest
Apply two more classifiers, KNN and Random Forest, on the dataset. For each classifier, evaluate the performance (accuracy, precision, recall, F1 score, and training time). You are required to compare the performance of the different classifiers and give a brief discussion. (30 marks)
In your report, for each sub-question, you are required to provide a table (just like Table 2 as shown below) followed by a brief discussion.
Classifier Accuracy Precision Recall F1 Training Time

Table 2: Example of summary table
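A minimal sketch of how the decision-tree comparison in Section 1.2 could be run and summarized into a table like Table 2, assuming scikit-learn is used (which the notes below permit). The synthetic data from make_classification is only a stand-in for the real letter dataset, which is not reproduced here:

```python
import time
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

# Placeholder data standing in for the Statlog letter dataset
# (replace with the real 16-feature / 26-class train-test split).
X, y = make_classification(n_samples=2000, n_features=16,
                           n_informative=10, n_classes=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

rows = []
for criterion in ("entropy", "gini"):
    for depth in (5, 10, 15, 20, 25):
        clf = DecisionTreeClassifier(criterion=criterion,
                                     max_depth=depth, random_state=0)
        start = time.perf_counter()
        clf.fit(X_train, y_train)          # time only the fit
        train_time = time.perf_counter() - start
        pred = clf.predict(X_test)
        rows.append({
            "criterion": criterion,
            "depth": depth,
            "accuracy": accuracy_score(y_test, pred),
            # macro-average across classes for the multi-class metrics
            "precision": precision_score(y_test, pred, average="macro"),
            "recall": recall_score(y_test, pred, average="macro"),
            "f1": f1_score(y_test, pred, average="macro"),
            "time": train_time,
        })

for r in rows:
    print(r)
```

Each entry in rows corresponds to one line of the summary table (10 lines: 2 criteria x 5 depths).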
1.3 Note
• In this question, you are FREE to use different Python libraries for implementation.
• The problem in this assignment is multi-class classification. The classification metrics (accuracy, precision, recall, F1 score) should be averaged results over the entire test set.
• For classifiers without specified parameters (like KNN, Random Forest), you are free to adjust parameters by yourself.
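The KNN and Random Forest comparison can be sketched the same way. Again assuming scikit-learn and placeholder data; the parameter choices n_neighbors=5 and n_estimators=100 are illustrative defaults, not values required by the assignment:

```python
import time
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Placeholder data standing in for the letter dataset.
X, y = make_classification(n_samples=2000, n_features=16,
                           n_informative=10, n_classes=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

classifiers = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {}
for name, clf in classifiers.items():
    start = time.perf_counter()
    clf.fit(X_train, y_train)
    elapsed = time.perf_counter() - start
    pred = clf.predict(X_test)
    # Macro-average precision/recall/F1 over all classes.
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_test, pred, average="macro")
    results[name] = (accuracy_score(y_test, pred), prec, rec, f1, elapsed)

for name, (acc, prec, rec, f1, t) in results.items():
    print(f"{name}: acc={acc:.3f} prec={prec:.3f} "
          f"rec={rec:.3f} f1={f1:.3f} time={t:.3f}s")
```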
2 Implementation of Adaboost (40 marks)
2.1 Data Description
Table 3 shows the training dataset. It consists of 10 data points, each with a binary label (+1 or -1).
# 1 2 3 4 5 6 7 8 9 10
x 0 1 2 3 4 5 6 7 8 9
y 1 1 1 -1 -1 -1 1 1 1 -1
Table 3: Training Dataset
2.2 Implementation
We assume each weak classifier has the form x < v or x > v, where v is a threshold chosen so that the classifier achieves the best accuracy on the (weighted) dataset. You should implement the AdaBoost algorithm to learn a strong classifier.
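One possible minimal sketch of this procedure in plain Python. The threshold grid, the stump direction convention, and the round count T=3 are assumptions made for illustration, not part of the specification:

```python
import math

# Training data from Table 3.
X = list(range(10))
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

def stump_predict(x, v, direction):
    """direction=+1: predict +1 when x < v; direction=-1: predict +1 when x >= v."""
    return direction if x < v else -direction

def best_stump(w):
    """Scan thresholds (multiples of 0.5) and both directions for the
    stump with minimum weighted error."""
    best = None
    for v in [i * 0.5 for i in range(-1, 21)]:  # -0.5, 0.0, ..., 10.0
        for direction in (1, -1):
            err = sum(wi for xi, yi, wi in zip(X, y, w)
                      if stump_predict(xi, v, direction) != yi)
            if best is None or err < best[0]:
                best = (err, v, direction)
    return best

def adaboost(T=3):
    n = len(X)
    w = [1.0 / n] * n            # uniform initial weights
    ensemble = []                # list of (alpha, v, direction)
    for _ in range(T):
        err, v, direction = best_stump(w)
        if err == 0:             # perfect stump: stop early
            ensemble.append((1.0, v, direction))
            break
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, v, direction))
        # Reweight: increase weight on misclassified points, then normalize.
        w = [wi * math.exp(-alpha * yi * stump_predict(xi, v, direction))
             for xi, yi, wi in zip(X, y, w)]
        z = sum(w)
        w = [wi / z for wi in w]
    return ensemble

def strong_predict(ensemble, x):
    s = sum(a * stump_predict(x, v, d) for a, v, d in ensemble)
    return 1 if s >= 0 else -1

ensemble = adaboost(T=3)
for a, v, d in ensemble:
    rule = f"+1 if x < {v} else -1" if d == 1 else f"-1 if x < {v} else +1"
    print(f"alpha={a:.4f}, stump: {rule}")

train_acc = sum(strong_predict(ensemble, xi) == yi
                for xi, yi in zip(X, y)) / len(X)
print("training accuracy:", train_acc)
```

On this dataset, three rounds already give a strong classifier that fits the training set exactly; the printed alphas and thresholds provide the coefficients for the final expression C*(x) = sign[alpha1 C1(x) + alpha2 C2(x) + alpha3 C3(x)] asked for in the notes below.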
2.3 Note
• You are NOT allowed to use any AdaBoost library. You need to implement the algorithm manually and submit your code.
• You should also report the final expression of the strong classifier, such as C∗(x) = sign[α1C1(x) + α2C2(x) + α3C3(x) + ...], where Ci(x) is the i-th base classifier and αi is its weight. You are also required to describe each base classifier in detail.
• For simplicity, the threshold v should be a multiple of 0.5, i.e., v % 0.5 == 0. For example, you can set v as 2, 2.5, or 3, but you cannot set v as 2.1.
• sign function: https://en.wikipedia.org/wiki/Sign_function