CSE472 - Assignment 1 - Logistic Regression and AdaBoost for Classification

Machine Learning Sessional


 Introduction 
In ensemble learning, we combine decisions from multiple weak learners to solve a classification problem. In this assignment, you will implement a Logistic Regression (LR) classifier and use it as the weak learner within the AdaBoost algorithm. For any query about this document,

 

Programming Language/Platform 

 Python 3 [Hard requirement]

 

Dataset preprocessing 
You need to demonstrate the performance and efficiency of your implementation on the following three datasets.

1.      https://www.kaggle.com/blastchar/telco-customer-churn 

2.      https://archive.ics.uci.edu/ml/datasets/adult 

3.      https://www.kaggle.com/mlg-ulb/creditcardfraud 

They differ in size, number and types of attributes, data quality (missing attribute values), and data description (whether train and test data come separately, attribute description format, etc.). Your core implementation of both the LR and AdaBoost models must work for all three datasets without any modification. You can (and probably will need to) add a separate dataset-specific preprocessing script/module/function to feed your learning engine a standardized data file in matrix format. On the day of submission, you are likely to be given another new (hopefully smaller) dataset for which you need to create a preprocessor. Any lack of understanding of your own code will severely hinder your chances of pulling that off. Here are some suggestions:

1.      Design and develop your own code. You can take help from the tons of material available on the web, but do it yourself; this is the only way to make sure you know every subtle issue that may need to be tweaked during customization.

2.      Don’t assume anything about your dataset. Keep an open mind. Deal with their subtleties in preprocessing.

3.      To get an idea of different data preprocessing tasks and techniques, specifically how to handle missing values and how to handle numeric features using information gain [AIMA 3rd ed., 18.3.6], visit http://www.cs.ccsu.edu/~markov/ccsu_courses/DataMining-3.html

4.      Use Python library functions for common preprocessing tasks such as normalization, binarization, discretization, imputation, encoding categorical features, scaling etc. This will make your life easier and you will thank us for enforcing Python implementation. Visit http://scikit-learn.org/stable/modules/preprocessing.html for more information.

5.      Go through the dataset description given in the link carefully. Misunderstanding will lead to incorrect preprocessing.  

6.      For the third dataset, don't worry if your implementation takes a long time. You can use a smaller subset (randomly selected 20000 negative samples + all positive samples) of that dataset for demonstration purposes. Do not exclude any positive sample, as they are scarce.

7.      Split your preprocessed datasets into 80% training and 20% test data whenever a dataset is not already split. All learning should use only the training data; the test data should only be used for performance measurement. You can use scikit-learn's built-in function for the train-test split. See https://developers.google.com/machine-learning/crash-course/training-and-test-sets/splitting-data for splitting guidelines. A minimal preprocessing and splitting sketch follows this list.
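The following is only a sketch of one way to combine these steps, under assumptions that are not part of the assignment: the raw data has already been loaded into a pandas DataFrame df, the binary label column is named label_col, and the first label value encountered is treated as the positive class. Dataset-specific quirks still need their own handling.

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

SEED = 42  # constant seed so every run produces the same output


def preprocess_and_split(df, label_col):
    # Map the binary label to +1 / -1 (the boosting code expects signed labels).
    # Which value counts as "positive" is dataset-specific; adjust per dataset.
    positive = df[label_col].unique()[0]
    y = np.where(df[label_col] == positive, 1.0, -1.0)
    X = df.drop(columns=[label_col])

    num_cols = X.select_dtypes(include=np.number).columns
    cat_cols = X.columns.difference(num_cols)

    # Impute missing numeric values with the median, then standardize.
    # (Strictly, the imputer/scaler should be fit on the training split only;
    # kept simple here.)
    X[num_cols] = SimpleImputer(strategy="median").fit_transform(X[num_cols])
    X[num_cols] = StandardScaler().fit_transform(X[num_cols])

    # One-hot encode categorical attributes; missing categories get their own column.
    X = pd.get_dummies(X, columns=list(cat_cols), dummy_na=True)

    # 80/20 stratified split, used only when the dataset is not already split.
    return train_test_split(X.to_numpy(dtype=float), y,
                            test_size=0.2, random_state=SEED, stratify=y)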

 

Logistic Regression Tweaks for weak learning 
1.      Use information gain to evaluate attribute importance in order to use a subset of features.

2.      Control the number of features using an external parameter.

3.      Terminate Gradient Descent early if the error on the training set becomes < 0.5. Parameterize your function to take this threshold as an input. (If it is set to 0, Gradient Descent will run its natural course, without early stopping.)

4.      Use the tanh function (instead of the sigmoid). You need to calculate the gradient and derive the update rule accordingly; one possible derivation and a weak-learner sketch are given after this list.
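One possible derivation for point 4, assuming a squared-error loss (the assignment only asks you to derive the rule for whatever loss you pick). With labels y in {-1, +1} and a learning rate alpha:

    h_w(x) = tanh(w . x)
    L(w)   = 1/2 * (y - h_w(x))^2
    dL/dw  = -(y - h_w(x)) * (1 - h_w(x)^2) * x        since d tanh(z)/dz = 1 - tanh(z)^2
    w     <- w + alpha * (y - h_w(x)) * (1 - h_w(x)^2) * x

A minimal weak-learner sketch in Python combining points 1-4; the class name, the random feature subset (an information-gain ranking would replace it), and the hyperparameter defaults are all placeholders, not part of the specification.

import numpy as np

class TanhLogisticRegression:
    """Weak learner: linear model squashed through tanh, trained by batch gradient descent."""

    def __init__(self, n_features=None, alpha=0.1, n_iters=500, error_threshold=0.5, seed=0):
        self.n_features = n_features            # size of the feature subset (None = all features)
        self.alpha = alpha                      # learning rate
        self.n_iters = n_iters
        self.error_threshold = error_threshold  # 0 disables early stopping (point 3)
        self.rng = np.random.default_rng(seed)

    def fit(self, X, y):
        d = X.shape[1]
        k = d if self.n_features is None else min(self.n_features, d)
        # Placeholder feature subset; rank attributes by information gain instead (point 1).
        self.cols = self.rng.choice(d, size=k, replace=False)
        Xb = np.c_[np.ones(len(X)), X[:, self.cols]]  # prepend a bias column
        self.w = np.zeros(Xb.shape[1])
        for _ in range(self.n_iters):
            h = np.tanh(Xb @ self.w)
            # Gradient of 1/2 * mean((y - h)^2), using d tanh(z)/dz = 1 - tanh(z)^2
            grad = -((y - h) * (1.0 - h ** 2)) @ Xb / len(y)
            self.w -= self.alpha * grad
            # Early stopping on the training misclassification rate (point 3)
            error = np.mean(np.where(np.tanh(Xb @ self.w) >= 0, 1.0, -1.0) != y)
            if self.error_threshold > 0 and error < self.error_threshold:
                break
        return self

    def predict(self, X):
        Xb = np.c_[np.ones(len(X)), X[:, self.cols]]
        return np.where(np.tanh(Xb @ self.w) >= 0, 1.0, -1.0)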

  

 AdaBoost implementation 
1.      Use the following pseudo-code for your AdaBoost implementation.

 

2.      Use logistic regression as the weak/base learner. You can explore different ways to speed up the learning of the base models, sacrificing some accuracy, as long as the learner still performs better than random guessing (i.e., it remains a weak learner). For example, you can use a small subset of features or reduce the number of gradient descent iterations. You can come up with your own novel idea too.

3.      AdaBoost should treat the base learner as a black box (here, a weak logistic regression learner rather than the usual decision stump) and communicate with it via a generic interface that takes resampled data as input and returns a classifier.

4.      In each round, resample from the training data and fit the current hypothesis (a linear classifier) on the resampled data, but calculate its error over the original (weighted) training data.

5.      Use +1 for a positive decision and -1 for a negative decision, so that the sign of your combined majority hypothesis indicates the decision.

6.      After learning the ensemble classifier, evaluate its performance on the test data. Don't get confused about which dataset to use at which step. A sketch of the boosting loop is given after this list.
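Since the pseudo-code figure is not reproduced in this document, the following is only a sketch of one common formulation of the resampling-based loop described above; the function name adaboost, the learner factory make_learner, and the exact weight-update formulas are assumptions (AIMA's version uses an equivalent z = log((1 - err)/err) hypothesis weight). The weak learner is treated as a black box exposing fit(X, y) and predict(X) with +1/-1 outputs, e.g., the tanh logistic regression sketched earlier.

import numpy as np

def adaboost(X, y, make_learner, K, seed=0):
    """y holds +1/-1 labels; make_learner() returns a fresh, unfitted weak learner."""
    rng = np.random.default_rng(seed)
    n = len(y)
    w = np.full(n, 1.0 / n)                  # weights over the ORIGINAL training examples
    learners, alphas = [], []
    for _ in range(K):
        # Resample the training data according to the current weights...
        idx = rng.choice(n, size=n, replace=True, p=w)
        h = make_learner().fit(X[idx], y[idx])
        # ...but measure the weighted error on the original training data (point 4).
        pred = h.predict(X)
        err = np.sum(w[pred != y])
        if err >= 0.5:                       # no better than random guessing: skip this round
            continue
        err = max(err, 1e-10)                # guard against a perfect learner
        alpha = 0.5 * np.log((1.0 - err) / err)
        # Increase the weight of misclassified examples, decrease the rest, renormalize.
        w *= np.exp(-alpha * y * pred)
        w /= w.sum()
        learners.append(h)
        alphas.append(alpha)

    def predict(X_new):
        # Sign of the weighted majority vote gives the final +1/-1 decision (point 5).
        votes = sum(a * h.predict(X_new) for a, h in zip(alphas, learners))
        return np.where(votes >= 0, 1.0, -1.0)

    return predict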

 

Performance evaluation 
1.      Always use a constant seed for any random number generation so that each run produces the same output.

2.      Report the following performance measures of your logistic regression implementation on both training and test data for each of the three datasets. Use the following table format for each dataset.

Performance measure                                 | Training | Test
----------------------------------------------------|----------|-----
Accuracy                                            |          |
True positive rate (sensitivity, recall, hit rate)  |          |
True negative rate (specificity)                    |          |
Positive predictive value (precision)               |          |
False discovery rate                                |          |
F1 score                                            |          |

3.      Report the accuracy of your AdaBoost implementation with logistic regression (K = 5, 10, 15, and 20 boosting rounds) on both training and test data for each of the three datasets.

Number of boosting rounds | Training | Test
--------------------------|----------|-----
5                         |          |
10                        |          |
15                        |          |
20                        |          |
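All of the measures in the first table can be computed from the 2x2 confusion matrix (scikit-learn's sklearn.metrics functions are an alternative). A minimal sketch assuming +1/-1 labels and predictions; the function name is a placeholder.

import numpy as np

def report_metrics(y_true, y_pred):
    # Confusion-matrix counts, treating +1 as the positive class.
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == -1) & (y_true == -1))
    fp = np.sum((y_pred == 1) & (y_true == -1))
    fn = np.sum((y_pred == -1) & (y_true == 1))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "Accuracy": (tp + tn) / (tp + tn + fp + fn),
        "True positive rate": recall,            # sensitivity / recall / hit rate
        "True negative rate": tn / (tn + fp),    # specificity
        "Positive predictive value": precision,
        "False discovery rate": fp / (fp + tp),  # = 1 - precision
        "F1 score": 2 * precision * recall / (precision + recall),
    }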
