In this assignment, you are asked to implement a Logistic Regression classifier and a Decision Tree classifier. You will also use scikit-learn to test out these classifiers.
Task 1: Dataset Generation [5 + 10 = 15]
Download winequality-red.csv from this link. It contains various chemical properties of red wine samples, along with the quality of the wine. We want to train classifiers to predict the quality.
We shall create two modified datasets from this data; a preprocessing sketch follows the list below.
A. Convert all the values in the quality attribute to 0 (bad) if the value is less than or equal to 6, and to 1 (good) otherwise. Normalize all the other attributes to the range [0, 1] by min-max scaling. Use this dataset (dataset A) for Logistic Regression.
B. Convert all the values in the quality attribute to 0 (bad) if the value is less than 5, to 1 (good) if the value is 5 or 6, and to 2 (great) otherwise. Normalize all the other attributes by Z-score normalization, then discretize each normalized attribute into 4 equally spaced bins labeled 0 to 3, replacing each value with the label of the bin it falls into.
For example, suppose after normalization an attribute has values in [-0.5, 1.5], i.e., the minimum value of the attribute is -0.5 and the maximum is 1.5. Then form 4 bins:
bin 0: [-0.5, 0.0), bin 1: [0.0, 0.5), bin 2: [0.5, 1.0), bin 3: [1.0, 1.5].
Now, if a data instance has a value of 0.73 for that attribute, replace 0.73 with 2, since 0.73 falls in bin 2.
Use this dataset (dataset B) for Decision Tree.
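
Below is a minimal preprocessing sketch for both datasets, assuming pandas is available and the CSV follows the usual semicolon-separated format with a 'quality' column (adjust sep if your copy is comma-separated). The names df, A, and B are illustrative, not required.

import pandas as pd

# Load the raw data; the standard copy of this file is semicolon-separated.
df = pd.read_csv("winequality-red.csv", sep=";")
features = df.columns.drop("quality")

# Dataset A: binary label (quality > 6 -> 1, else 0), min-max scaling to [0, 1].
A = df.copy()
A["quality"] = (df["quality"] > 6).astype(int)
A[features] = (df[features] - df[features].min()) / (df[features].max() - df[features].min())

# Dataset B: three-class label (<5 -> 0, 5 or 6 -> 1, >6 -> 2),
# Z-score normalization, then 4 equal-width bins labeled 0-3 per attribute.
B = df.copy()
B["quality"] = pd.cut(df["quality"], bins=[float("-inf"), 4, 6, float("inf")],
                      labels=[0, 1, 2]).astype(int)
z = (df[features] - df[features].mean()) / df[features].std()
for col in features:
    B[col] = pd.cut(z[col], bins=4, labels=[0, 1, 2, 3]).astype(int)
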
Task 2: Logistic Regression [25 + 5 + 5 = 35]
Use dataset A for this part.
1. Implement a standard Logistic Regression Classifier as discussed in class. Do NOT use scikit-learn for this part.
2. Test out the Logistic Regression implementation from the scikit-learn package, using the saga solver and no regularization penalty.
3. Cross-validate both classifiers with 3-fold cross-validation and print the mean accuracy, precision, and recall for class 1 (good) for both classifiers. You may or may not use the scikit-learn implementations for computing these metrics and for cross-validation.
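
For steps 2 and 3, here is a minimal sketch of the scikit-learn classifier and the 3-fold cross-validation loop, assuming dataset A is in a DataFrame named A as in the earlier sketch (variable names are illustrative). Note that penalty=None requires scikit-learn >= 1.2; on older versions use penalty="none" instead.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, precision_score, recall_score

X = A.drop(columns="quality").to_numpy()
y = A["quality"].to_numpy()

# saga solver with no regularization penalty, as the task specifies.
clf = LogisticRegression(solver="saga", penalty=None, max_iter=5000)

accs, precs, recs = [], [], []
for tr, te in StratifiedKFold(n_splits=3, shuffle=True, random_state=0).split(X, y):
    clf.fit(X[tr], y[tr])
    pred = clf.predict(X[te])
    accs.append(accuracy_score(y[te], pred))
    precs.append(precision_score(y[te], pred, pos_label=1))  # precision for class 1 (good)
    recs.append(recall_score(y[te], pred, pos_label=1))      # recall for class 1 (good)

print(np.mean(accs), np.mean(precs), np.mean(recs))

The same loop can be reused for your own implementation by swapping in its fit and predict calls.
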
Task 3: Decision Tree [35 + 5 + 10 = 50]
Use dataset B for this part.
1. Implement the standard ID3 decision tree algorithm as discussed in class, using information gain to choose which attribute to split on at each node. Stop splitting a node if it has fewer than 10 data points. Do NOT use scikit-learn for this part.
2. Test out the Decision Tree Classifier implementation from the scikit-learn package, using information gain as the split criterion. Here also, stop splitting a node if it has fewer than 10 data points.
3. Cross-validate both classifiers with 3-fold cross-validation and print the mean accuracy, macro precision, and macro recall for both classifiers. You may or may not use the scikit-learn implementations for computing these metrics and for cross-validation.
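
For steps 2 and 3, here is a minimal sketch with the scikit-learn tree, assuming dataset B is in a DataFrame named B as in the earlier sketch (names are illustrative). Setting criterion="entropy" makes the tree split by information gain, and min_samples_split=10 prevents splitting any node with fewer than 10 data points.

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, precision_score, recall_score

X = B.drop(columns="quality").to_numpy()
y = B["quality"].to_numpy()

# Entropy criterion = information-gain splits; nodes with < 10 samples are leaves.
tree = DecisionTreeClassifier(criterion="entropy", min_samples_split=10, random_state=0)

accs, precs, recs = [], [], []
for tr, te in StratifiedKFold(n_splits=3, shuffle=True, random_state=0).split(X, y):
    tree.fit(X[tr], y[tr])
    pred = tree.predict(X[te])
    accs.append(accuracy_score(y[te], pred))
    precs.append(precision_score(y[te], pred, average="macro", zero_division=0))
    recs.append(recall_score(y[te], pred, average="macro", zero_division=0))

print(np.mean(accs), np.mean(precs), np.mean(recs))
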