$30
1 Decision Trees
Implement a function in python:
DT_train_binary(X,Y,max_depth)
that takes training data as input. The labels and the features are binary, but the feature vectors can be of any finite dimension. The training feature data (X) should be structured as a 2D numpy array, with each row corresponding to a single sample. The training labels (Y) should be structured as a 1D numpy array, with each element corresponding to a single label. Y should have the same number of elements as X has rows. max depth is an integer that indicates the maximum depth for the resulting decision tree. DT train binary(X,Y,max depth) should return the decision tree generated using information gain, limited by some given maximum depth. If max depth is set to -1 then learning only stops when we run out of features or our information gain is 0. You may store a decision tree however you would like, i.e. a list of lists, a class, a dictionary, etc. Write-Up: describe your implementation concisely. Bulleted lists are okay!
Binary data for testing can be found in in data 1.txt and data 2.txt.
Implement a function in python:
DT_test_binary(X,Y,DT)
that takes test data X and test labels Y and a learned decision tree model DT, and returns the accuracy (from 0 to 1) on the test data using the decision tree for predictions. Write-Up: describe your implementation concisely.
1
Implement the following function in python:
DT_make_prediction(x,DT)
This function should take a single sample and a trained decision tree and return a single classification. The output should be a scalar value. Write-Up: describe your implementation concisely.
622 Implement the following functions in python:
DT_train_real(X,Y,max_depth)
DT_test_real(X,Y,DT)
These functions are defined similarly to those above except that the features are now real values. The labels are still binary. Your decision tree will need to use questions with inequalities: , ≥, <, ≤. Write-Up: describe your implementation concisely. Real-valued data for testing is provided in data 3.txt
2 Nearest Neighbors
Implement a function in python:
KNN_test(X_train,Y_train,X_test,Y_test,K)
that takes training data, test data, and K as inputs. KNN test(X train,Y train,X test,Y test,K) should return the accuracy on the test data. The training data and test data should have the same format as described earlier for the decision tree problems. Your function should be able to handle any dimension feature vector, with real-valued features. Remember your labels are binary, and they should be −1 for the negative class and 1 for the positive class. data 4.txt has some test data for your KNN algorithm. Try K = 1, K = 3, and K = 5. Write-Up: describe your implementation concisely.
622 Implement the following function in python:
choose_K(X_train,Y_train,X_val,Y_val)
that takes training data and validation data as inputs and returns a K value. This function must iterate through all possible K values and choose the best K for the given training data and validation data. K must be chosen in order to achieve the best accuracy on the validation data. Write-Up: describe your implementation concisely.
3 Clustering
Implement a function in python:
K_Means(X,K,mu)
that takes feature vectors X and a K value as input and returns a numpy array of cluster centers C. Your function should be able to handle any dimension of feature vectors and any K 0. mu is an array of initial cluster centers, with either K or 0 rows. If mu is empty, then you must initialize the cluster centers randomly. Otherwise, start with the given cluster centers. Write-Up: describe your implementation concisely. data 5.txt and data 6.txt have some sample data for you to test out your implementation.
622 Implement the following function in python:
K_Means_better(X,K)
that takes feature vectors X and a K value as input and returns a numpy array of cluster centers C. Your function should be able to handle any dimension of feature vectors and any K 0. Your function will run the above-implemented K Means(X,K,[]) function many times until the same set of cluster centers are returned a majority of the time. At this point, you will know that those cluster centers are likely the best ones. K Means better(X,K) will return those cluster centers. Write-Up: describe your implementation concisely.