This assignment asks you to develop, evaluate and compare models for the prediction of proteins that interact with nucleic acids using a provided dataset.
Dataset
The dataset (the dataset_a3.csv file) is provided in a text-based, comma-separated format in which each protein is represented by 8 numeric features and 1 symbolic outcome. The outcome feature (called “Class”) annotates each protein as Yes (interacting with nucleic acids) or No (non-interacting). The dataset includes 8795 proteins, with 936 labeled Yes and 7859 labeled No.
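As a quick sanity check on the class balance, the counts stated above can be turned into proportions. The sketch below is plain Python and is not part of the required RapidMiner workflow; the variable names are illustrative. It also computes the accuracy that a trivial model answering No for every protein would reach, which may serve as a reference point when judging reported accuracies.

```python
# Class counts as stated in the dataset description.
n_yes = 936
n_no = 7859
n_total = n_yes + n_no  # 8795 proteins in total

# Share of each class, in percent.
pct_yes = 100 * n_yes / n_total
pct_no = 100 * n_no / n_total

# Accuracy of a trivial model that always predicts the majority class (No).
baseline_accuracy = pct_no

print(f"Yes: {pct_yes:.2f}%  No: {pct_no:.2f}%")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.2f}%")
```

The classes are clearly imbalanced: roughly one protein in ten is labeled Yes.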
Development of predictive models
You are required to compute models using RapidMiner Studio version 9.3 (or higher) with four different algorithms. Three of these four algorithms must be Decision Tree, SVM and Naïve Bayes. You may choose any other predictive algorithm as the fourth method. You should parametrize each of these algorithms (select the best possible combination of parameter values), to the best of your ability, in order to maximize predictive performance. Note that you will need to read, make an educated guess, and/or use a trial-and-error approach to figure out which parameters make a difference and how to use them. Do not use the “advanced parameters”. Do not attempt to sample the dataset, i.e., do not perform feature or sample/object selection.
Evaluation and comparison of predictive models
You must evaluate the predictive performance using accuracy (“% of correctly classified instances”). For each algorithm you must perform three types of tests:
- on the entire dataset (“use training dataset”)
- on 50% of the dataset, using the other 50% to compute the model (“percentage split”)
- using 5-fold cross-validation
The 5-fold cross-validation divides the dataset at random into 5 equal-size subsets, where one subset is used to test the model and the remaining four are used to compute the prediction model. This is repeated 5 times, each time using a different subset as the test set. Consequently, every protein in the dataset is predicted exactly once. This test type is implemented in RapidMiner Studio with the “Cross Validation” operator, where the number of folds is set to 5.
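The fold mechanics described above can be sketched in a few lines of plain Python. This is only an illustration of the partitioning idea, not RapidMiner's implementation; the "Cross Validation" operator may shuffle or stratify differently. The helper name k_fold_indices and the seed value are assumptions made for this example.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Split indices 0..n_samples-1 at random into k near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, starting at position i.
    return [indices[i::k] for i in range(k)]

# 8795 proteins, 5 folds (hypothetical helper, fixed seed for repeatability).
folds = k_fold_indices(8795, 5)

tested = []
for i, test_fold in enumerate(folds):
    # The other four folds together form the training set for this round.
    train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
    tested.extend(test_fold)
    print(f"round {i + 1}: train={len(train)}, test={len(test_fold)}")

# Across the five rounds, every protein appears in a test set exactly once.
assert sorted(tested) == list(range(8795))
```

Because 8795 is divisible by 5, each round here trains on 7036 proteins and tests on the remaining 1759.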
Deliverables
1. List and briefly describe the methods that you used (one sentence per method). Provide a list of key parameters for each method, i.e., parameters that allowed you to improve results when compared with the default parameter values. The key parameters could/should be a subset of all available parameters.
2. Using the table shown below, report the accuracies for the four algorithms and the three test types. The accuracy values must be reported with two digits after the decimal point, e.g., 91.05. You must include the accuracies of the models that use the default parameters and the best selected parameters. In total, you have 4*3*2 = 24 results to report. List the best selected values of parameters for each model and each test type.
3. Briefly explain which of the three types of the tests would be appropriate to give the most reliable estimate of predictive performance, i.e., the performance that a user of your model should expect to observe on new proteins that were not included in the provided dataset.
4. Discuss whether trying multiple algorithms and adjusting their parameters helped you in developing a more accurate predictive model. If yes, then comment on whether the amount of improvement justifies the amount of effort. Make sure that you rely on the most appropriate test results (see question 3) when answering this question.
5. Discuss whether the accuracy of your most accurate model is sufficient for practical purposes. Justify your answer.
6. Give the “confusion matrix” for the most accurate result among the cross-validation experiments (selected from the 8 corresponding experiments). Use this matrix to explain whether this predictor would be suitable to identify proteins that interact with nucleic acids (Class = Yes), proteins that do not interact with nucleic acids (Class = No), or both types of proteins.
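To see how a confusion matrix separates performance on the two classes, the sketch below computes overall accuracy alongside the fraction of each class that is predicted correctly. It is plain Python, independent of RapidMiner, and the function name per_class_rates is an assumption made for this example. The counts used are not results from any trained model; they describe a degenerate predictor that answers No for every protein in the dataset.

```python
def per_class_rates(tp, fn, fp, tn):
    """Return accuracy and per-class recall (in percent) from confusion-matrix counts.

    tp: Yes predicted as Yes, fn: Yes predicted as No,
    fp: No predicted as Yes,  tn: No predicted as No.
    """
    total = tp + fn + fp + tn
    return {
        "accuracy": 100 * (tp + tn) / total,
        "recall_yes": 100 * tp / (tp + fn),
        "recall_no": 100 * tn / (tn + fp),
    }

# Illustrative case only (an assumption, not a measured result):
# a predictor that always answers No on the 936 Yes / 7859 No proteins.
rates = per_class_rates(tp=0, fn=936, fp=0, tn=7859)
for name, value in rates.items():
    print(f"{name}: {value:.2f}")
```

In this degenerate case the accuracy is high even though no interacting protein is identified, which is why the per-class rows of the matrix matter when answering this deliverable.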
NOTES
- Use a separate, clearly marked paragraph for each of the six deliverables.
- The table from the second deliverable must be in the following format; for your convenience, this table is provided in Word (.docx) format on Blackboard. Example values are in green.
| Reported information | Test type | Decision Tree | SVM | Naïve Bayes |
|---|---|---|---|---|
| Accuracy with default parameters | Entire dataset | 12.34 | | |
| | 50% | 23.45 | | |
| | Cross-validation | 34.56 | | |
| Accuracy with best parameters | Entire dataset | 45.67 | | |
| | 50% | 56.78 | | |
| | Cross-validation | 67.89 | | |
| List names of parameters | | maximal depth; apply pruning; …; criterion | | |
| List selected best values of parameters (in the same order as in the list of names) | Entire dataset | 10; True; …; gain_ratio | | |
| | 50% | 13; True; …; gain_ratio | | |
| | Cross-validation | -1; False; …; information_gain | | |