CMSC435 Project: Protein Prediction Model

The project asks you to develop, evaluate and compare models for the prediction of proteins that interact with DNA and RNA using a provided dataset. Your model must classify a given protein sequence into one of four outcomes, i.e., interacts with DNA (DNA), interacts with RNA (RNA), interacts with both DNA and RNA (DRNA), and does not interact with DNA or RNA (nonDRNA). Although each group will solve the same task, the corresponding designs should be unique, i.e., collaboration between groups is not allowed. 

 

Datasets
Two datasets are/will be provided:

- A training dataset (a text file) that includes 391 DNA proteins, 523 RNA proteins, 22 DRNA proteins, and 7859 nonDRNA proteins, for a total of 8795 proteins.
- A blind test dataset (sequences_test.txt) that includes 8795 proteins, with similar proportions between the four classes of proteins. This is an independent test set, which means that the entire design procedure (including feature generation, feature selection, parameterization and selection of classifiers, etc.) must be completed using only the training dataset. The test dataset should be used to evaluate your system only once. It will be posted on the class website 2 days before the project submission deadline and will not include the annotation of the outcomes. You will have to predict the outcomes, and the instructor will process and assess these predictions.
The training dataset is provided in comma-separated format, where each protein is represented by:

- the amino acid sequence
- the class, encoded as DNA, RNA, DRNA, or nonDRNA

The test dataset will be in the same format as the training dataset, except that the outcomes will not be provided.

 

Evaluation of Predictions
You are required to perform 5-fold cross validation when using the training dataset. This cross validation divides the training dataset into 5 random, equal-size subsets, where one subset is used to test the prediction model and the remaining four are used to train/develop the prediction model; this is repeated 5 times, each time using a different subset as the test set. Consequently, this procedure produces a prediction for every sequence in the training dataset. This test procedure is supported by RapidMiner.
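For groups that prototype outside of RapidMiner, the sketch below illustrates the same protocol in Python with scikit-learn. The placeholder data, the stratified fold split, and the choice of classifier are illustrative assumptions, not requirements of the assignment.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Placeholder data standing in for the real feature matrix and labels; in the
# actual project these come from your own feature generation step.
rng = np.random.default_rng(0)
X = rng.random((100, 20))                               # 100 proteins, 20 features each
y = rng.choice(["DNA", "RNA", "DRNA", "nonDRNA"], 100)  # class labels

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = RandomForestClassifier(random_state=0)

# One out-of-fold prediction per protein: each protein is predicted by a model
# trained on the other four folds, matching the assignment's protocol.
y_pred = cross_val_predict(model, X, y, cv=cv)
```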

 

For each of the four outcomes you will convert the dataset into a binary problem, i.e., a given outcome (positive outcome) vs. all other outcomes (negative outcomes). For example, all proteins that are labeled as DNA will be considered as positive, and the remaining proteins (RNA, DRNA and nonDRNA) as negative. Next, for each of the four outcomes you will compute the following measures:

            Sensitivity = SENS = 100 * TP / (TP + FN)

            Specificity = SPEC = 100 * TN / (TN + FP)

            PredictiveACC = 100 * (TP + TN) / (TP + FP + TN + FN)

            MCC = (TP*TN - FP*FN) / sqrt((TP+FP) * (TP+FN) * (TN+FP) * (TN+FN))

where TP is the number of true positives (correctly predicted positive outcomes), FP denotes false positives (negative outcomes that were predicted as positive), TN denotes true negatives (correctly predicted negative outcomes), and FN denotes false negatives (positive outcomes that were predicted as negative). You will also compute:

            averageMCC = (MCC_DNA + MCC_RNA + MCC_DRNA + MCC_nonDRNA) / 4

            accuracy = 100 * TP_all / (number of all proteins in the dataset)

where MCC_DNA, MCC_RNA, MCC_DRNA, and MCC_nonDRNA denote the MCC values when using the DNA, RNA, DRNA, and nonDRNA outcomes as the positives, and TP_all is the number of correctly predicted outcomes (DNA proteins predicted as DNA proteins, RNA proteins predicted as RNA proteins, etc.). These measures can be computed from the confusion matrix. You should round the values to one digit after the decimal point when reporting the accuracy, sensitivities, and specificities, and to three digits after the decimal point when reporting MCC. Your report must include the confusion matrix for your final/best solution.
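As a sanity check on your tool's output, the following is a minimal Python sketch (an illustration only; the assignment does not require code) of the four per-class measures and the two summary measures, computed directly from lists of true and predicted labels:

```python
import math

CLASSES = ["DNA", "RNA", "DRNA", "nonDRNA"]

def class_measures(y_true, y_pred, positive):
    """SENS, SPEC, PredictiveACC, and MCC for one class treated as positive."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    sens = 100 * tp / (tp + fn)
    spec = 100 * tn / (tn + fp)
    acc = 100 * (tp + tn) / (tp + fp + tn + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sens, spec, acc, mcc

def summary_measures(y_true, y_pred):
    """averageMCC over the four classes and the overall accuracy."""
    average_mcc = sum(class_measures(y_true, y_pred, c)[3] for c in CLASSES) / 4
    accuracy = 100 * sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return average_mcc, accuracy
```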

 

You must also provide and summarize predictions on the blind test dataset. To do that, you will build your model using the entire training dataset (using the same design, i.e., features, values of parameters, etc., as in your best 5-fold cross validation result) and use this model to predict the sequences from the blind test dataset. In your report, you must discuss the corresponding results on both the training and blind test datasets; on the blind test dataset, you can summarize your results by explaining and comparing how many proteins were predicted with each outcome.
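Continuing the hypothetical Python setup from the cross-validation sketch above, the blind-test step could be sketched as:

```python
from collections import Counter

# Retrain the final design on the entire training dataset (X, y as above).
model.fit(X, y)

# Placeholder for the feature vectors computed from the blind test file.
X_test = rng.random((50, 20))
test_pred = model.predict(X_test)

# Summarize how many proteins were predicted with each outcome.
print(Counter(test_pred))
```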

 

Design
You need to design your predictive model to maximize its predictive performance, evaluated based on averageMCC using the 5-fold cross validation on the training dataset. The design may consider:

Use of different features to encode the input protein sequence. Data mining algorithms require a rectangular dataset with a fixed-size feature vector for each object (protein). Thus, you will need to convert the input protein sequences (which have variable length) into a fixed set of (numerical) features. Lecture set 7 includes a few suggestions; a minimal encoding example is sketched after this list.
Selection of a subset of the input features. This could potentially speed up computation of the model, remove weak/noisy features, and reduce overfitting. Feel free to combine results of multiple feature selection methods.
Selection of the classification algorithm that you will use to compute your model from among many algorithms that are available in RapidMiner.
Parameterization of the selected classification algorithm(s). This involves setting the values of their key parameters.
Building a system with multiple models that are used together. For instance, you could use multiple models that predict all 4 classes and combine their results together to generate one prediction. Check the methods in RapidMiner at Operators → Modeling → Predictive → Ensembles.
Different ways to perform the prediction. There are at least two alternatives: use one model to predict all 4 classes vs. use 4 models to predict each of the four classes. In the latter case, you will have to combine the four results to select one “best” result for each protein. The advantage of the second approach is that you can choose different subsets of features and different classification algorithms and their parameters for each outcome/class.
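As an illustration of the first design item above, below is a minimal sketch of one common fixed-length encoding, amino acid composition. This is only an assumption of what a starting feature set might look like; Lecture set 7 may suggest this or richer alternatives (e.g., dipeptide composition or averaged physicochemical properties).

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aa_composition(sequence):
    """Map a variable-length protein sequence to 20 numerical features:
    the fraction of each standard amino acid in the sequence."""
    n = len(sequence)
    return [sequence.count(aa) / n for aa in AMINO_ACIDS]

# Every protein, regardless of its length, becomes a vector of 20 numbers.
example = aa_composition("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
```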
NOTE 1: Ensure that you perform all design activities (e.g., feature selection, selection and parameterization of the classification algorithms, etc.) using the 5-fold cross validation on the training dataset. Otherwise, you could overfit this dataset and your results on the test dataset could suffer.

NOTE 2: Your design should be done incrementally. Start with a simple initial solution (complete the entire design, prediction, and prediction assessment process) and gradually make your design more sophisticated, with the objective to improve the predictive performance. In your report, you should clearly indicate one best set of results, which must be selected based on the cross validation results on the training dataset. Moreover, these results should be compared with your intermediate results (earlier/simpler designs, other alternatives, etc.) and with the baseline results shown in Table 1, in order to justify your design choices. In your write-up, report your results by adding them into Table 1; this will make it easy to compare the different alternatives. Clearly indicate which result is the best/final. You should explain how you made the decisions that led you in a certain direction when redesigning your model. You should also provide a convincing argument for why and how your method is good/competitive in comparison to the baseline result in Table 1.

 

Table 1. Predictive results based on the 5-fold cross validation on the training dataset (this table is available on Blackboard).

Outcome    | Quality measure | Baseline result | Design 1 | Design 2 | Design 3 | Best Design
-----------|-----------------|-----------------|----------|----------|----------|------------
DNA        | Sensitivity     | 6.9             |          |          |          |
DNA        | Specificity     | 99.3            |          |          |          |
DNA        | PredictiveACC   | 95.2            |          |          |          |
DNA        | MCC             | 0.132           |          |          |          |
RNA        | Sensitivity     | 39.6            |          |          |          |
RNA        | Specificity     | 98.9            |          |          |          |
RNA        | PredictiveACC   | 95.3            |          |          |          |
RNA        | MCC             | 0.501           |          |          |          |
DRNA       | Sensitivity     | 4.5             |          |          |          |
DRNA       | Specificity     | 100.0           |          |          |          |
DRNA       | PredictiveACC   | 99.7            |          |          |          |
DRNA       | MCC             | 0.122           |          |          |          |
nonDRNA    | Sensitivity     | 98.6            |          |          |          |
nonDRNA    | Specificity     | 29.8            |          |          |          |
nonDRNA    | PredictiveACC   | 91.3            |          |          |          |
nonDRNA    | MCC             | 0.428           |          |          |          |
averageMCC |                 | 0.265           |          |          |          |
accuracy   |                 | 90.8            |          |          |          |

Deliverables
Each group shall provide the following four deliverables: 

1. Report that consists of:
- Cover page that gives the class number and title, the date of your submission, the name of your group, and the names of all team members.
- Description of the design of the prediction system. You should briefly explain the features that you generated from the input sequences; how and which features were selected; which classification algorithms and parameters you tried, why, and which you chose; and which other design options you considered and applied.
- Results (see the Evaluation of Predictions section). You must organize the results in a table using the format of Table 1. Using this format, compare your best cross validation results with the results from earlier/alternative designs and with the results shown in Table 1. Include the confusion matrix for your best solution. Summarize the predictions for the blind test dataset.
- Conclusions. This is a very important part of your report. You should comment on the quality of your results and compare them against the baseline results from Table 1. Also, describe your experience in this project, and explain the advantages and disadvantages of your method and why you think your results are good or bad in comparison with the other results from Table 1.
2. Predictions on the blind test dataset. These predictions should be submitted via email to lkurgan@vcu.edu as a text file named with the name of your group, where each row provides the prediction for a given “blind” protein. The format should be as follows:
            DNA
            DNA
            RNA
            nonDRNA
            …

where DNA, RNA, DRNA and nonDRNA are the predicted outcomes for the protein from the same row in the sequences_test.txt file. The instructor will use these results to evaluate your method on the blind test dataset against the true classes, and these results will be forwarded to you as part of the evaluation of your project.
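A minimal sketch of producing this file in Python (the predictions list and the file name below are placeholders; name the actual file after your group):

```python
# One predicted outcome per row, in the same order as the proteins
# in sequences_test.txt.
predictions = ["DNA", "DNA", "RNA", "nonDRNA"]  # replace with your model's outputs

with open("your_group_name.txt", "w") as out:
    out.write("\n".join(predictions) + "\n")
```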

3. Presentation
The presentation shall be 8 minutes long, plus 2 minutes for a question-and-answer session. It shall describe the design, results, and conclusions, and shall include the following parts:
Motivation for your design. Briefly explain how you arrived at your final design.
Description of your design. Explain (preferably with a diagram) how your method makes the predictions.
Discussion and comparison of the quality of the achieved best results using the results on the training dataset and Table 1.
This part is essential; see the conclusions part of your report.
4. Statement of contributions  
A short document with a bullet-point style list of detailed contributions to the project for each team member. The contributions cover all aspects of the project, including conceptualization and design of the methodology, implementation, testing, writing the report, preparing the presentation, making the presentation, coordination of the work, note taking, etc.
The contribution list for each team member should be accompanied by an estimated fraction of the total project effort, quantified in %. The effort estimates across the 5 team members must sum to 100%. Each team should strive to balance the effort at 20% per team member.
This statement will be used to distribute the project grade among the team members.
 
