Starting from:

$30

CS464-Homework 1 Birthday Paradox and Medical Diagnosis Solved

1         Birthday Paradox
This question is changed to be an exercise question.

Assume that you are a student in an auditorium. Your math professor, David Caine, starts the semester with an interesting question: What is the probability that any two student in the auditorium has the same birthday?

Question 1.1  Ignoring the existence of leap years, find and plot the probability that at least two students have the same birthday in an auditorium of n students where n ∈ {2,3,5,10,20,30,40,50,75,100}. Do not forget to add proper axis titles and legend (if necessary).

Question 1.2  Ignoring the existence of leap years, what is the minimum number of students to make sure that the probability of any two student have the same birthday will be 100%? Explain your reasoning explicitly.




2         Medical Diagnosis
This question is changed to be an exercise question.

Assume that, as a clinic worker, you are asked to conduct lab tests for diagnosis of a disease, namely ML464HW1. From experiments, it is known that any person in the population either has the disease (positive), or has not (negative), i.e. there is no carrier. Over the entire population of people only 1.1% have this disease and the lab test returns a correct positive result in only 94% of the people in which the disease is actually present and a correct negative result in only 98% of the people in which the disease is not present. The state of the patient is represented with random variable S and the test results are represented with random variable T.

Question 2.1  Obtain and report the following probabilities:

− P(S = disease)

− P(S = healthy)

− P(T = positive|S = disease)

− P(T = negative|S = disease) − P(T = positive|S = healthy)

− P(T = negative|S = healthy)

Question 2.2  Suppose a new patient comes to the clinic and the test returns a positive result. Show whether the patient should be diagnosed as having the disease or not. Explain your reasoning explicitly.



Question 2.3  Assume that the person is tested multiple times to make sure the result is confident if the test comes positive. What is the minimum number of tests you should conduct to diagnose the patient in the previous part as sick? Explain your reasoning explicitly.



3         HIV-1 Protease Cleavage 
Dataset
A protease is an enzyme that is responsible for cleaving proteins into polypeptides or single amino acids. A specific protease, namely HIV-1 PR, cleaves polyproteins into smaller proteins that is essential for an HIV virion and allows the HIV virus to be infectious.

k-mers are subsequences of length k contained within any sequence. As an example assume that you have the sequence ABCDEABCD, a table for possible k-mers of this sequence is given below.



Your task for this question is to identify amino acid 8-mers that can be cleaved by HIV-1 PR in the central position (between amino acids 4 and 5). Your dataset is taken from [1], which consist of 6590 labeled 8-mers in total. We have merged these datasets and created train and test sets of sizes 6062 and 528, respectively. You will use the following files for this question:

•   q2trainset.txt

•   q2test set.txt

•   q2gagsequence.txt

Each amino acid in the sequence is labeled using one-hot encoding, and since there are 20 amino acids in total, we have 160 features for an 8-mer. Each line in given files corresponds to a single 8-mer. First 20 columns denote the first amino acid, the second 20 columns denote the second amino acid, and so forth. The amino acids are enumerated as follows:

− G (Glycine), index: 0

− P (Proline), index: 1

− A (Alanine), index: 2

− V (Valine), index: 3

− L (Leucine), index: 4

− I (Isoleucine), index: 5

− M (Methionine), index: 6

− C (Cysteine), index: 7

− F (Phenylalanine), index: 8

− Y (Tyrosine), index: 9

− W (Tryptophan), index: 10

− H (Histidine), index: 11

− K (Lysine), index: 12

− R (Arginine), index: 13

− Q (Glutamine), index: 14

− N (Asparagine), index: 15

− E (Glutamic Acid), index: 16

− D (Aspartic Acid), index: 17

− S (Serine), index: 18

− T (Threonine), index: 19

For a 4-mer AWNT for instance, there would be 80 features and all columns would be 0 except the columns with indices 2,30,55 and 79, as represented in the figure below.



The final column in each row represents the ground truth label for the corresponding 8-mer. The label for any 8-mer is either ”1”, meaning it can be cleaved by HIV-1 PR in the middle, or ”0”, meaning HIV-1 PR cannot cleave this 8-mer in the middle.

Bernoulli Naive Bayes Model
The naive assumption allows us to think of all features as conditionally independent from each other so that we can multiply each P(Xi = xi|Y = yk), instead of considering their joint probabilities.

ni

                                               P(Si |Y = yk) = P(X1 = x1,X2 = x2,...,Xn = xn) = YP(Xj = xj |Y = yk)                (3.1)

j=1

In Eq. 3.1, Xj represents a binary one-hot encoded feature that is described above, whose value is either 1 or 0. Si denotes the ith 8-mer sequence. Y represents the ground truth value for Si. We have 2 classes, therefore yk can be 0 or 1. We would like to find P(Y = 0|X1 = x1,X2 = x2,...,Xn = xn) and P(Y = 1|X1 = x1,X2 = x2,...,Xn = xn) so that the classification can be performed. Using the naive assumption and Bayes’ equation together, we obtain the following relation:

                       P                                                                                               (3.2)

In Eq. 3.2, ni denotes the total number of features. Please note that for both yk = 0 and yk = 1, the denominator of Eq. 3.2 remains the same. Therefore, to compare probabilities for each class, we can ignore the denominator value and define P(Y = yk|Si) as

ni

                                                                  P(Y = yk|Si) ∝ P(Y = yk) YP(Xj |Y = yk)                                   (3.3)

j=1

Probabilities are floating point numbers and multiplying many probability values could cause numerical inaccuracies or overflows. Therefore, it is a good practise apply log(x) on these probability values to avoid any computational issues. After taking log, the predicted label can be found using the equation below.

                                                               (3.4)

where ˆyi is the predicted label for the i-th example.

The parameters to learn and their MLE estimators are as follows:



•    Ti,j,0 is the number of occurrences of the amino acid j at ith position in 0 labeled instances of the training set.

•    Ti,j,1 is the number of occurrences of the amino acid j at ith position in 1 labeled instances of the training set.

•    N0 is the number of instances with label 0 in the training set.

•    N1 is the number of instances with label 1 in the training set.

•    N is the total number of 8-mers in the training set.

•    πy=0 estimates the probability that any particular 8-mer will belong to class 0.

•    πy=1 estimates the probability that any particular 8-mer will belong to class 1.

•    θi,j |y=0 estimates the probability that a particular 8-mer will contain the amino acid j at ith position in ”0” labeled instances of the training set.P(Xj |Y = 0)

•    θi,j |y=1 estimates the probability that a particular 8-mer will contain the amino acid j at ith position in ”1” labeled instances of the training set.P(Xj |Y = 1)

For all questions after this point, consider your test set as a validation set and assume that there is another test set that is not given to you.

Question 3.1 Train a Bernoulli Naive Bayes model on the training set and evaluate your model on the test set given. Find and report the accuracy for the test set.

If it arises in your code, define log0 as it is, that is -inf. In case of ties, you should predict ”0”. Report your test set accuracy in your report.

Question 3.2  HIV-1 protease cleaves polyproteins such as ”Group-specific antigen (Gag)”. A sample gag polyprotein amino acid sequence of length 500 is given to you in a file with name q2gagsequence.txt [2].

− Use this file to create 493 8-mers by sliding a window of size 8 and one-hot encode amino acids in an 8-mer using the enumeration given in the Dataset section to obtain a dataset of size 493 × 160 for the gag polyprotein.

− Using your trained model, find all the locations where HIV-1 PT would cleave this polyprotein. Report your result by giving the exact amino acid indices for the whole 500 amino acid sequence where cleavage would happen (starting from index 0). For instance, if we had a polyprotein of 20 amino acids and our model labeled our 4th and 10th 8-mers as 1, then the cleavage would happen between amino acid indices 7-8 and 13-14, resulting with 3 smaller polypeptides in the end of lengths 8, 6 and 6 amino acids. Please remember that when your model label an 8-mer as 1, the cleavage would happen between the 2 amino acids at the center. GY WRKDIM would be cleaved between R and K amino acids if it is labeled as 1.

Question 3.3 Using your trained model, find the 8-mer for which your model assigns it to class 1 with highest probability. Similarly, find the 8-mer for which your model assigns it to class 0 with lowest probability. Discuss your findings.

Question 3.4  For Bernoulli Naive Bayes model, MLE estimation of the parameters are obtained using Bernoulli distribution. Now, let us change the prior distribution for parameters by introducing additive or Laplace smoothing.

Extend your classifier so that it can compute MAP estimates of θ parameters using a fair Dirichlet prior. This corresponds to additive smoothing. The prior is fair in the sense that it presumes that each amino-acid appears additionally α times at a specific position in all train set.

This time the parameters to learn and their MAP estimators are as follows:



For this question, use α values in {0,1,2,3,4,5,6,7,8,9,10} and train your classifier using all of the training set and have it classify all of the test set and report test set classification accuracy for all given α values. Plot α vs. test set accuracy values and put your plot in your report. In addition, use the same set of α values given above to train your model using only first 75 rows of the training set and have it classify all of the test set. Report accuracy value and plot α values vs. test set accuracy once more. Do not forget to add proper axis names, plot title and legend to your plots. Compare your results from this question with the results from Question 3.1 and discuss the effect of additive smoothing. What happens when the amount of data you have is insufficient ? How does MAP estimates of parameters affect your performance when the amount of training data is small ? Explain your findings clearly.

Question 3.5  For this part of the question, you are instructed to obtain mutual information values between each of the 160 features and the labels. You can check [4] to gain an insight about mutual information. Calculate mutual information of each feature with respect to label. If you encounter with a division by zero error, simply set the corresponding mutual information value to +inf. After finding mutual information values, perform the following operations in order.

− Sort mutual information values in descending order and also keep the original indices.

− Train a model (using MLE estimation for learning its parameters) using all training set instances, but only use first k features with the highest mutual information. Repeat this step for ∀k ∈ {1,2,3,...,160}. For each run, also calculate the test set accuracy and save them.

− After the step above is complete, find the maximum accuracy value and the corresponding k value for that accuracy value, and report them. Is the maximum accuracy you found higher than the accuracy value you obtained for Question 3.1 ?

Question 3.6  For this part of the question, you are expected to apply principal component analysis (PCA) to the given dataset.

− Apply PCA on amino acid 8-mers, Xi, to obtain first 3 principal components for each Xi. Plot 3D projections of amino acid 8-mers by using these components.

− Report proportion of variance explained (PVE) for the obtained principal components. Discuss your results.

− Discuss whether it is feasible to apply PCA to given data set. Explain your reasoning.

More products