CSCI-B 505 Applied Algorithms: Homework 4

Contents

Problem 1: Expectation-Maximization Algorithm for Clustering

Problem 2: Analysis of the EM over Real-world Data Sets

Problem 3: Algorithm Design

Directions

Appendix


Problem 1: Expectation-Maximization Algorithm for Clustering
Implement the expectation-maximization (EM) algorithm for Gaussian mixture models (see the EM algorithm in the appendix) in Python and call this program Gk. As you present your code, explain your protocol for each of the following (one possible protocol is sketched after the list):

1. initializing each Gaussian
2. deciding ties
3. stopping criteria
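
For concreteness, here is a minimal sketch of one possible protocol for these three choices; the function names and the tolerance value are illustrative, not prescribed by the assignment:

```python
import numpy as np

def init_params(X, k, rng):
    """One possible initialization (matches the appendix): means drawn
    from the data, identity covariances, uniform priors."""
    n, d = X.shape
    mu = X[rng.choice(n, size=k, replace=False)]      # k distinct data points
    sigma = np.stack([np.eye(d) for _ in range(k)])   # d x d identity per cluster
    pi = np.full(k, 1.0 / k)                          # uniform priors
    return mu, sigma, pi

def hard_assign(resp):
    """Ties in the responsibilities are broken toward the lowest
    cluster index, which is what np.argmax does by default."""
    return np.argmax(resp, axis=1)

def converged(mu_old, mu_new, eps=1e-6):
    """One possible stopping criterion: every mean moved less than eps."""
    return float(np.max(np.linalg.norm(mu_new - mu_old, axis=1))) < eps
```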

Problem 2: Analysis of the EM over Real-world Data Sets
Run your EM program, Gk, against the Ringnorm and Ionosphere data sets. Discuss your results.

•   Ringnorm Data Set

•   Ionosphere Data Set

Run Gk with k = 2,...,5 (20 runs for each k). Report error rates and iteration counts for each k using whisker plots. An example of a whisker plot can be found in the appendix.
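
A whisker (box) plot of the per-run results can be produced with matplotlib along these lines; the random data below is only a placeholder for the 20 observed error rates per k:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Placeholder data: substitute the error rates from your 20 runs of Gk.
errors_by_k = {k: rng.uniform(0.1, 0.4, size=20) for k in range(2, 6)}

plt.boxplot(list(errors_by_k.values()), labels=[f"k={k}" for k in errors_by_k])
plt.ylabel("error rate")
plt.title("Gk error rates over 20 runs per k")
plt.show()
```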

A simple error rate can be calculated as follows:

• After Gk converges, merge the clusters so that you end up with two clusters for any k, as follows: since the true clusters are known, the final clusters are determined by measuring the Euclidean distances (the easiest choice) between the true cluster centers and the predicted cluster centers.

In other words, you will always calculate the error for k = 2, since there are only 2 clusters in the given data sets. Below is an example of the error calculation for the Ionosphere data set; an error rate for the Ringnorm data set can be calculated similarly. After Gk converges, partition the data points into k clusters, C1,...,Ck, using the likelihoods (hard assignment). A cluster Ci is labeled good if P(gi) > P(bi) and bad otherwise.
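
One way to implement the merging step, assuming the true centers have been computed from the labeled data (the helper name `merge_to_two` is hypothetical):

```python
import numpy as np

def merge_to_two(pred_centers, true_centers):
    """Map each of the k predicted centers to its nearest true center
    (Euclidean distance), collapsing k clusters into the 2 true ones."""
    # dists[i, j] = distance from predicted center i to true center j
    dists = np.linalg.norm(pred_centers[:, None, :] - true_centers[None, :, :], axis=2)
    return np.argmin(dists, axis=1)   # length-k array of labels in {0, 1}
```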

For each cluster Ci, form two counts (over the Ionosphere data set):

$$g_i \leftarrow \sum_{\delta_j \in C_i} [\delta_j = g] \qquad \text{(good)}$$

$$b_i \leftarrow \sum_{\delta_j \in C_i} [\delta_j = b] \qquad \text{(bad)}$$

where [x = y] returns 1 if true and 0 otherwise. For example, [2 = 3] + [0 = 0] + [34 = 34] = 2. We can now calculate a simple error rate. Assume Ci is labeled good. Then the error it contributes is:

$$\mathrm{err}_i = b_i$$

(symmetrically, err_i = g_i if Ci is labeled bad). We can find the total error rate easily:

$$\text{error rate} = \frac{1}{n} \sum_{i=1}^{k} \mathrm{err}_i$$
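
Putting the counts and the two formulas together, a sketch of the whole error computation; it assumes `labels` holds the true "g"/"b" labels as a NumPy array and `assignments` holds the hard cluster assignments:

```python
import numpy as np

def error_rate(assignments, labels, k):
    """Each cluster takes its majority label; the minority count is its
    error contribution err_i, and the total is (1/n) * sum of err_i."""
    n = len(labels)
    total_err = 0
    for i in range(k):
        in_cluster = (assignments == i)
        g_i = int(np.sum((labels == "g") & in_cluster))   # good count
        b_i = int(np.sum((labels == "b") & in_cluster))   # bad count
        total_err += min(g_i, b_i)   # err_i = b_i if labeled good, else g_i
    return total_err / n
```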

Problem 3: Algorithm Design 
Problem 3.1
Given a text D and a pattern P, describe an O(d + p)-time method for finding the longest prefix of P that is a substring of D. The lengths of D and P are d and p, respectively.
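
One standard way to meet this bound is a single KMP-style scan of D using the failure function of P; the sketch below is one possible solution, not the only one:

```python
def longest_prefix_in_text(P, D):
    """Return the longest prefix of P occurring as a substring of D,
    in O(d + p) time via the KMP failure function."""
    p = len(P)
    if p == 0:
        return ""
    # failure[i] = length of the longest proper prefix of P[:i+1]
    # that is also a suffix of P[:i+1]
    failure = [0] * p
    j = 0
    for i in range(1, p):
        while j > 0 and P[i] != P[j]:
            j = failure[j - 1]
        if P[i] == P[j]:
            j += 1
        failure[i] = j
    best = j = 0
    for c in D:
        while j > 0 and c != P[j]:
            j = failure[j - 1]
        if c == P[j]:
            j += 1
        best = max(best, j)
        if j == p:            # all of P matched; no longer prefix exists
            break
    return P[:best]
```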

Problem 3.2
X, Y, and Z are three arrays, each with m elements. For an arbitrary integer t, describe an O(m² log m)-time algorithm to determine whether there exist numbers x in X, y in Y, and z in Z such that t = x + y + z.
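
A sketch of one approach that achieves this bound: sort Z once, then binary-search it for each of the m² pairs (x, y):

```python
import bisect

def three_sum_exists(X, Y, Z, t):
    """O(m log m) sort, then m^2 pairs with an O(log m) search each,
    for O(m^2 log m) overall."""
    Z_sorted = sorted(Z)
    for x in X:
        for y in Y:
            need = t - x - y
            i = bisect.bisect_left(Z_sorted, need)
            if i < len(Z_sorted) and Z_sorted[i] == need:
                return True
    return False
```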

Problem 3.3
Describe an efficient algorithm for deleting a string from a compressed trie and analyze its running time.

Directions
Please follow the syllabus guidelines in turning in your homework. While testing your programs, run them with a variety of inputs and discuss your findings. This homework is due Sunday, Nov 23, 2021 10:00pm. OBSERVE THE TIME. Absolutely no homework will be accepted after that time. All the work should be your own.

Appendix
The EM algorithm
This part is provided to help you implement the EM algorithm.

Let D = {xj | j = 1,...,n} be the data set, where each xj ∈ ℜd (ℜ: the reals) and D is a mixture of Gaussians. Given D, the number of blocks k, and a convergence threshold ϵ, the EM algorithm partitions the data into k clusters, C = {C1,C2,...,Ck}, where each Ci ∈ C can be characterized as a Gaussian distribution. Each cluster Ci ∈ C is represented by a multivariate normal distribution (MVN):

$$f(x_j \mid \mu_i, \Sigma_i) = \frac{1}{(2\pi)^{d/2} \, |\Sigma_i|^{1/2}} \exp\left( -\frac{1}{2} (x_j - \mu_i)^{\top} \Sigma_i^{-1} (x_j - \mu_i) \right)$$

where µi ∈ ℜd and Σi ∈ ℜd×d denote cluster mean and covariance matrix for cluster Ci, respectively, and they are unknown parameters. f(xj|µi,Σi) represents the probability density at xj for cluster Ci. Lastly, D is a mixture of C1,C2,...,Ck.

The algorithm iteratively alternates between (1) computing the likelihood of each data point being generated by each Gaussian (E-step) and (2) re-estimating the parameters (M-step). Iteration continues until the set of means is stable, i.e., no mean changes by more than the threshold ϵ. One iteration is sketched after the initialization list below.

• Initialization:

–    µi is randomly selected from D for each cluster.

–    Σi ← I. For each cluster, the covariance matrix is a d × d identity matrix.

–    P(Ci) ← 1/k. The priors are uniformly initialized.
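
Under these initial values, one EM iteration can be sketched as follows; this sketch uses scipy's multivariate_normal for the density f and is illustrative, not a full implementation of Gk:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, mu, sigma, pi):
    """One iteration: E-step computes responsibilities, M-step
    re-estimates means, covariances, and priors."""
    n, d = X.shape
    k = len(pi)
    # E-step: resp[j, i] proportional to P(Ci) * f(xj | mu_i, Sigma_i)
    resp = np.zeros((n, k))
    for i in range(k):
        resp[:, i] = pi[i] * multivariate_normal.pdf(X, mean=mu[i], cov=sigma[i])
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: weighted re-estimation of the parameters
    Nk = resp.sum(axis=0)
    mu_new = (resp.T @ X) / Nk[:, None]
    sigma_new = np.zeros((k, d, d))
    for i in range(k):
        diff = X - mu_new[i]
        sigma_new[i] = (resp[:, i, None] * diff).T @ diff / Nk[i]
    pi_new = Nk / n
    return mu_new, sigma_new, pi_new
```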

 


Figure 1: An example of a whisker plot
