Starting from:

$30

CSCI3220-Assignment 5 Solved

In this, you will work out some details of using the quad tree structure in hierarchical clustering. You are encouraged to use some ways (which could be, but not necessarily, writing a small program) to do some tedious calculations. You will then implement the min-heap data structure. Finally, you will answer some questions related to the k-means clustering method.

 

 

Questions:

 

1.       This question is about hierarchical clustering. You are given the following gene expression data matrix X of five genes g1-g5 measured in six samples s1-s6.

X
s1
s2
s3
s4
s5
s6
g1
4
6
5
1
2
8
g2
1
4
6
8
1
3
g3
2
9
3
5
1
6
g4
8
5
2
1
3
4
g5
7
1
1
3
2
9
 

(a)           Produce the pair-wise distance matrix among the five genes, where the distance between any two genes gi and gj is defined as their squared Euclidean distance, 𝑑"𝑔$, 𝑔&’ = ∑ 0,12"*+3,-*.,’/, in which xik is the expression level of gene gi in sample sk. Give distance values up to 1 decimal point.  

(b)   Suppose you want to perform agglomerative hierarchical clustering of the five genes based on the distance matrix in Part a, using a quad tree to index the distance values. Show the initial quad tree, with the different levels of the tree clearly indicated.

 

(c)           For each of the following link types, show the updated quad tree after the first merge of two clusters: i) single-link; ii) average-link; iii) complete link. For the two merging clusters, the original entries for the one with the smaller index will be used to store the distances of the new cluster.           

2.       Write a computer program called MinHeap in C, C++, Java or Python that builds a min-heap for a list of integers and remove the smallest integer one by one until the heap is empty. Specifically, the input includes the followings:

Ÿ   The number of integers on the list

Ÿ   The integers, one per line, which can be assumed to be all unique

 

The program should treat the input integers as an array that follows their input order and turn it into a heap. The program should print to standard output (stdout) this initial heap and the updated heap after removing each smallest value. Each heap should be printed as a single commadelimited list on a separate line.

 

 

For example, if your implementation is Java, below is a typical run of the program (based on an example from the lecture notes):

 

java MinHeap

10

13

5

10

4

11

1

9

12

8

6

1,4,9,5,6,10,13,12,8,11

4,5,9,8,6,10,13,12,11

5,6,9,8,11,10,13,12

6,8,9,12,11,10,13

8,11,9,12,13,10

9,11,10,12,13

10,11,13,12

11,12,13

12,13

13
 

The main part of your program is expected to contain no more than 100 lines of code.

 

Your program will be graded based on i) whether it can be compiled/run successfully, ii) whether it follows the input/output formats as specified above, iii) its accuracy on a number of test cases and iv) whether the program is well-documented with appropriate comments added to explain the meaning of the code.      (50%)

 

 

3. This question is about k-means.

 

(a)   Suppose the data set contains an outlier point that is very different from all the other points.

                Describe how it would affect the clusters produced by k-means.                       

 

(b)   Suppose the data set contains a column with values much larger in magnitude than the other columns. Describe how it would affect the clusters produced by k-means.

 

(c)   There is another clustering method that is similar to k-means, but instead of assigning every point to the cluster with the closest representative, it assigns every point to each cluster in a fuzzy manner according to its distance to the cluster representative. For example, suppose k=2 and a point has a distance of 10 from the representative of cluster 1 and a distance of 20 from the representative of cluster 2, it is assigned a membership value of (1/10)/[(1/10)+(1/20)] = 2/3 to cluster 1 and a membership value of (1/20)/[(1/10)+(1/20)] = 1/3 to cluster 2. Describe one advantage and one disadvantage of this method as compared to k-means.        

                                  

(d)   [In practice, the number of clusters is often not known. Propose a way to determine an appropriate value of k based on the input data set

More products