1. Political blogs dataset
We will study a political blogs dataset first compiled for the paper Lada A. Adamic and Natalie Glance, “The political blogosphere and the 2004 US Election”, in Proceedings of the WWW-2005 Workshop on the Weblogging Ecosystem (2005). It is assumed that blog sites with the same political orientation are more likely to link to each other, thus forming a “community” or “cluster” in the graph. In this question, we will see whether or not this hypothesis is likely to be true based on the data.
The dataset txt contains a graph with n = 1490 vertices (“nodes”) corresponding to political blogs.
The dataset txt contains the edges between the vertices. You may remove isolated nodes (nodes that are not connected to any other node) in the pre-processing step.
We will treat the network as an undirected graph; thus, when constructing the adjacency matrix, make it symmetric by, e.g., setting the entry in the adjacency matrix to one whenever there is an edge between the two nodes (in either direction).
In addition, each vertex has a 0-1 label (in the third column of the data file) corresponding to the true political orientation of that blog. We will treat this as the ground-truth label and check whether spectral clustering tends to group nodes with the same political orientation into the same cluster.
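The pre-processing described above (symmetric adjacency matrix, dropping isolated nodes) can be sketched as follows. Since the actual data file names are not specified here, the sketch runs on a tiny synthetic edge list:

```python
import numpy as np

def build_adjacency(edges, n):
    """Symmetric 0/1 adjacency matrix from a list of (u, v) edges (0-indexed)."""
    A = np.zeros((n, n))
    for u, v in edges:
        A[u, v] = 1
        A[v, u] = 1  # undirected: mirror every edge
    return A

def drop_isolated(A, labels):
    """Remove nodes with no incident edges, keeping the label vector aligned."""
    keep = A.sum(axis=1) > 0
    return A[np.ix_(keep, keep)], labels[keep]

# Tiny synthetic example: node 3 is isolated and gets removed.
edges = [(0, 1), (1, 2), (0, 2)]
labels = np.array([0, 1, 1, 0])
A = build_adjacency(edges, 4)
A, labels = drop_isolated(A, labels)
print(A.shape)   # (3, 3)
print(labels)    # [0 1 1]
```

For the real dataset, replace the synthetic edge list with the edges read from the provided file and `n = 1490`.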
(10 points) Assume the number of clusters in the graph is k. Explain the meaning of k here intuitively.
(10 points) Use spectral clustering to find the k = 2, 5, 10, 20 clusters in the network of political blogs (each node is a blog, and the edges are defined in the file txt). Then report the majority label in each cluster, for each value of k. For example, if there are k = 2 clusters with labels {0,1,1,1} and {0,0,1}, then the majority label for the first cluster is 1 and for the second cluster is 0. You are required to implement the algorithm yourself rather than calling it from a package.
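One standard variant of spectral clustering (the unnormalized Laplacian, with a plain Lloyd's k-means on the spectral embedding) can be sketched as below; the assignment leaves the exact variant open, so treat this as one reasonable choice rather than the required one. The toy check uses a hypothetical 6-node graph:

```python
import numpy as np

def spectral_clustering(A, k, n_iter=50):
    """Sketch of unnormalized spectral clustering, assuming a symmetric 0/1 adjacency A."""
    D = np.diag(A.sum(axis=1))
    L = D - A                           # unnormalized graph Laplacian
    _, vecs = np.linalg.eigh(L)         # eigenvalues in ascending order
    X = vecs[:, :k]                     # embed each node by the k smallest eigenvectors
    # Farthest-first initialization, then plain Lloyd's k-means on the embedding.
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        assign = dist.argmin(axis=1)
        for j in range(k):
            if (assign == j).any():
                centers[j] = X[assign == j].mean(axis=0)
    return assign

# Toy check: two triangles joined by one bridge edge should split into two clusters.
A = np.zeros((6, 6))
for u, v in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    A[u, v] = A[v, u] = 1
assign = spectral_clustering(A, k=2)
print(assign)  # nodes 0-2 share one label, nodes 3-5 the other
```

A normalized Laplacian (e.g. D^{-1/2} L D^{-1/2}) is another common choice and often behaves better on graphs with skewed degree distributions, such as this one.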
(10 points) Now compare the majority label with the individual labels in each cluster, and report the mismatch rate for each cluster, for k = 2, 5, 10, 20. For instance, in the example above, the mismatch rate for the first cluster is 1/4 (only the first node differs from the majority) and for the second cluster is 1/3.
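The majority label and per-cluster mismatch rate can be computed directly from the cluster assignments; the sketch below reproduces the worked example from the text (clusters {0,1,1,1} and {0,0,1}):

```python
import numpy as np

def mismatch_rates(assign, labels, k):
    """Majority label and mismatch rate per cluster, as defined in the problem."""
    out = []
    for j in range(k):
        members = labels[assign == j]
        # Majority vote over 0/1 labels; ties broken toward 1 (any convention works).
        majority = int(2 * members.sum() >= len(members))
        out.append((majority, float((members != majority).mean())))
    return out

# Worked example from the text: clusters {0,1,1,1} and {0,0,1}.
assign = np.array([0, 0, 0, 0, 1, 1, 1])
labels = np.array([0, 1, 1, 1, 0, 0, 1])
print(mismatch_rates(assign, labels, 2))  # [(1, 0.25), (0, 0.3333...)]
```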
(10 points) Tune your k and find the number of clusters that achieves a reasonably small mismatch rate. Please explain how you tune k and what mismatch rate you achieve.
(10 points) Please explain your findings and what you can learn from this data analysis.
2. Eigenfaces and simple face recognition
This question is a simplified illustration of using PCA for face recognition. We will use a subset of data from the famous Yale Face dataset.
Remark: You will have to downsample each image by a factor of 4 to turn it into a lower-resolution image as a preprocessing step (e.g., reduce a picture of size 16-by-16 to 4-by-4). In this question, you can implement your own code or call packages.
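One simple way to downsample by a factor of 4 in each dimension is 4x4 block averaging; taking every 4th pixel, or a library resize routine, would also satisfy the remark. A minimal sketch:

```python
import numpy as np

def downsample4(img):
    """Downsample by 4 in each dimension via 4x4 block averaging (one common choice)."""
    h, w = img.shape
    h, w = h - h % 4, w - w % 4                      # crop so dimensions divide evenly
    return img[:h, :w].reshape(h // 4, 4, w // 4, 4).mean(axis=(1, 3))

# Synthetic 16x16 "image", as in the remark's example.
img = np.arange(16 * 16, dtype=float).reshape(16, 16)
print(downsample4(img).shape)  # (4, 4)
```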
First, given a set of images for each person, we generate eigenfaces from these images. Treat each picture of the same person as one data point for that person. Note that you will first vectorize each image, which is originally a matrix; the data matrix for each person thus has one row per vectorized picture. You will find weight vectors that combine the pictures into different “eigenfaces”, corresponding to the first few principal components of that person’s pictures.
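Computing eigenfaces amounts to PCA on the per-person data matrix; one compact route is an SVD of the centered matrix, whose top right singular vectors are the leading principal components. A sketch on a synthetic stand-in (random vectors instead of the actual Yale images, whose loading is not specified here):

```python
import numpy as np

def eigenfaces(X, m):
    """Top-m eigenfaces of a data matrix X (rows = vectorized images).

    Sketch: center the rows, then take the top right singular vectors,
    i.e. the leading principal components of the pixel covariance.
    """
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return Vt[:m], mu              # each row of Vt[:m] reshapes back into an image

# Synthetic stand-in: 10 random "images" of 64 pixels each.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 64))
faces, mu = eigenfaces(X, 6)
print(faces.shape)                               # (6, 64)
print(np.allclose(faces @ faces.T, np.eye(6)))   # True: components are orthonormal
```

To visualize an eigenface, reshape the corresponding row back to the downsampled image dimensions, as the plotting part of the question asks.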
(25 points) Perform the analysis on the Yale face dataset for Subject 1 and Subject 2, respectively, using all the images EXCEPT for the two pictures named subject01-test.gif and subject02-test.gif. Plot the first 6 eigenfaces for each subject. When visualizing, please reshape the eigenvectors into proper images. Can you see any patterns in the top 6 eigenfaces? Please explain.
(25 points) Now we will perform a simple face recognition task.
Face recognition through PCA proceeds as follows. Given the test images subject01-test.gif and subject02-test.gif, first downsample each by a factor of 4 (as before) and vectorize it. Take the top eigenface of Subject 1 and Subject 2, respectively. Then calculate the normalized inner product score between the two vectorized test images and the vectorized eigenfaces:
s_{ij} = \frac{(\text{eigenface}_i)^T (\text{test image}_j)}{\|\text{eigenface}_i\| \cdot \|\text{test image}_j\|}
Report all four scores s_{ij}, i = 1, 2, j = 1, 2. Explain how to recognize the faces in the test images using these scores.
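The score above is just a cosine similarity between two vectors. A minimal sketch with hypothetical 3-dimensional vectors in place of the real eigenfaces and test images (note that eigenvector signs are arbitrary, so comparing magnitudes |s_ij| is a reasonable convention):

```python
import numpy as np

def score(eigenface, test_vec):
    """Normalized inner product s_ij between an eigenface and a vectorized test image."""
    return float(eigenface @ test_vec /
                 (np.linalg.norm(eigenface) * np.linalg.norm(test_vec)))

# Hypothetical toy vectors: the test image is nearly aligned with eigenface 1.
ef1 = np.array([1.0, 0.0, 0.0])
ef2 = np.array([0.0, 1.0, 0.0])
test = np.array([2.0, 0.5, 0.0])
s11, s21 = score(ef1, test), score(ef2, test)
print(s11 > abs(s21))  # True: the aligned eigenface scores higher
```

A test image would then be assigned to the subject whose eigenface gives the larger score magnitude.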
(Bonus: 5 points) Explain whether this face recognition approach can work well, and discuss how it could possibly be improved.