$25
Question 1
While simrank is suitable for search engines, how would you adapt it for display ads where are there are no queries?
Question 2
Consider the following methods to compute user and item biases in neighborhood models for recommendation.
Figure 1: Question 4: figure illustrating the integration of the binary view of the rating matrix, into matrix factorization
where R(i) denotes the set of users who have rated on item i, and R(u) denotes the set
of items rated by user u. Explain the purpose of constants λ2 and λ3.
Question 3
In the matrix factorization model for recommendation, how do you determine the number of latent factors?
Question 4
Consider figure 1 which illustrates factoring in the binary view of the rating matrix, into matrix factorization. Why does the information from the (dense) binary view (which is derived from the rating matrix) produce a 7% improvement in the results over direct matrix factorization on the rating matrix?
Question 5 (7 + 14 = 21)
The objective of this assignment is to design a movie recommender system, i.e., recommend movies to users using item-item neighborhood models. The dataset includes ratings provided by users to some movies, and movie metadata that includes title, overview, etc.
Let the set of users be represented by U = {u ∈ U} and set of movies by represented by I = {i ∈ I}. First, estimate movie biases {bi : i ∈ I} by averaging over users that rated the movie, and then estimate user biases by averaging residuals over movies rated by the user.
where µ is the global mean calculated across all the ratings available in the dataset. The
rating for user u and movie i is predicted as:
where sij denotes the similarity between movies i and j. Note that the neighborhood R(u) is computed over all items rated by user u. In this assignment, we explore two approaches to compute movie-movie similarities.
1. Pearson correlation: Estimate movie-movie similarity using empirical Pearson correlation coefficient on shared support of items i and j, defined as:
where U(i,j) is the set of users who have rated movies i and j, and bui = bu + bi + µ
2. Content similarity:
Movie metadata is a collection of words, sij is the cosine similarity between the two movie documents i and j. In order to calculate cosine similarity, we first convert the documents to a vectors using TF-IDF.
Term Frequency also known as TF, measures the number of times a term (word) occurs in a document, normalized by document length.
Number of times term t appears in document of movie i
TF(t,i) =
Total number of terms in document of movie i
Inverse Document Frequency also known as IDF, measures how important a term is to a particular movie.
Total number of movies in dataset
Number of movies containing term t
Now let the set of all unique words across all movie documents be V = {w1,...,w|V |}.
We define the vector for movie ) where dik = TF(wk,i)×IDF(wk)
The similarity between movies i and j, is now computed as:
sij = cosine-sim(di,dj)
In this MP, given an input (u,i) , you need to output the estimated rating value ˆrui. Round ˆrui to 1 decimal point.
Input format: The input contains the (user,movie) ratings, movie metadata and the (user,movie) pairs for which you need to estimate the ratings.
The first line of the input contains two space separated integers R M. R is the number of lines of user-movie ratings and M is the number of movies. Next R lines contain the ratings. Each line contains three space-separated values (user-id, movie-id, rating). Next M lines contain the metadata. The first entry in each line is the movie id, followed by words describing movie metadata. The last five lines contain two space-separated integers (target user-id, target movie-id) for which you need to estimate the ratings. You should submit your code on compass and the code outputs in the pdf submitted to gradescope.