$30
1 Non-negative matrix factorization
Setup
First load the dataset and import scikit-learn’s decomposition module:
import math import matplotlib.pyplot as plt import numpy as np
from sklearn.datasets import load_digits from sklearn import decomposition digits = load_digits()
X = digits["data"]/16. Y = digits["target"]
1.1 Comparison of scikit-learn’s NMF with SVD
Use the decomposition module to compare non-negative matrix factorization (NMF) with singular value decomposition (SVD, np.linalg.svd) on the digits dataset where the methods factorize X (the matrix of flattened digit images) in the following way:
X=Z · H
(NMF)
(1)
X=U · S · VT
(SVD)
(2)
X,Z,H ∈ R≥0. If X and your number of latent components is M then Z and H . Run SVD with full rank and then select the 6 rows from VT corresponding to the largest singular values. Use at least 10 components for NMF. Note that you must use centered data for SVD (but not for NMF, of course). Reshape the selected basis vectors from H and VT into 2D images and plot them. One can interpret these images as a basis for the vector space spanned by the digit dataset. Compare the bases resulting from SVD and NMF and comment on interesting obervations.
1.2 Implementation
We learned in the lecture that the NMF can be found by alternating updates of the form
ZTt X (3)
Ht+1 ← HtZTZtHt
t
XHTt+1
Zt+1 ← ZtZtHt+1HTt+1 (4)
Numerators and denominators of the fractions are matrix multiplications, whereas the divisions and multiplicative updates must be executed element-wise. Implement a function non_negative(data, num_components) that calculates a non-negative matrix factorization with these updates, where num_components is the desired number of features M after decomposition. Initialize Z0 and H0 positively, e.g by taking the absolute value of standard normal random variables (RV) with np.random.randn. Iterate until reasonable convergence, e.g. for t = 1000 steps. Note that you might have to ensure numerical stability by avoiding division by zero. You can achieve this by clipping denominators at a small positive value with np.clip. Run your code on the digits data, plot the resulting basis vectors and compare with the NMF results from scikit-learn (results should be similar). Can you confirm that the squared loss is non-increasing as a function of t?
2 Recommender system
Use your code to implement a recommendation system. We will use the movielens-100k dataset with pandas, which you can download as an “external link” on MaMPF.
import pandas as pd # install pandas via conda
#column headers for the dataset ratings_cols = [’user id’,’movie id’,’rating’,’timestamp’] movies_cols = [’movie id’,’movie title’,’release date’, ’video release date’,’IMDb URL’,’unknown’,’Action’,
’Adventure’,’Animation’,’Childrens’,’Comedy’,’Crime’,
’Documentary’,’Drama’,’Fantasy’,’Film-Noir’,’Horror’,
’Musical’,’Mystery’,’Romance ’,’Sci-Fi’,’Thriller’,
’War’ ,’Western’] users_cols = [’user id’,’age’,’gender’,’occupation’,
’zip code’]
users = pd.read_csv(’ml-100k/u.user’, sep=’|’, names=users_cols, encoding=’latin-1’)
movies = pd.read_csv(’ml-100k/u.item’, sep=’|’, names=movies_cols, encoding=’latin-1’)
ratings = pd.read_csv(’ml-100k/u.data’, sep=’\t’, names=ratings_cols, encoding=’latin-1’)
# peek at the dataframes, if you like :) users.head() movies.head() ratings.head()
# create a joint ratings dataframe for the matrix fill_value = 0 rat_df = ratings.pivot(index = ’user id’,
columns =’movie id’, values = ’rating’).fillna(fill_value) rat_df.head()
The data matrix X is called rat_df in the code. It is sparse because each user only rated a few movies. The variable fill_value = 0 determines the default value of missing ratings. You can play with this value (e.g. set it to the average rating of all movies, or to the average of each specific movie instead of a constant).
Now compute the non-negative matrix factorization. Play with the number of components m in your factorisation. You should choose m such that the reconstruction H is less sparse than the actual rating matrix. This allows the recommender system to suggest a movie to a user when that movie has not been rated in X by him/her, but is predicted in Xb to receive a high rating. Write a method to give movie recommendations for movies, which user user_id has not yet seen (or at least rated):
reconstruction = pd.DataFrame(Z @ H, columns = rat_df.columns) predictions = recommend_movies(reconstruction, user_id, movies, ratings)
You can also add some ratings for additional users (yourself) and check if the resulting recommendations make sense. Show (e.g. with histograms) that the “genre-statistics” vary between already rated movies and predicted movies (e.g. with 20 predictions) for selected users. What is the difference and how do you explain it? Also try to identify rows in H that can be interpreted as prototypical user preferences (e.g. “comedy fan”).
Sidenote: At least until around 2012, Netflix was using a similar SVD++ reconstruction together with a restricted Boltzmann machine (RBM) to give recommendations. [1]
[1] https://netflixtechblog.com/netflix-recommendations-beyond-the-5-stars-part-1-55838468f429