This assignment contains two parts. First, you will implement a Model-based Collaborative Filtering (CF) recommendation system using Spark MLlib. Second, you will implement either a User-based CF system or an Item-based CF system without using a library. The dataset you are going to use is the Yelp challenge dataset. The task sections below explain the assignment instructions in detail. The goal of the assignment is to help you understand how different types of recommendation systems work and, more importantly, to find ways to improve the accuracy of a recommendation system yourself.
Environment Requirements
Python: 2.7 Scala: 2.11 Spark: 2.3.1
Students can use Python or Scala to complete both Task1 and Task2.
You can only use Spark RDD. This is meant to give you a deeper understanding of Spark; you will not get any points if you use DataFrame or Dataset.
Data
In this assignment, we will use the Yelp challenge dataset, which can be downloaded from this link: Yelp Challenge. To download the dataset, you need to sign up individually with your email on the Yelp challenge website. A detailed introduction to the data can also be found through the link, in the documentation tab. After you download and unzip the data, the dataset contains six .json files and two .pdf files. In this assignment, only the review file is used, and only three of its columns: user id, business id, and stars.
The Yelp dataset contains about 6 million review records involving over a million users and more than a hundred thousand businesses. Because of the huge volume and the sparseness of the user-business ratings, a recommendation system on the full data requires a lot of computation, so we extracted a subset of the whole dataset so that the assignment can finish in a reasonable time on every student's laptop. To finish this assignment, you only need the two data files in the Data/ folder.
We recommend downloading the whole dataset as a playground for data mining or any other area.
Yelp Dataset Description
yelp_academic_dataset_business.json : 188,593 records
Attributes: Business ID, address, name, city, Business hours, Categories, rating and reviews count
yelp_academic_dataset_review.json : 5,996,996 records
Attributes: review ID, user ID, business ID, rating, comments
yelp_academic_dataset_user.json : 1,518,169 records
Attributes: user ID, name, review count, Yelp join date
yelp_academic_dataset_checkin.json : 157,075 records
Attributes: Business ID, time
yelp_academic_dataset_tip.json : 1,185,348 records
Attributes: user ID, Business ID, text, likes, date
yelp_academic_dataset_photo.json : 280,992 records
Attributes: photo ID, Business ID, text
Dataset for Assignment
In this assignment, we extracted a subset of the whole dataset containing 452,353 reviews between 30,000 users and 30,000 businesses and split it into training data (90%) and test data (10%). You can find two files in the Data/ folder: train_review.csv and test_review.csv. Each file contains three columns: user_id, business_id, and stars. We will use these two files to build and test our recommendation system.
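As an illustration of the three-column layout described above, here is a minimal sketch of a line parser in plain Python. It assumes comma-separated fields in the order user id, business id, stars, and it assumes that a header row, if one exists, has a non-numeric third field; neither detail is confirmed by this handout. The function names are hypothetical.

```python
def parse_review_line(line):
    """Parse 'user_id,business_id,stars' into (user_id, business_id, stars).

    Returns None for a header row or a malformed line. Assumption: a header,
    if present, has a third field that cannot be converted to a float.
    """
    fields = line.strip().split(",")
    if len(fields) != 3:
        return None
    try:
        return (fields[0], fields[1], float(fields[2]))
    except ValueError:
        return None


def parse_reviews(lines):
    """Keep only the lines that parse cleanly."""
    parsed = (parse_review_line(line) for line in lines)
    return [record for record in parsed if record is not None]
```

With Spark, the same per-line function could be applied through RDD operations, for example `sc.textFile(path).map(parse_review_line).filter(lambda r: r is not None)`.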
Task of Recommendation System
The task of the recommendation system is to use the records in train.csv to predict the stars for the user and business pairs in test.csv. Then, you need to use the stars in the testing data as the ground truth to evaluate the accuracy of your recommendation system.
Example: Assume train.csv contains 1 million records and test.csv contains two records: (12345, 2, 3) and (12345, 13, 4). You will use the 1 million records in train.csv to train a recommendation system. Then, given the user id 12345 and the business ids 2 and 13, your system should produce rating predictions as close as possible to 3 and 4, respectively.
1 Task1: Model-based CF Algorithms
In Task1, you are required to implement a Model-based CF recommendation system using Spark MLlib. You can learn more about Spark MLlib through this link: MLlib
You are going to predict the ratings for the testing dataset mentioned above. In your code, you can set the parameters yourself to reach better performance. You can improve your recommendation system in any way, for example in speed or accuracy.
After obtaining the rating predictions, you need to compare your results to the corresponding ground truth and compute the absolute differences. You need to divide the absolute differences into 5 levels and count the number of predictions in each level as follows:
>=0 and <1: 12345 (there are 12345 predictions with a < 1 difference from the ground truth)
>=1 and <2: 123
>=2 and <3: 1234
>=3 and <4: 1234
>=4: 12
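The level counts above can be sketched in plain Python as follows. This is only an illustration: the function name is hypothetical, and it assumes the five levels are the half-open intervals [0, 1), [1, 2), [2, 3), [3, 4), with everything at 4 or above in the last level.

```python
def count_error_levels(predictions, ground_truth):
    """Count absolute differences per level.

    Returns a list of 5 counts for |prediction - truth| in
    [0, 1), [1, 2), [2, 3), [3, 4) and [4, infinity).
    """
    counts = [0, 0, 0, 0, 0]
    for predicted, actual in zip(predictions, ground_truth):
        difference = abs(predicted - actual)
        # int() truncates toward zero, so 1.9 falls in level 1;
        # anything at 4 or above is capped into the last level.
        level = min(int(difference), 4)
        counts[level] += 1
    return counts
```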
Additionally, you need to compute the RMSE (Root Mean Squared Error) using the following formula:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i} (Pred_i - Rate_i)^2}$$
where Pred_i is the predicted stars for business i, Rate_i is the true stars for business i, and n is the total number of reviews. Read the Microsoft paper mentioned in class to learn more about how to use RMSE for evaluating your recommendation system.
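A minimal sketch of the RMSE computation in plain Python, following the definition above (square root of the mean squared difference between predicted and true stars). The function name is hypothetical, and the two lists are assumed to be aligned so that position i in both refers to the same review.

```python
import math


def rmse(predictions, ground_truth):
    """Root Mean Squared Error between two aligned lists of star ratings."""
    if not predictions or len(predictions) != len(ground_truth):
        raise ValueError("inputs must be non-empty and of equal length")
    squared_error = sum(
        (predicted - actual) ** 2
        for predicted, actual in zip(predictions, ground_truth)
    )
    # float() keeps the division exact under Python 2.7 as well.
    return math.sqrt(squared_error / float(len(predictions)))
```

For example, predictions of 3.0 and 5.0 against true ratings of 4.0 and 3.0 give squared errors of 1 and 4, so the RMSE is the square root of 2.5.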
Tip: For model-based CF, you may need to map the user ids and business ids to integers.
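One possible way to do this indexing is sketched below in plain Python. The helper names are hypothetical, and assigning integers in sorted order is just one choice that keeps the mapping reproducible between runs. Building the index from both the training and the testing ids (an assumption, not a requirement of the assignment) avoids lookups that fail for ids seen only in the test file.

```python
def build_index(ids):
    """Map each distinct id to a consecutive integer, in sorted order."""
    return dict((value, position) for position, value in enumerate(sorted(set(ids))))


def encode_records(records, user_index, business_index):
    """Replace string ids in (user, business, stars) tuples with integers."""
    return [
        (user_index[user], business_index[business], stars)
        for user, business, stars in records
    ]
```

Keep the inverse mapping (integer back to the original id) if the output file needs the original ids.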
BaseLine
After implementing the model-based CF, you can try changing the parameters of the model and observe how the RMSE of the recommendations changes. You need to find parameters that beat the baseline to get full credit. Here is the baseline for the Model-based CF:
RMSE = 1.08
2 Task2: User-based or Item-based CF Algorithm
In this part, you are required to implement a User-based CF or Item-based CF recommendation system with Spark. For details on User-based CF and Item-based CF, refer to the lecture slides or to the many tutorials on the Internet.
You are going to predict the ratings for the testing dataset mentioned above. Building on User-based or Item-based CF, you can improve your recommendation system in any way, for example in speed or accuracy (e.g., hybrid approaches). This is your chance to design the recommendation system yourself, but first you need to beat the baseline.
After obtaining the rating predictions, you need to compute the accuracy in the same way as described for Model-based CF.
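To make the idea concrete, here is a small plain-Python sketch of one common Item-based CF variant: Pearson correlation between items over their co-rating users, followed by a similarity-weighted average of the target user's ratings. It is not the required design. The function names, the choice to use only positively correlated items, and the fallback rating of 3.0 are all assumptions made for illustration, and a real solution would express these steps as Spark RDD operations.

```python
import math
from collections import defaultdict


def build_item_ratings(records):
    """Group (user, item, stars) records into {item: {user: stars}}."""
    item_ratings = defaultdict(dict)
    for user, item, stars in records:
        item_ratings[item][user] = stars
    return item_ratings


def pearson(ratings_a, ratings_b):
    """Pearson correlation of two items over the users who rated both."""
    common = set(ratings_a) & set(ratings_b)
    if len(common) < 2:
        return 0.0
    mean_a = sum(ratings_a[u] for u in common) / float(len(common))
    mean_b = sum(ratings_b[u] for u in common) / float(len(common))
    numerator = sum((ratings_a[u] - mean_a) * (ratings_b[u] - mean_b) for u in common)
    denominator = math.sqrt(sum((ratings_a[u] - mean_a) ** 2 for u in common)) * \
        math.sqrt(sum((ratings_b[u] - mean_b) ** 2 for u in common))
    return numerator / denominator if denominator > 0 else 0.0


def predict(user, item, item_ratings, default=3.0):
    """Weighted average of the user's ratings on items similar to `item`.

    Assumption: only positively correlated items contribute, and `default`
    is returned when no such item exists.
    """
    target = item_ratings.get(item, {})
    weighted_sum = 0.0
    weight_total = 0.0
    for other, ratings in item_ratings.items():
        if other == item or user not in ratings:
            continue
        weight = pearson(target, ratings)
        if weight > 0:
            weighted_sum += weight * ratings[user]
            weight_total += weight
    return weighted_sum / weight_total if weight_total > 0 else default
```

Comparing every pair of items is expensive on the full training set, so limiting the comparison to items the target user has actually rated, as `predict` does, is one way to stay within a time limit.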
Result Format
1. Save the prediction results in a text file. The results are ordered by user id and business id in ascending order.
Example Format:
user1, business2, prediction12
user1, business3, prediction13
...
usern, businessk, predictionnk
2. Print the accuracy information in the terminal, and copy these values into your description file.
>=0 and <1: 12345
>=1 and <2: 123
>=2 and <3: 1234
>=3 and <4: 1234
>=4: 12
RMSE: 1.23456789
Time: 123 sec
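For the prediction file described in item 1, a minimal plain-Python sketch of the ordering and line format might look like this. The function names are hypothetical, and the sketch assumes ids are compared as strings; if your ids are integers, they will sort numerically instead.

```python
def format_predictions(predictions):
    """Sort (user_id, business_id, prediction) tuples and render one line each.

    Ordering is by user id, then business id, both ascending.
    """
    ordered = sorted(predictions, key=lambda record: (record[0], record[1]))
    return ["%s, %s, %s" % (user, business, stars) for user, business, stars in ordered]


def save_predictions(predictions, path):
    """Write the formatted lines to `path`, one prediction per line."""
    with open(path, "w") as output:
        output.write("\n".join(format_predictions(predictions)) + "\n")
```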
Baseline & Time Threshold
As with the Model-based CF, you need to beat the baseline first to get full credit. This task also has a time threshold, so make sure that your program produces the result within it.
You can use any method (based on user-based or item-based CF) to improve the performance of your recommendation system. For example, you can find some way to refine the results from the User-based or Item-based CF, or combine the results from User-based and Item-based CF.
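As one illustration of combining results, the sketch below blends a user-based and an item-based prediction with a fixed weight and clamps the result to the 1 to 5 star range. The function name, the default weight of 0.5, and the clamping are assumptions for illustration only; whether such a blend actually lowers the RMSE has to be checked on the data.

```python
def blend_predictions(user_based, item_based, weight=0.5, low=1.0, high=5.0):
    """Weighted average of two predictions, clamped to the valid star range.

    `weight` is the share given to the user-based prediction; the remainder
    goes to the item-based prediction.
    """
    combined = weight * user_based + (1.0 - weight) * item_based
    return max(low, min(high, combined))
```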
RMSE = 1.11
Time Threshold: 450 seconds
Execution Example
The first argument passed to our program (in the execution below) is the training csv file. The second argument is the testing csv file. Below we present examples of how you can run your program with spark-submit, both when your application is a Java/Scala program and when it is a Python script.
Example of running application with spark-submit
Notice that the class argument of spark-submit specifies the main class of your application, and it is followed by the jar file of the application.
Please use ModelBasedCF, UserBasedCF, and ItemBasedCF as the class names.
Figure 2: CF: Command Line Format for Scala
Figure 3: CF: Command Line Format for python
You do not need to specify the path of the output file on the command line; you only need to save the file with the name format Firstname_Lastname_XXXXBasedCF.txt in the same path where your program runs (relative path).
Description File
Please include the following content in your description file:
1. Mention the Spark version and Python version
2. Describe how to run your program for both tasks
3. The same baseline table as mentioned in Task1, recording your accuracy and the run time of your programs in Task2
4. If you make any improvements to your recommendation system, please also describe them in your description file.