DSCI553-Recommendation System Optimization Solved

1. Overview of the Assignment
In this competition project, you need to significantly improve the performance of your recommendation system in Assignment 3. You can use any method (like the hybrid recommendation systems) to improve the prediction accuracy and efficiency.
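
As one illustration of what a hybrid system could look like, the sketch below blends two rating predictions with a fixed weight. The function name `hybrid_prediction`, the two input predictions, and the default weight of 0.5 are assumptions made for this example; the assignment does not prescribe any particular blending scheme.

```python
def hybrid_prediction(cf_pred, model_pred, weight=0.5):
    """Blend two rating predictions; `weight` is the share given to cf_pred.

    Hypothetical helper: cf_pred could come from collaborative filtering and
    model_pred from a feature-based model, but any two predictors would do.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be in [0, 1]")
    blended = weight * cf_pred + (1.0 - weight) * model_pred
    # Yelp star ratings run from 1 to 5, so clamp the blend to that range.
    return min(5.0, max(1.0, blended))
```

In practice the weight would be tuned on the validation data rather than fixed in advance.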

 

2. Competition Requirements
2.1 Programming Language and Library Requirements 

a.  You must use Python to implement the competition project. You can use external Python libraries as long as they are available on Vocareum.

 

b.  You must use only Spark RDDs, so that you understand Spark operations. You will not receive any points if you use Spark DataFrame or DataSet.

 

2.2  Programming Environment 

Python 3.6.4, Scala 2.11, JDK 1.8 and Spark 2.4.4 

We will use these library versions to compile and test your code. There will be a 20% penalty if we cannot run your code because of a library version inconsistency.

 

2.3  Write your own code 

Do not share your code with other students!! 

We will combine all the code we can find from the Web (e.g., GitHub) as well as other students’ code from this and other (previous) sections for plagiarism detection. We will report all the detected plagiarism.

 

3. Yelp Data  
In this competition, the datasets you are going to use are from:

https://drive.google.com/drive/folders/1SIlY40owpVcGXJw3xeXk76afCwtSUx11?usp=sharing 

We generated the following two datasets from the original Yelp review dataset with some filters. We randomly took 60% of the data as the training dataset, 20% of the data as the validation dataset, and 20% of the data as the testing dataset.

A.  yelp_train.csv: the training data, which includes only the columns user_id, business_id, and stars.

B.  yelp_val.csv: the validation data, which are in the same format as training data.

C.  We are not sharing the test dataset.

D.  other datasets: these provide additional information (such as the average star rating or the location of a business)

a.  review_train.json: review data only for the training pairs (user, business)  

b.  user.json: all user metadata  

c.   business.json: all business metadata, including locations, attributes, and categories  

d.  checkin.json: user check-ins for individual businesses  

e.  tip.json: tips (short reviews) written by a user about a business  

f.    photo.json: photo data, including captions and classifications  
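
A minimal sketch of loading yelp_train.csv (or yelp_val.csv) with Spark RDD operations only is shown below. The helper names `parse_rating_line` and `load_ratings` are hypothetical, and the code assumes that the file begins with a header row naming the columns and that no field contains a comma; check both against the actual files.

```python
def parse_rating_line(line):
    """Turn 'user_id,business_id,stars' into (user_id, business_id, stars)."""
    user_id, business_id, stars = line.strip().split(",")
    return user_id, business_id, float(stars)


def load_ratings(sc, path):
    """Load a ratings CSV as an RDD of (user_id, business_id, stars) tuples.

    `sc` is an existing pyspark SparkContext created by the caller.
    """
    lines = sc.textFile(path)
    # Assumption: the first row is a header that starts with "user_id".
    data_lines = lines.filter(lambda line: not line.startswith("user_id"))
    return data_lines.map(parse_rating_line)
```

Keeping the parsing in a plain function makes it easy to test without starting Spark.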

 

4. Task 
In the competition, you need to significantly improve the performance of your recommendation system in Assignment 3. You can mine interesting and useful information from the datasets provided in the Google Drive folder to support your recommendation system.  

 

You must make a significant improvement to your recommendation system in terms of accuracy. You can utilize the validation dataset (yelp_val.csv) to evaluate the accuracy of your recommendation system. There are two options to evaluate your recommendation system:

(1)  Error Distribution: You can compare your results to the corresponding ground truth and compute the absolute differences. You can divide the absolute differences into 5 levels and count the number of predictions in each level, as follows:

>=0 and <1: 12345

>=1 and <2: 123

>=2 and <3: 1234

>=3 and <4: 1234

>=4: 12

This means that there are 12345 predictions with < 1 difference from the ground truth. This way you will be able to know the error distribution of your predictions and to improve the performance of your recommendation systems.
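
The five-level count described above could be computed as in the following sketch. The function name `error_distribution` and the use of plain Python lists for the predictions and ground truth are assumptions made for illustration.

```python
def error_distribution(predictions, truths):
    """Count absolute prediction errors in the five levels listed above."""
    labels = [">=0 and <1", ">=1 and <2", ">=2 and <3", ">=3 and <4", ">=4"]
    counts = [0] * len(labels)
    for pred, truth in zip(predictions, truths):
        diff = abs(pred - truth)
        # int() truncates toward zero, so 0.5 -> level 0 and 2.0 -> level 2;
        # anything of 4 or more falls into the last level.
        counts[min(int(diff), 4)] += 1
    return dict(zip(labels, counts))
```

For example, `error_distribution([4.5, 3.0], [4.0, 5.0])` puts one prediction in the first level and one in the third.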

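(2)  RMSE Error: You can compute the RMSE (Root Mean Squared Error) by using the following formula:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\mathrm{Pred}_i - \mathrm{Rate}_i\right)^2}$$

where $\mathrm{Pred}_i$ is the prediction for business $i$ and $\mathrm{Rate}_i$ is the true rating for business $i$. $n$ is the total number of businesses you are predicting.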
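
As a sketch, the RMSE (the square root of the mean of the squared differences between predictions and true ratings) can be computed as below. The function name `rmse` and the list inputs are assumptions made for this example.

```python
import math


def rmse(predictions, truths):
    """Root mean squared error between predictions and true ratings."""
    if len(predictions) != len(truths) or not predictions:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_error = sum((p - t) ** 2 for p, t in zip(predictions, truths))
    return math.sqrt(squared_error / len(predictions))
```

Here `rmse([3.0, 5.0], [4.0, 4.0])` returns 1.0, because both predictions are off by exactly one star.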

 

Input format: (we will use the following commands to execute your code) 

./spark-submit competition.py <folder_path> <test_file_name> <output_file_name>

Param: folder_path: the path of the dataset folder, which contains exactly the same files as the Google Drive folder

Param: test_file_name: the name of the testing file (e.g., yelp_val.csv), including the file path

Param: output_file_name: the name of the prediction result file, including the file path
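
One way to read these three parameters is sketched below. The function name `parse_args` and the returned dictionary keys are hypothetical; the training file location follows from the statement that the folder holds the same files as the Google Drive folder.

```python
import os


def parse_args(argv):
    """Split the command line into the three paths the program needs.

    `argv` is expected to look like sys.argv:
    [script, folder_path, test_file_name, output_file_name].
    """
    if len(argv) != 4:
        raise SystemExit(
            "usage: competition.py <folder_path> <test_file_name> <output_file_name>"
        )
    folder_path, test_file_name, output_file_name = argv[1:4]
    return {
        # yelp_train.csv is assumed to sit directly inside folder_path.
        "train_file": os.path.join(folder_path, "yelp_train.csv"),
        "test_file": test_file_name,
        "output_file": output_file_name,
    }
```

A script would typically call `parse_args(sys.argv)` once at start-up, after importing `sys`.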

 

Output format: 

a.  The output file is a CSV file, containing all the prediction results for each user and business pair in the validation/testing data. The header is “user_id, business_id, prediction”. There is no requirement for the order in this task. There is no requirement for the number of decimals for the prediction values. Please refer to the format in Figure 1. 

  

Figure 1: Output example in CSV  

b.  You also need to write comments that include the description of your method (less than 300 words) in the first part of your program. The description should explain the models you are using, especially the way you improve the accuracy or efficiency of the system. We look forward to seeing creative methods. Please also report the error distribution, RMSE, and the total execution time on the validation dataset in the description. Figure 2 shows an example of the description file. If the comments are not included or the comments are not informative, there will be a one-point penalty. 

  

Figure 2: An example of description file
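
For the output file described in item a, a minimal writer could look like the sketch below. The function name `write_predictions` is hypothetical, and the header is written without spaces after the commas, which is an assumption; compare it with Figure 1 before relying on it.

```python
import csv


def write_predictions(output_file_name, rows):
    """Write (user_id, business_id, prediction) tuples to a CSV file."""
    with open(output_file_name, "w", newline="") as f:
        # Use "\n" line endings instead of the csv module default "\r\n".
        writer = csv.writer(f, lineterminator="\n")
        # Assumption: header has no spaces after the commas.
        writer.writerow(["user_id", "business_id", "prediction"])
        for user_id, business_id, prediction in rows:
            writer.writerow([user_id, business_id, prediction])
```

If the predictions live in an RDD, they would need to be collected to the driver (for example with `collect()`) before being passed in as `rows`.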
