DSCI 553 Foundations and Applications of Data Mining: Assignment 3

1.   Overview of the Assignment
In Assignment 3, you will complete two tasks. The goal is to familiarize you with Locality Sensitive Hashing (LSH) and different types of collaborative-filtering recommendation systems. The dataset you are going to use is a subset of the Yelp dataset used in the previous assignments.


3.   Yelp Data  
In this assignment, the datasets you are going to use are from:

https://drive.google.com/drive/folders/1SufecRrgj1yWMOVdERmBBUnqz0EX7ARQ?usp=sharing

We generated the following two datasets from the original Yelp review dataset with some filters. We randomly took 60% of the data as the training dataset, 20% of the data as the validation dataset, and 20% of the data as the testing dataset.

a.  yelp_train.csv: the training data, which only includes the columns: user_id, business_id, and stars.

b.  yelp_val.csv: the validation data, which are in the same format as training data.

c.   We are not sharing the test dataset.

d.  other datasets: files providing additional information (such as the average stars or location of a business).

 

4.   Tasks
4.1  Task1: Jaccard-based LSH
In this task, you will implement the Locality Sensitive Hashing algorithm with Jaccard similarity using yelp_train.csv.

In this task, we focus on the “0 or 1” ratings rather than the actual ratings/stars from the users. Specifically, if a user has rated a business, the user’s contribution in the characteristic matrix is 1; if the user hasn’t rated the business, the contribution is 0. You need to identify similar businesses whose similarity is >= 0.5.

You can define any collection of hash functions that you think would result in a consistent permutation of the row entries of the characteristic matrix. Some potential hash functions are:

f(x) = (ax + b) % m      or      f(x) = ((ax + b) % p) % m

where p is any prime number and m is the number of bins. Please carefully design your hash functions.
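For illustration only, one way to generate such a family in Python is to draw random coefficients a and b for each hash function; the helper name make_hash_family and the specific values of n, m, and p below are assumptions, not required settings.

import random

def make_hash_family(n, m, p=1000000007, seed=42):
    # Returns n hash functions of the form f(x) = ((a*x + b) % p) % m.
    # n: number of hash functions, m: number of bins (e.g., the number of users),
    # p: a prime larger than m. All three are placeholders you should tune yourself.
    rng = random.Random(seed)
    params = [(rng.randint(1, p - 1), rng.randint(0, p - 1)) for _ in range(n)]
    return [lambda x, a=a, b=b: ((a * x + b) % p) % m for a, b in params]

# Example: 50 hash functions over 11270 bins (both numbers are illustrative).
hash_funcs = make_hash_family(n=50, m=11270)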

After you have defined all the hash functions, you will build the signature matrix. Then you will divide the matrix into b bands with r rows each, where b x r = n (n is the number of hash functions). You should carefully select a good combination of b and r in your implementation (b > 1 and r > 1). Remember that two items are a candidate pair if their signatures are identical in at least one band.

Your final results will be the candidate pairs whose original Jaccard similarity is >= 0.5. You need to write the final results into a CSV file according to the output format below.
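To make the whole flow concrete, here is a minimal, non-Spark sketch of the pipeline, assuming business_users maps each business_id to the set of user row indices that rated it, hash_funcs is a family like the one above, and b = 25, r = 2 (so n = 50); none of these choices are prescribed.

from itertools import combinations
from collections import defaultdict

def minhash_signature(user_indices, hash_funcs):
    # Signature of one business: the minimum hash value per hash function.
    return [min(h(u) for u in user_indices) for h in hash_funcs]

def lsh_candidates(signatures, b, r):
    # Businesses whose signatures agree in at least one band become candidate pairs.
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for biz, sig in signatures.items():
            buckets[tuple(sig[band * r:(band + 1) * r])].append(biz)
        for bucket in buckets.values():
            candidates.update(combinations(sorted(bucket), 2))
    return candidates

def jaccard(s1, s2):
    return len(s1 & s2) / len(s1 | s2)

def find_similar(business_users, hash_funcs, b=25, r=2, threshold=0.5):
    # Assumes len(hash_funcs) == b * r.
    signatures = {biz: minhash_signature(users, hash_funcs)
                  for biz, users in business_users.items()}
    results = []
    for b1, b2 in lsh_candidates(signatures, b, r):
        # Verify each candidate against the ORIGINAL characteristic matrix.
        sim = jaccard(business_users[b1], business_users[b2])
        if sim >= threshold:
            results.append((b1, b2, sim))
    return results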

Example of Jaccard Similarity:

              user1   user2   user3   user4
business1       0       1       1       1
business2       0       1       0       0

Jaccard Similarity (business1, business2) = #intersection / #union = 1/3
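Using the sets of raters from the example above (the concrete 0/1 values are only illustrative), the similarity can be computed directly:

business1 = {"user2", "user3", "user4"}   # users who rated business1
business2 = {"user2"}                     # users who rated business2
similarity = len(business1 & business2) / len(business1 | business2)
print(similarity)   # 1/3 = 0.333...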

Input format: (we will use the following command to execute your code) 

./spark-submit task1.py <input_file_name> <output_file_name>

Param: input_file_name: the name of the input file (yelp_train.csv), including the file path.
Param: output_file_name: the name of the output CSV file, including the file path.

 

Output format: 

IMPORTANT: Please strictly follow the output format since your code will be graded automatically. We will not regrade because of formatting issues. 

a. The output file is a CSV file containing all the business pairs you have found. The header is “business_id_1, business_id_2, similarity”. Each pair itself must be in alphabetical order, and the entire file also needs to be in alphabetical order. There is no requirement for the number of decimals for the similarity value. Please refer to the format in Figure 2.

  

Figure 2: a CSV output example for task1
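A minimal sketch of producing that ordering and writing the file, assuming results is a list of (business_id_1, business_id_2, similarity) tuples with each pair already in alphabetical order, and that the output path is taken from the command line as specified above:

import csv
import sys

def write_results(results, output_file_name):
    # Sorting by both ids keeps every pair and the entire file in alphabetical order.
    results = sorted(results, key=lambda row: (row[0], row[1]))
    with open(output_file_name, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["business_id_1", "business_id_2", "similarity"])
        writer.writerows(results)

if __name__ == "__main__":
    write_results([("abc", "xyz", 0.5)], sys.argv[2])   # placeholder data; argv[2] is <output_file_name>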

 

4.2  Task2: Recommendation systems

In this task, you will build different types of recommendation systems using yelp_train.csv to predict the stars for given user and business pairs, in the three cases below.

4.2.1. Item-based CF recommendation system

Please strictly follow the slides to implement an item-based recommendation system with Pearson similarity.
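As a rough, non-Spark sketch of the core computation only (Pearson similarity over co-rated users, then a weighted average over the most similar items): the data structure item_ratings (business_id -> {user_id: stars}), the neighbor count, and the fallback value are assumptions, and the exact formulation should follow the slides.

from math import sqrt

def pearson_similarity(ratings_i, ratings_j):
    # Pearson correlation between two items over their co-rated users only.
    # ratings_i, ratings_j: dicts mapping user_id -> stars.
    co_users = set(ratings_i) & set(ratings_j)
    if not co_users:
        return 0.0
    avg_i = sum(ratings_i[u] for u in co_users) / len(co_users)
    avg_j = sum(ratings_j[u] for u in co_users) / len(co_users)
    num = sum((ratings_i[u] - avg_i) * (ratings_j[u] - avg_j) for u in co_users)
    den = (sqrt(sum((ratings_i[u] - avg_i) ** 2 for u in co_users)) *
           sqrt(sum((ratings_j[u] - avg_j) ** 2 for u in co_users)))
    return num / den if den else 0.0

def predict(user_id, target_item, item_ratings, n_neighbors=15):
    # Weighted average of the user's ratings on the most similar co-rated items.
    weights = []
    for other, ratings in item_ratings.items():
        if other != target_item and user_id in ratings:
            w = pearson_similarity(item_ratings[target_item], ratings)
            if w > 0:
                weights.append((w, ratings[user_id]))
    top = sorted(weights, reverse=True)[:n_neighbors]   # n_neighbors is illustrative
    if not top:
        return 3.5   # fallback when no usable neighbors exist (assumption)
    return sum(w * r for w, r in top) / sum(w for w, _ in top)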

4.2.2. Model-based recommendation system 

You need to use XGBRegressor (a regressor based on decision trees) from the xgboost package to train a model, using this API: https://xgboost.readthedocs.io/en/latest/python/python_api.html.

Please choose your own features from the provided extra datasets; thinking from the customer's perspective can help. For example, the average stars given by a user and the number of reviews most likely influence the prediction result. You need to select other features and train a model based on them. Use the validation dataset to validate your result, and remember not to include it in your training data.
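A rough sketch of the model-based part, assuming you have already joined yelp_train.csv and yelp_val.csv with features from the extra datasets into two CSV files; the file names, the feature columns, and the hyperparameters below are all hypothetical, not a required setup.

import pandas as pd
from xgboost import XGBRegressor

# Hypothetical feature columns built from the extra datasets (see the feature list below).
FEATURES = ["user_avg_stars", "user_review_count", "bus_avg_stars", "bus_review_count"]

train_df = pd.read_csv("train_features.csv")   # assumed pre-built from yelp_train.csv
val_df = pd.read_csv("val_features.csv")       # built from yelp_val.csv, never used for training

model = XGBRegressor(n_estimators=300, max_depth=6, learning_rate=0.1)  # illustrative settings
model.fit(train_df[FEATURES], train_df["stars"])

# Validate on the held-out validation set, e.g., with RMSE.
pred = model.predict(val_df[FEATURES])
rmse = ((pred - val_df["stars"]) ** 2).mean() ** 0.5
print("validation RMSE:", rmse)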

4.2.3. Hybrid recommendation system. 

Now that you have the results from the previous models, you will need to choose a way from the slides to combine them and design a better hybrid recommendation system.

Here are two examples of hybrid systems:

Example1: 

You can combine them as a weighted average, which means:

final score = α × score_item_based + (1 − α) × score_model_based

The key idea is: the CF focuses on the neighbors of the item, while the model-based RS focuses on the users and the items themselves. Specifically, if the item has a smaller number of neighbors, then the weight of the CF should be smaller. Meanwhile, if two restaurants are both 4 stars but the first one has 10 reviews and the second one has 1,000 reviews, the average star rating of the second one is more trustworthy, so the model-based RS score should weigh more. You may need to find other features to generate your own weight function to combine them.
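One possible realization of Example 1 is sketched below; the specific weight function (growing α with the number of CF neighbors and shrinking it when the business has many reviews) is only an illustration of the idea above, not a prescribed formula.

def hybrid_score(score_item_based, score_model_based, n_neighbors, n_business_reviews):
    # final = alpha * item_based + (1 - alpha) * model_based
    # alpha grows with the number of CF neighbors and shrinks when the business
    # has many reviews, since then the model-based average is more trustworthy.
    alpha = 0.3 * min(n_neighbors, 50) / 50      # illustrative weighting only
    if n_business_reviews > 100:                 # illustrative threshold
        alpha *= 0.5
    return alpha * score_item_based + (1 - alpha) * score_model_based

print(hybrid_score(score_item_based=4.2, score_model_based=3.8,
                   n_neighbors=10, n_business_reviews=1000))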

Example2: 

You can combine them together as a classification problem:

Again, the key idea is: the CF focuses on the neighbors of the item and the model-based RS focuses on the user and items themselves. As a result, in our dataset, some item-user pairs are more suitable for the CF while the others are not. You need to choose some features to classify which model you should choose for each item-user pair.

If you train a classifier, you are allowed to upload the pre-trained classifier model named “model.md” to save running time on Vocareum. You can use the pickle library, the joblib library, or others if you want. Here is an example: https://scikit-learn.org/stable/modules/model_persistence.html. You also need to upload the training script named “train.py” so that we can verify your model.
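If you go the classifier route, saving and reloading the pre-trained model could look like the sketch below; the file name model.md comes from the requirement above, pickle is one of the allowed options, and the two-feature training data is purely hypothetical.

import pickle
from sklearn.tree import DecisionTreeClassifier

# --- In train.py: fit the classifier, then persist it as model.md ---
X_train = [[5, 10], [200, 1000]]   # hypothetical features per user-business pair
y_train = [0, 1]                   # 0 = use the item-based CF, 1 = use the model-based RS
clf = DecisionTreeClassifier().fit(X_train, y_train)
with open("model.md", "wb") as f:
    pickle.dump(clf, f)

# --- In task2_3.py: reload the pre-trained model to save running time on Vocareum ---
with open("model.md", "rb") as f:
    clf = pickle.load(f)
print(clf.predict([[8, 20]]))      # decide which model to trust for this pair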

Some possible features (other features may also work; a sketch of assembling such features follows this list):

-Average stars of a user, average stars of a business, and the variance of the review history of a user or a business.

-Number of reviews of a user or a business.

-Yelp account starting date, number of fans.

-The number of people who think a user's review is useful/funny/cool, and the number of compliments. (Be careful with these features. For example, sometimes when I visit a horrible restaurant, I will give full stars because I hope I am not the only one who wasted money and time here. Sometimes people are satirical. :-))
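As referenced above, here is a sketch of joining the training pairs with per-user and per-business features; the file names user.json and business.json and the specific fields are assumptions based on the standard Yelp dataset, so check what the provided folder actually contains.

import json
import pandas as pd

def load_json_lines(path, fields):
    # Each line of the Yelp extra files is one JSON object; keep only the listed fields.
    rows = []
    with open(path) as f:
        for line in f:
            obj = json.loads(line)
            rows.append({k: obj.get(k) for k in fields})
    return pd.DataFrame(rows)

users = load_json_lines("user.json", ["user_id", "average_stars", "review_count", "fans"])
businesses = load_json_lines("business.json", ["business_id", "stars", "review_count"])

train = pd.read_csv("yelp_train.csv")                    # user_id, business_id, stars
features = (train.rename(columns={"stars": "rating"})    # keep the label separate
                 .merge(users, on="user_id", how="left")
                 .merge(businesses, on="business_id", how="left",
                        suffixes=("_user", "_business")))
print(features.head())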

 

Input format: (we will use the following commands to execute your code)

Case1:

./spark-submit task2_1.py <train_file_name> <test_file_name> <output_file_name>

Param: train_file_name: the name of the training file (e.g., yelp_train.csv), including the file path  

Param: test_file_name: the name of the testing file (e.g., yelp_val.csv), including the file path
Param: output_file_name: the name of the prediction result file, including the file path

Case2:

./spark-submit task2_2.py <folder_path> <test_file_name> <output_file_name>

Param: folder_path: the path of the dataset folder, which contains exactly the same files as the Google Drive folder.

Param: test_file_name: the name of the testing file (e.g., yelp_val.csv), including the file path
Param: output_file_name: the name of the prediction result file, including the file path

Case3:

./spark-submit task2_3.py <folder_path> <test_file_name> <output_file_name>

Param: folder_path: the path of the dataset folder, which contains exactly the same files as the Google Drive folder.

Param: test_file_name: the name of the testing file (e.g., yelp_val.csv), including the file path  

Param: output_file_name: the name of the prediction result file, including the file path

 

 

Output format: 

a. The output file is a CSV file containing all the prediction results for each user and business pair in the validation/testing data. The header is “user_id, business_id, prediction”. There is no requirement for the order in this task. There is no requirement for the number of decimals for the prediction values. Please refer to the format in Figure 3.

  

Figure 3: Output example in CSV for task2
