DSCI553: Recommendation System for Yelp

4.1  Task1: Min-Hash + LSH
In this task, you will implement the Min-Hash and Locality Sensitive Hashing algorithms with Jaccard similarity to find similar business pairs in the train_review.json file. We focus on 0/1 ratings rather than the actual rating values in the reviews. In other words, if a user has rated a business, the user’s contribution in the characteristic matrix is 1; otherwise, the contribution is 0 (Table 1). Your task is to identify business pairs whose Jaccard similarity is >= 0.05. 

Table 1: The left table shows the original ratings; the right table shows the converted 0 and 1 ratings. 
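As a rough illustration only, the sketch below builds this 0/1 characteristic matrix with PySpark as a mapping from each business to the set of user row indices that rated it. The field names "user_id" and "business_id" and the input path are assumptions about the review file, not part of the task description.

import json
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Load reviews; each line is assumed to be a JSON object with
# "user_id" and "business_id" fields (assumed field names).
reviews = sc.textFile("train_review.json").map(json.loads)

# Assign every distinct user a row index in the characteristic matrix.
user_index = reviews.map(lambda r: r["user_id"]).distinct() \
                    .zipWithIndex().collectAsMap()

# For each business, keep the set of user row indices that rated it:
# membership in the set means 1 in the characteristic matrix, absence means 0.
char_matrix = (reviews
               .map(lambda r: (r["business_id"], user_index[r["user_id"]]))
               .groupByKey()
               .mapValues(set))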

You can define any collection of hash functions to permute the row entries of the characteristic matrix and generate Min-Hash signatures. Some potential hash functions are:

𝑓(𝑥) = (𝑎𝑥 + 𝑏) % 𝑚
𝑓(𝑥) = ((𝑎𝑥 + 𝑏) % 𝑝) % 𝑚
where 𝑝 is any prime number; 𝑚 is the number of bins. You can define any combination for the parameters (𝑎, 𝑏, 𝑝, or 𝑚) in your implementation.
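For illustration only, here is one possible way to generate such a family of hash functions in Python; the specific values of 𝑎, 𝑏, 𝑝, and the random seed are arbitrary choices, not requirements of the assignment.

import random

def make_hash_functions(n, m, p=2147483647):    # p = 2^31 - 1, a known prime
    """Return n functions of the form f(x) = ((a*x + b) % p) % m."""
    random.seed(553)                            # arbitrary seed for repeatability
    funcs = []
    for _ in range(n):
        a = random.randint(1, p - 1)
        b = random.randint(0, p - 1)
        # Bind a and b as default arguments so each lambda keeps its own pair.
        funcs.append(lambda x, a=a, b=b: ((a * x + b) % p) % m)
    return funcs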

After you have defined all hash functions, you will build the signature matrix using Min-Hash. Then you will divide the matrix into 𝒃 bands with 𝒓 rows each, where 𝒃 × 𝒓 = 𝒏 (𝒏 is the number of hash functions). You need to set 𝒃 and 𝒓 properly to balance the number of candidates and the computational cost. Two businesses become a candidate pair if their signatures are identical in at least one band.
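A minimal sketch of these two steps, assuming the char_matrix RDD and make_hash_functions helper from the earlier sketches and example values for 𝑛, 𝒃, and 𝒓:

from collections import defaultdict
from itertools import combinations

n, b, r = 30, 15, 2                      # example choice: n hash functions, b * r = n
hash_funcs = make_hash_functions(n, m=len(user_index))

def min_hash_signature(user_rows):
    # For every hash function, take the minimum hashed row index among the
    # users who rated this business.
    return [min(h(u) for u in user_rows) for h in hash_funcs]

signatures = char_matrix.mapValues(min_hash_signature).collectAsMap()

# LSH banding: two businesses become candidates if they agree on all r rows
# of at least one band.
buckets = defaultdict(set)
for business, sig in signatures.items():
    for band in range(b):
        key = (band, tuple(sig[band * r:(band + 1) * r]))
        buckets[key].add(business)

candidates = set()
for members in buckets.values():
    if len(members) > 1:
        candidates.update(combinations(sorted(members), 2))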

Lastly, you need to verify the candidate pairs using their original Jaccard similarity. Table 2 shows an example of calculating the Jaccard similarity between two businesses. Your final outputs will be the business pairs whose Jaccard similarity is >= 0.05.

                 user1    user2    user3    user4
business1
business2

Table 2: Jaccard similarity (business1, business2) = #intersection / #union = 1/3
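Continuing the earlier sketches, the verification step could look like the following; candidates and char_matrix are the hypothetical variables defined above, and 0.05 is the threshold stated in the task.

rated_by = char_matrix.collectAsMap()          # business_id -> set of user indices

results = []
for b1, b2 in candidates:
    s1, s2 = rated_by[b1], rated_by[b2]
    sim = len(s1 & s2) / len(s1 | s2)          # #intersection / #union
    if sim >= 0.05:
        results.append({"b1": b1, "b2": b2, "sim": sim})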

4.1.2 Execution commands

Python $ spark-submit task1.py <input_file> <output_file>

Scala $ spark-submit --class task1 hw3.jar <input_file> <output_file>

<input_file>: the train review set

<output_file>: the similar business pairs and their similarities

4.1.3 Output format
You must write a business pair and its similarity in the JSON format using exactly the same tags as the example in Figure 1. Each line represents a business pair, e.g., “b1” and “b2”. For each business pair “b1” and “b2”, you do not need to generate the output for “b2” and “b1”, since the similarity value is the same as for “b1” and “b2”. You do not need to truncate decimals for the ‘sim’ values.

Figure 1: An example output for Task 1 in the JSON format
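A minimal sketch of writing the output in this line-delimited JSON format, assuming the results list of dictionaries from the verification sketch and a hypothetical output path:

import json

with open("task1.res", "w") as out_file:       # hypothetical output path
    for row in results:
        out_file.write(json.dumps(row) + "\n")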
4.2 Task2: Content-based Recommendation System

… creating a Boolean vector representing the user profile by aggregating the profiles of the items that the user has reviewed.

During the prediction process, you will estimate whether a user would prefer to review a business by computing the cosine similarity between the corresponding profile vectors. A (user, business) pair is valid if its cosine similarity is >= 0.01. You should only output these valid pairs.
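As a sketch only, if both profiles are represented as sets of feature indices (an assumption about the model layout, not a requirement), the cosine similarity check between two Boolean vectors could look like this:

import math

def cosine_similarity(user_profile, business_profile):
    """Cosine similarity of two Boolean profiles stored as sets of feature indices."""
    if not user_profile or not business_profile:
        return 0.0
    overlap = len(user_profile & business_profile)
    return overlap / (math.sqrt(len(user_profile)) * math.sqrt(len(business_profile)))

# A (user, business) pair is kept only if the similarity reaches the 0.01 threshold.
def is_valid_pair(user_profile, business_profile, threshold=0.01):
    return cosine_similarity(user_profile, business_profile) >= threshold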

4.2.2 Execution commands

Training commands:
Python $ spark-submit task2train.py <train_file> <model_file> <stopwords>

Scala $ spark-submit --class task2train hw3.jar <train_file> <model_file> <stopwords>

<train_file>: the train review set

<model_file>: the output model

<stopwords>: the file containing the stopwords that can be removed

Predicting commands:
Python $ spark-submit task2predict.py <test_file> <model_file> <output_file>

Scala $ spark-submit --class task2predict hw3.jar <test_file> <model_file> <output_file>

<test_file>: the test review set (only target pairs)

<model_file>: the model generated during the training process

<output_file>: the output results

4.2.3 Output format:
Model format:  There is no strict format requirement for the content-based model.  

Prediction format:
You must write the results in the JSON format using exactly the same tags as the example in Figure 2. Each line represents a predicted pair of (“user_id”, “business_id”). You do not need to truncate decimals for ‘sim’ values.

Figure 2: An example prediction output for Task 2 in JSON format 
4.3 Task3: Collaborative Filtering Recommendation System

During the training process, you should combine the Min-Hash and LSH algorithms in your user-based CF recommendation system, since the number of potential user pairs might be too large to compute. You need to: (1) identify candidate user pairs by their Jaccard similarity over co-rated businesses, without considering the rating scores (similar to Task 1); this step reduces the number of user pairs you need to compare for the final Pearson correlation score; (2) compute the Pearson correlation for the candidate user pairs with Jaccard similarity >= 0.01 and at least three co-rated businesses. The predicting process is similar to Case 1.
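A minimal sketch of the Pearson correlation step, assuming each user's ratings are stored in a dict mapping business_id to stars; the >= 3 co-rated-businesses rule follows the text above, everything else is illustrative:

import math

def pearson(ratings_u1, ratings_u2):
    """Pearson correlation of two users over their co-rated businesses.

    ratings_u1 / ratings_u2: dicts mapping business_id -> star rating.
    Returns None when the pair has fewer than three co-rated businesses
    or the correlation is undefined.
    """
    co_rated = set(ratings_u1) & set(ratings_u2)
    if len(co_rated) < 3:
        return None
    avg1 = sum(ratings_u1[b] for b in co_rated) / len(co_rated)
    avg2 = sum(ratings_u2[b] for b in co_rated) / len(co_rated)
    num = sum((ratings_u1[b] - avg1) * (ratings_u2[b] - avg2) for b in co_rated)
    den = (math.sqrt(sum((ratings_u1[b] - avg1) ** 2 for b in co_rated)) *
           math.sqrt(sum((ratings_u2[b] - avg2) ** 2 for b in co_rated)))
    return num / den if den != 0 else None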

4.3.2 Execution commands

Training commands:
Python $ spark-submit task3train.py <train_file> <model_file> <cf_type>

Scala $ spark-submit --class task3train hw3.jar <train_file> <model_file> <cf_type>

<train_file>: the train review set

<model_file>: the output model

<cf_type>: either “item_based” or “user_based”

Predicting commands:

Python $ spark-submit task3predict.py <train_file> <test_file> <model_file> <output_file> <cf_type>

Scala $ spark-submit --class task3predict hw3.jar <train_file> <test_file> <model_file> <output_file> <cf_type>

<train_file>: the train review set

<test_file>: the test review set (only target pairs)

<model_file>: the model generated during the training process

<output_file>: the output results

<cf_type>: either “item_based” or “user_based”

4.3.3 Output format:

Model format:
You must write the model in the JSON format using exactly the same tags as the example in Figure 3. Each line represents a business pair (“b1”, “b2”) for the item-based model (Figure 3a) or a user pair (“u1”, “u2”) for the user-based model (Figure 3b). There is no need to include (“b2”, “b1”) or (“u2”, “u1”). You do not need to truncate decimals for ‘sim’ values.
Figure 3: (a) is an example of the item-based model and (b) is an example of the user-based model

Prediction format:
You must write a target pair and its prediction in the JSON format using exactly the same tags as the example in Figure 4. Each line represents a predicted pair of (“user_id”, “business_id”). You do not need to truncate decimals for ‘stars’ values.

Figure 4: An example output for task3 in JSON format
In addition, we will compare your prediction results against the ground truth in both the test and blind datasets. You should output ONLY the predictions generated from the model. We then use the RMSE (Root Mean Squared Error) defined in the equation below to evaluate the performance. For the pairs that your model cannot predict (e.g., due to the cold-start problem or too few co-rated users), we will predict them with the business average stars for the item-based model and the user average stars for the user-based model. We provide two files containing the average stars for users and businesses in the training dataset, respectively. The value of the “UNK” tag, which can be used for predicting new businesses and users, is the average stars over all reviews.

𝑅𝑀𝑆𝐸 = √((1/𝑛) · Σᵢ (𝑃𝑟𝑒𝑑ᵢ − 𝑅𝑎𝑡𝑒ᵢ)²)

where 𝑃𝑟𝑒𝑑ᵢ is the prediction for business 𝑖, 𝑅𝑎𝑡𝑒ᵢ is the true rating for business 𝑖, and 𝑛 is the total number of (user, business) pairs.
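A minimal sketch of this evaluation for the item-based case, assuming the predictions, ground truth, and the provided average-stars file have been loaded into plain dicts (hypothetical structures, not the graders' actual script):

import math

def rmse(predictions, ground_truth, avg_stars):
    """RMSE with the fallback rule described above (item-based variant).

    predictions:  dict {(user_id, business_id): predicted stars from the model}
    ground_truth: dict {(user_id, business_id): true stars}
    avg_stars:    dict {business_id: average stars}, with an "UNK" entry for
                  businesses unseen in training.
    """
    total = 0.0
    for (user, business), rate in ground_truth.items():
        pred = predictions.get((user, business))
        if pred is None:                       # pair the model could not predict
            pred = avg_stars.get(business, avg_stars["UNK"])
        total += (pred - rate) ** 2
    return math.sqrt(total / len(ground_truth))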

The execution time of the training process on Vocareum should be less than 600 seconds. The execution time of the predicting process on Vocareum should be less than 100 seconds. The RMSE for the item-based model should be <= 0.91 in both the test and blind datasets, and the RMSE for the user-based model should be <= 1.01 in both datasets. If the performance on only one of the two datasets reaches the threshold, you will obtain 1 point.
