DSCI553 Market Basket Analysis

2.2  Programming Environment 
Python 3.6, JDK 1.8, Scala 2.11, and Spark 2.4.4

We will use these library versions to compile and test your code. You will receive no points if we cannot run your code on Vocareum.

TAs will compare your code against code found on the web (e.g., GitHub) as well as other students' code from this and other (previous) sections for plagiarism detection. We will report all detected plagiarism.

3. Datasets  
In this assignment, you will use one simulated dataset and one real-world dataset. In Task 1, you will build and test your program with a small simulated CSV file that has been provided to you.

Then, in Task 2, you need to generate a subset of the Ta Feng dataset (https://bit.ly/2miWqFS) with a structure similar to the simulated data.

Figure 1 shows the file structure of the Task 1 simulated CSV: the first column is user_id and the second column is business_id.
                                               
Figure 1: Input Data Format  

4. Tasks

You will implement the SON Algorithm on top of the Spark Framework to solve both tasks (Task 1 and Task 2). You need to find all possible combinations of frequent itemsets in any given input file within the required time. You can refer to Chapter 6 of the Mining of Massive Datasets book, in particular Section 6.4 – Limited-Pass Algorithms. (Hint: you can choose either the A-Priori, MultiHash, or PCY algorithm to process each chunk of the data.)
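To make the two passes concrete, here is a minimal PySpark sketch of SON over an RDD of baskets. It is only an illustration under assumptions: apriori_on_chunk is a hypothetical helper that runs A-Priori (or PCY/MultiHash) on one in-memory chunk with a proportionally scaled threshold, and each basket is a set of string ids.

from pyspark import SparkContext

def son_frequent_itemsets(baskets_rdd, support, apriori_on_chunk):
    total = baskets_rdd.count()

    # Pass 1: mine candidates in each partition with the support threshold
    # scaled down to that partition's share of the baskets.
    def pass1(partition):
        chunk = list(partition)
        local_support = support * len(chunk) / total
        return apriori_on_chunk(chunk, local_support)   # iterable of tuples

    candidates = baskets_rdd.mapPartitions(pass1).distinct().collect()

    # Pass 2: count every candidate over the full data and keep those that
    # meet the global support threshold.
    def pass2(partition):
        counts = {}
        for basket in partition:
            basket = set(basket)
            for cand in candidates:
                if set(cand).issubset(basket):
                    counts[cand] = counts.get(cand, 0) + 1
        return counts.items()

    frequent = (baskets_rdd.mapPartitions(pass2)
                           .reduceByKey(lambda a, b: a + b)
                           .filter(lambda kv: kv[1] >= support)
                           .keys()
                           .collect())
    return candidates, frequent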

4.1 Task 1: Simulated data 
There are two CSV files (small1.csv and small2.csv) on Blackboard. small1.csv is just a test file that you can use to debug your code. For Task 1, we will only test your code on small2.csv.

In this task, you need to build two kinds of market-basket models.  

Case 1
You will calculate the combinations of frequent businesses (as singletons, pairs, triples, etc.) that qualify as frequent given a support threshold. You need to create a basket for each user containing the business ids reviewed by that user. If a business was reviewed more than once by the same reviewer, we consider it rated only once; more specifically, the business ids within each basket are unique. The generated baskets are similar to:

user1: [business11, business12, business13, ...]
user2: [business21, business22, business23, ...]
user3: [business31, business32, business33, ...]

Case 2
You will calculate the combinations of frequent users (as singletons, pairs, triples, etc.) that qualify as frequent given a support threshold. You need to create a basket for each business containing the user ids that commented on that business. Similar to Case 1, the user ids within each basket are unique. The generated baskets are similar to:

business1: [user11, user12, user13, ...]
business2: [user21, user22, user23, ...]
business3: [user31, user32, user33, ...]
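As an illustration, here is a minimal PySpark sketch of building the Case 1 and Case 2 baskets, assuming the input CSV has a header row and two string columns, user_id and business_id (the argument positions follow the execution example further below):

import sys
from pyspark import SparkContext

case_number = int(sys.argv[1])        # 1 or 2
input_path = sys.argv[3]

sc = SparkContext.getOrCreate()
lines = sc.textFile(input_path)
header = lines.first()
pairs = (lines.filter(lambda line: line != header)
              .map(lambda line: line.split(","))
              .map(lambda cols: (cols[0], cols[1])))    # (user_id, business_id)

if case_number == 1:
    # Case 1: one basket per user, holding the distinct business ids it reviewed.
    baskets = pairs.groupByKey().mapValues(set).values()
else:
    # Case 2: one basket per business, holding the distinct user ids that reviewed it.
    baskets = (pairs.map(lambda p: (p[1], p[0]))
                    .groupByKey().mapValues(set).values())

The resulting baskets RDD is what the SON passes sketched above operate on.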

Input format:

1.  Case number: Integer that specifies the case: 1 for Case 1 and 2 for Case 2.

2.  Support: Integer that defines the minimum count to qualify as a frequent itemset.

3.  Input file path: This is the path to the input file including path, file name and extension. 

4.  Output file path: This is the path to the output file including path, file name and extension.

Output format:

1.  Runtime: the total execution time from loading the file until finishing writing the output file. You need to print the runtime in the console with the “Duration” tag, e.g., “Duration: 100”.

2.  Output file:  

(1)  Intermediate result 
You should use “Candidates:” as the tag. For each line, you should output the candidate frequent itemsets found after the first pass of the SON Algorithm, followed by an empty line after each combination. The printed itemsets must be sorted in lexicographical order (both user_id and business_id are of type string).

(2)  Final result 
You should use “Frequent Itemsets:” as the tag. For each line, you should output the final frequent itemsets found after finishing the SON Algorithm. The format is the same as the intermediate results, and the printed itemsets must be sorted in lexicographical order. (A sketch of producing this output, together with the Duration message, is given after the example below.)

Here is an example of the output file:  

Both the intermediate results and final results should be saved in ONE output result file. 
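As an illustration of the required output and the Duration message, here is a minimal sketch; candidates, frequent, and output_path are placeholders for the SON results and the output-file argument, and the exact itemset formatting should match the example output file above.

import time

def format_itemsets(itemsets):
    # Group itemsets by size, sort each size group lexicographically,
    # and separate the size groups with an empty line.
    by_size = {}
    for items in itemsets:
        by_size.setdefault(len(items), []).append(tuple(sorted(items)))
    blocks = []
    for size in sorted(by_size):
        group = sorted(by_size[size])
        if size == 1:
            # Singletons are commonly printed as ('id') rather than ('id',)
            blocks.append(",".join("('{}')".format(t[0]) for t in group))
        else:
            blocks.append(",".join(str(t) for t in group))
    return "\n\n".join(blocks)

def write_results(output_path, candidates, frequent):
    # Both sections go into ONE output file, each under its tag.
    with open(output_path, "w") as f:
        f.write("Candidates:\n" + format_itemsets(candidates) + "\n\n")
        f.write("Frequent Itemsets:\n" + format_itemsets(frequent) + "\n")

start = time.time()
# ... load the data, run the SON passes, then call:
# write_results(output_path, candidates, frequent)
print("Duration: {}".format(int(time.time() - start)))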

Execution example:  
Python: spark-submit task1.py <case number> <support> <input_file_path> <output_file_path>
Scala: spark-submit --class task1 hw2.jar <case number> <support> <input_file_path> <output_file_path>

 ​4.2 Task 2: Ta Feng data  
In Task 2, you will explore the Ta Feng dataset to find the frequent itemsets (only Case 1). You will use the data found on Kaggle (https://bit.ly/2miWqFS) to find product IDs associated with a given customer ID each day. Aggregate all purchases a customer makes within a day into one basket. In other words, assume a customer purchases all of that day's items at once.

N.B.: Be careful when reading the CSV file, as Spark can read the product id numbers with leading zeros. You can manually format Column F (PRODUCT_ID) to numbers (with zero decimal places) in the CSV file before reading it with Spark.

SON Algorithm on Ta Feng data:

You will create a data pipeline where the input is the raw Ta Feng data, and the output is the file described under “Output file”. You will pre-process the data, and then from this pre-processed data you will create the final output. Your code is allowed to output this pre-processed data during execution, but you should NOT submit homework that includes this pre-processed data.

(1) Data preprocessing  

You need to generate a dataset from the Ta Feng dataset with the following steps:

1.  Find the date of the purchase (column TRANSACTION_DT), such as December 1, 2000 (12/1/00)  

2.  For each date, select “CUSTOMER_ID” and “PRODUCT_ID”.

3.  We want to consider all items bought by a consumer on a given day as a separate transaction (i.e., a “basket”). For example, if consumers 1, 2, and 3 each bought oranges on December 2, 2000, and consumer 2 also bought celery on December 3, 2000, we would consider these to be 4 separate transactions. An easy way to do this is to rename each CUSTOMER_ID as “DATE-CUSTOMER_ID”. For example, if CUSTOMER_ID is 12321 and this customer bought apples on November 14, 2000, then their new ID is “11/14/00-12321”.

4.  Make sure each line in the CSV file is “DATE-CUSTOMER_ID1, PRODUCT_ID1”.  

5.  The header of the CSV file should be “DATE-CUSTOMER_ID, PRODUCT_ID”.

You need to save the dataset in CSV format. The figure below shows an example of the output file (please note that DATE-CUSTOMER_ID and PRODUCT_ID are strings and integers, respectively). A sketch of this preprocessing step is given after the figure.

Figure: customer_product file  
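The preprocessing can be done with plain Python (or with Spark) before the SON run. A minimal sketch, assuming the raw export has TRANSACTION_DT, CUSTOMER_ID, and PRODUCT_ID columns; both file names are placeholders:

import csv

with open("ta_feng_raw.csv", newline="") as src, \
        open("customer_product.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.writer(dst)
    writer.writerow(["DATE-CUSTOMER_ID", "PRODUCT_ID"])
    for row in reader:
        # The raw TRANSACTION_DT may need reformatting to the M/D/YY
        # style used above before being joined with the customer id.
        date = row["TRANSACTION_DT"]
        customer = row["CUSTOMER_ID"]
        product = int(row["PRODUCT_ID"])   # int() drops any leading zeros
        writer.writerow(["{}-{}".format(date, customer), product])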
(2) Apply SON Algorithm  
The requirements for Task 2 are similar to Task 1. However, you will test your implementation with the large dataset you just generated. For this purpose, you need to report the total execution time; this execution time includes everything from reading the file to writing the results to the output file. You are asked to find the candidate and frequent itemsets (similar to the previous task) using the file you just generated. The following are the steps you need to take:

1.  Read the customer_product CSV file into an RDD and then build the Case 1 market-basket model;

2.  Find the qualified date-customers who purchased more than k items (k is the filter threshold);

3.  Apply the SON Algorithm code to the filtered market-basket model.
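A minimal PySpark sketch of these three steps, assuming the pre-processed customer_product.csv from part (1); son_frequent_itemsets and apriori_on_chunk are the hypothetical names used in the SON sketch earlier:

import sys
from pyspark import SparkContext

filter_threshold = int(sys.argv[1])
support = int(sys.argv[2])
input_path = sys.argv[3]

sc = SparkContext.getOrCreate()
lines = sc.textFile(input_path)
header = lines.first()

# Step 1: build the Case 1 model -- one basket of distinct product ids
# per DATE-CUSTOMER_ID.
baskets = (lines.filter(lambda line: line != header)
                .map(lambda line: line.split(","))
                .map(lambda cols: (cols[0], cols[1]))
                .groupByKey()
                .mapValues(set))

# Step 2: keep only the date-customer baskets with more than k distinct items.
qualified = baskets.filter(lambda kv: len(kv[1]) > filter_threshold).values()

# Step 3: run SON on the filtered baskets.
# candidates, frequent = son_frequent_itemsets(qualified, support, apriori_on_chunk)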

Input format:

1.  Filter threshold: Integer used to select the qualified date-customer baskets (those with more than this many items).

2.  Support: Integer that defines the minimum count to qualify as a frequent itemset.
3.  Input file path: This is the path to the input file including path, file name and extension.  

4.  Output file path: This is the path to the output file including path, file name and extension.

Output format:

1.  Runtime: the total execution time from loading the file until finishing writing the output file. You need to print the runtime in the console with the “Duration” tag, e.g., “Duration: 100”.

2.  Output file  
The output file format is the same as in Task 1. Both the intermediate results and final results should be saved in ONE output result file.

Execution example:  
Python: spark-submit task2.py <filter threshold> <support> <input_file_path> <output_file_path>
Scala: spark-submit --class task2 hw2.jar <filter threshold> <support> <input_file_path> <output_file_path>

6. Evaluation Metric

Task 1:

Input File  | Case | Support | Runtime (sec)
------------|------|---------|--------------
small2.csv  |      |         | <=200
small2.csv  |      |         | <=100
Task 2:

Input File           | Filter Threshold | Support | Runtime (sec)
---------------------|------------------|---------|--------------
Customer_product.csv | 20               | 50      | <=500
 
Example situations:

Task  | Score for Python               | Score for Scala (10% of previous column if correct) | Total
------|--------------------------------|------------------------------------------------------|------
Task1 | Correct: 3 points              | Correct: 3 * 10%                                     | 3.3
Task1 | Wrong: 0 points                | Correct: 0 * 10%                                     | 0.0
Task1 | Partially correct: 1.5 points  | Correct: 1.5 * 10%                                   | 1.65
Task1 | Partially correct: 1.5 points  | Wrong: 0                                             | 1.5
