COL761-Data_Mining_Assignment1 Solved

1. Mention your GitHub repo. Make sure that it is the same GitHub repo that you mentioned in HW0.

2. This question is on frequent itemset mining. Implement the Apriori algorithm to mine frequent itemsets. Apply it to the dataset: http://fimi.uantwerpen.be/data/webdocs.dat.gz. You may assume all items are integers.

a. Name your bash file RollNo.sh. For example, if MCS162913 is your roll number, your file should be named MCS162913.sh. Executing the command “./RollNo.sh -apriori retail.dat X <filename>” should generate a file filename.txt containing the frequent itemsets at a support threshold of >=X%, mined with the Apriori algorithm. Note that X is a percentage, not an absolute count. Your implementation must ensure that the transactions are not loaded into main memory: you may not parse the complete input data and save it in an array or similar data structure. However, the frequent patterns and candidate sets can be stored in memory.
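The memory constraint above can be met by re-reading the input file once per Apriori level. The following is only a minimal sketch of that idea, not a reference solution: the function names (read_transactions, generate_candidates, apriori) are hypothetical, items are kept as strings, and the candidate join is deliberately naive, so a dataset as large as webdocs would likely need a faster counting structure.

```python
from itertools import combinations


def read_transactions(path):
    """Yield one transaction at a time; the file is never held in memory."""
    with open(path) as fh:
        for line in fh:
            items = line.split()
            if items:
                yield frozenset(items)


def generate_candidates(frequent, k):
    """Join frequent (k-1)-itemsets and prune by the Apriori property."""
    candidates = set()
    for a in frequent:
        for b in frequent:
            union = a | b
            if len(union) != k:
                continue
            # Every (k-1)-subset of a candidate must itself be frequent.
            if all(frozenset(s) in frequent for s in combinations(union, k - 1)):
                candidates.add(union)
    return candidates


def apriori(path, support_percent):
    """Return all itemsets whose support is >= support_percent (a percentage)."""
    # Pass 1: count the transactions and the single items.
    n = 0
    counts = {}
    for transaction in read_transactions(path):
        n += 1
        for item in transaction:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1

    def is_frequent(count):
        # Integer comparison avoids rounding issues when converting X% to a count.
        return count * 100 >= support_percent * n

    frequent = {s for s, c in counts.items() if is_frequent(c)}
    result = set(frequent)
    k = 2
    while frequent:
        candidates = generate_candidates(frequent, k)
        if not candidates:
            break
        counts = dict.fromkeys(candidates, 0)
        # One more streaming pass over the file for this level.
        for transaction in read_transactions(path):
            if len(transaction) < k:
                continue
            for candidate in candidates:
                if candidate <= transaction:
                    counts[candidate] += 1
        frequent = {s for s, c in counts.items() if is_frequent(c)}
        result |= frequent
        k += 1
    return result
```

Only the candidate sets and the frequent itemsets live in memory here, which is what the statement permits; the cost is one full scan of the file per itemset size.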

filename.txt should strictly follow the following format.

i. Each frequent itemset must be on a new line.

ii. The items must be space separated and in ascending order of ASCII code.
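One way to read the ordering rule is that items are compared as strings, character by character, rather than as integers. The sketch below assumes that interpretation; the helper name format_itemset is hypothetical.

```python
def format_itemset(items):
    """Return one output line: items space separated, sorted as strings.

    Assumption: "ascending order of ASCII code" means plain string
    comparison, so "10" sorts before "9" because '1' < '9'.
    """
    return " ".join(sorted(str(item) for item in items))


def write_itemsets(itemsets, out_path):
    """Write each itemset on its own line."""
    with open(out_path, "w") as fh:
        for itemset in itemsets:
            fh.write(format_itemset(itemset) + "\n")
```

Under this reading, format_itemset([9, 10, 100]) gives "10 100 9", which differs from numeric order, so it is worth confirming the intended interpretation before submitting.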

Your grade will be (F-score)*20.

b. Compare the performance of your Apriori implementation with the following FP-tree implementation: https://borgelt.net/fpgrowth.html (download the package “fpgrowth.zip” and unzip it, cd fpgrowth/fpgrowth/src, run make all, then run ./fpgrowth -sSUPPORT% inputfile outfile. For more details, visit https://borgelt.net/doc/fpgrowth/fpgrowth.html). Executing the command “./RollNo.sh retail.dat -plot” should generate a plot using matplotlib in which the x-axis shows the support threshold and the y-axis shows the corresponding running time. It should plot the running times of the FP-tree and Apriori algorithms at support thresholds of 5%, 10%, 25%, 50%, and 90%. Explain the results that you observe. You may add a timeout if your code fails to finish within 1 hour.
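A rough sketch of the timing and plotting step is given below. It is illustrative only: the function names, the output file names (fp_out.txt, ap_out, runtimes.png) and the choice to record a timed-out run as the timeout value are all assumptions, and the fpgrowth command line simply follows the "-sSUPPORT" pattern quoted above, read here as "-s5" for 5%.

```python
import subprocess
import time

SUPPORTS = [5, 10, 25, 50, 90]   # thresholds named in the assignment
TIMEOUT_SECONDS = 3600           # the optional 1-hour timeout


def time_command(cmd, timeout=TIMEOUT_SECONDS):
    """Run cmd and return its wall-clock time in seconds.

    Assumption: a run that hits the timeout is recorded as the timeout value.
    """
    start = time.perf_counter()
    try:
        subprocess.run(cmd, stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL, timeout=timeout)
    except subprocess.TimeoutExpired:
        return float(timeout)
    return time.perf_counter() - start


def plot_runtimes(supports, runtimes, out_png="runtimes.png"):
    """Plot one line per algorithm; runtimes maps a label to a list of seconds."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file; no display is needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    for label, times in runtimes.items():
        ax.plot(supports, times, marker="o", label=label)
    ax.set_xlabel("Support threshold (%)")
    ax.set_ylabel("Running time (s)")
    ax.legend()
    fig.savefig(out_png)


def benchmark(dataset, script="./RollNo.sh", fpgrowth="./fpgrowth"):
    """Time both programs at every threshold and save the plot."""
    runtimes = {"Apriori": [], "FP-tree": []}
    for s in SUPPORTS:
        runtimes["Apriori"].append(
            time_command([script, "-apriori", dataset, str(s), "ap_out"]))
        runtimes["FP-tree"].append(
            time_command([fpgrowth, f"-s{s}", dataset, "fp_out.txt"]))
    plot_runtimes(SUPPORTS, runtimes)
    return runtimes
```

Using wall-clock time around a subprocess keeps the comparison uniform across the two programs, at the cost of including process start-up in every measurement.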

 

3.  Implement PrefixSpan. Use it on the finished paths sequence dataset at SNAP: Web data: Wikispeedia navigation paths (stanford.edu). In this dataset, every itemset within a sequence is of length 1. You will need to extract only the sequence of page titles and ignore all remaining fields (only the column “path” is of significance). Note that the ‘<’ symbol means a back-click. You will need to add a further pre-processing step that inserts the relevant page title whenever a back-click is seen. Executing the command “./RollNo.sh -prefixspan retail.dat X <filename>” should generate a file filename.txt containing the frequent sub-sequences at a support threshold of >=X%. Note that X is a percentage, not an absolute count.
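The back-click pre-processing can be sketched with a stack that tracks the current navigation chain. This is one possible reading, assuming that page titles in the “path” column are separated by ‘;’ and that each ‘<’ should be replaced by the title of the page the user returned to; the function name resolve_back_clicks is hypothetical.

```python
def resolve_back_clicks(path):
    """Replace each '<' in a path with the title of the page returned to.

    Assumption: titles in the path column are separated by ';'.
    """
    stack = []   # current navigation chain; the top is the page being viewed
    pages = []   # the resolved sequence of page titles
    for token in path.split(";"):
        if token == "<":
            if len(stack) > 1:
                stack.pop()              # leave the current page
                pages.append(stack[-1])  # the page the user lands on again
        else:
            stack.append(token)
            pages.append(token)
    return pages
```

For example, the path "A;B;<;C" resolves to A, B, A, C: the user visits B, goes back to A, and then follows a link to C.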

filename.txt should strictly follow the following format.

i. Each frequent itemset must be on a new line.

ii. The items must be space separated and in ascending order of ASCII code.
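Because every itemset in these sequences has length 1, PrefixSpan reduces to growing a prefix one item at a time over projected suffixes. The sketch below illustrates that special case under stated assumptions: sequences are Python lists of page titles already held in memory, projections are stored as (sequence index, start position) pairs, and the function name prefixspan is hypothetical.

```python
def prefixspan(sequences, support_percent):
    """Return (pattern, count) pairs with support >= support_percent (a percentage).

    Assumption: each sequence is a list of single items (itemsets of length 1).
    """
    n = len(sequences)
    results = []

    def mine(prefix, projected):
        # Count each item at most once per projected suffix.
        counts = {}
        for sid, start in projected:
            for item in set(sequences[sid][start:]):
                counts[item] = counts.get(item, 0) + 1

        for item, count in sorted(counts.items()):
            if count * 100 < support_percent * n:
                continue
            pattern = prefix + [item]
            results.append((pattern, count))
            # Project on the first occurrence of item in every suffix.
            new_projected = []
            for sid, start in projected:
                try:
                    pos = sequences[sid].index(item, start)
                except ValueError:
                    continue
                new_projected.append((sid, pos + 1))
            mine(pattern, new_projected)

    mine([], [(sid, 0) for sid in range(n)])
    return results
```

Storing positions instead of copying suffixes keeps the projected databases small, which matters when many prefixes are explored.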

 

Bash scripts you need to provide:

· compile.sh, which compiles your code for all implementations. Specifically, running ./compile.sh in your submission folder should create all the binaries that you require. Any optimization flags, such as -O3 for g++, should be included in this script.

· RollNo.sh, as specified earlier.

· install.sh, which should clone your team’s repository into the current directory and then execute the bash commands that load all the required HPC modules. Inside the repository we should be able to locate EntryNo-Assgn1.zip, corresponding to your Homework 1 submission. Note that this script will be run as source install.sh (to make use of the HPC module load alias).
