$25
Two datasets (Market, Gene) can be found on Piazza. For each dataset, we provide the transaction-item representation discussed in class—Each row denotes a transaction, and each transaction consists of a set of items.
In this assignment, you are asked to implement Apriori algorithm that discovers a collection of frequent itemsets from a transaction database.
Template is for Python 3. You are asked to fill in two functions: apriori_gen and get_freq. apriori_gen generates new candidate itemsets based on the frequent itemsets found in the previous iteration, and prunes the itemsets containing any infrequent itemset of size k-1. get_freq returns the candidate itemsets that meet a minimum support threshold. Do not change the input and output of the two functions in the template.
Please take the following steps:
1. Implement Apriori algorithm. The pseudo code can be found below. Apriori (T, minSupport){
// T is the database and minSupport is the minimum support 𝐿1 = {the set of single item whose support is not less than minSupport} for (k = 2; 𝐿𝑘−1 ≠ ∅; k++ ){
//generate and prune candidate set 𝐶𝑘
𝐶𝑘 is a list of itemsets in which each itemset is formed by merging two itemsets in 𝐿𝑘−1 if their first k-2 items are identical
Remove an itemset from 𝐶𝑘 if any (k-1)-subset of this candidate itemset is not in the frequent itemset list 𝐿𝑘−1
//Count the support of each candidate itemset for each transaction t in database do{
for each candidate 𝑐 in 𝐶𝑘
// increment the support count of all candidate itemsets that are contained in transaction t
if c is a subset of t then count[c] ← count[c] + 1
}
for each candidate 𝑐 in 𝐶𝑘
// Judge if a candidate itemset is frequent or not
if the support of c is not less than minSupport
then include c in 𝐿𝑘
} return {𝐿1, 𝐿2, … , 𝐿𝑘}
}
You cannot directly call a function or package that implements Apriori. You need to implement the algorithm by yourself. If you are not sure about whether it is OK to use a certain function, please post your question on Piazza.
2. Apply your implemented Apriori on the Market dataset with minimum support threshold=50%. You should get the following candidate itemsets (Ck) and frequent itemsets (Lk):
Candidate itemsets:
C2: {Eggs,Key-chain}, {Eggs,Mango}, {Eggs,Onion}, {Eggs,Yo-yo}, {Key-chain, Mango}, {Key-chain, Onion}, {Key-chain,Yo-yo}, {Mango,Onion}, {Mango,Yo-yo},
{Onion,Yo-yo}
C3: {Eggs,Key-chain,Onion}, {Key-chain,Mango,Onion}
Frequent itemsets:
L1: {Eggs}, {Key-chain}, {Mango}, {Onion}, {Yo-yo}
L2: {Eggs,Key-chain}, {Eggs,Onion}, {Key-chain,Mango}, {Key-chain,Onion}, {Keychain,Yo-yo}, {Mango, Onion}
L3: {Eggs, Key-chain,Onion}
3. If you get the same collection of itemsets at Step 2, you can proceed to apply your implemented Apriori algorithm on the Gene dataset with minimum support threshold=50%. You should be able to get 51 length-1 frequent itemsets (L1), 1275 length-2 candidate itemsets (C2), 29 length-2 frequent itemsets (L2), 20 length-3 candidate itemsets (C3) and 2 length-3 frequent itemsets (L3).
4. Prepare your submission. Your final submission should be a zip file named as Assignment2.zip. In the zip file, you should include:
• A folder “Code”, which contains all the codes used in this assignment. You are required to fill in the template. Do not change other functions or add other scripts.
• Report: A doc or pdf file named as Assignment2.pdf. The report should consist of the following parts: 1) The frequent itemsets you obtain on Gene dataset (L1, L2, L3). 2) The length-3 candidate itemsets generated during Apriori (C3) on Gene dataset. 3) The codes of your Apriori algorithm implementation.