$25
In this homework we, as the manager of a grocery store, we ask you to develop a tool which will help us to improve our marketing strategy. We will provide you the purchased transactions of our customers as an input file and your tool needs to extract all the sales relations between the products. For this purpose, you are required to use a templated hashtable with separate chaining .
Overview
When we go shopping, we generally have a standard list of things to purchase. This list is prepared based on one's needs and preferences for each customer. A housewife might purchase fresh food like vegetables or milk for the kitchen whereas someone alone might purchase chips and coke. Understanding these buying interests can offer assistance to extend deals in a few ways.
For instance, if X and Y are frequently bought together:
● X and Y can be both placed on the same shelf, so that buyers of one item would be prompted to buy the other.
● Promotional discounts could be applied to just one out of the two items (X or Y) since buying one would encourage the buyer to buy the other one, too.
● Advertisements on X could be targeted at buyers who purchase Y.
The Goal
The goal of this homework, for a given list of transactions, is to extract all possible implications. For instance;
chips - coke
This implies that, for the customers who bought chips most likely bought coke, too. Or, as another example, consider the following implication;
sugar, milk - flour
Implies that, the customers who bought sugar and milk most likely bought flour, too.
Input File Format
Your program needs to ask two numbers (ranging from 0 and 1) for given support and confidence threshold values (see their definitions in the document for details below) from the user. Then, your program needs to ask the name of the input file. You can decide on the appropriate display messages for them.
The input file contains the list of purchased transactions of our customers. In each line, there are several products (items) that are purchased for that time of shopping. For a transaction, there is no repeated product and the number of products are not fixed. An example of input file is given as below:
inputFile.txt
milk bread bananas cake bread apple onion coke chips sandwich bananas yoghurt chocolate grapes orange milk orange bread
In the input file, there are no empty input lines and all of the words are given as in lower cases. The words in the same line are separated with an empty space. An item will be only one word, so, you will not see an item containing multiple words such as “orange juice”.
Output File Format
The output file format needs to be exactly the same as described here. You need to write your outputs to a file named as results.txt . In each line there will be only one rule. For each rule, you need to put commas between the products of the same side of the implication and you need to put a semicolon instead of the implication symbol (- ). So, basically, what we expect from you is to generate the file as given in the second column of the following table if the extracted implications are as given in the first column of the table. There will be no space character in the output file. Moreover, you need to give the confidence values of rules in the output file. Round your confidence values having 2 decimals, i.e., having 2 digits after the dot as below. Continue reading this document for more details on confidence values.
Extracted Rules with Confidences
A - B conf = 0.30123
{B, C} - D conf = 0.12139
{B, C} - {A, D} conf = 0.32125
Output file format
A;B;0.30
B,C;D;0.12
B,C;A,D;0.32
Background Information
Before going into detail with the algorithm, we need to define two concepts: support and confidence .
Support describes how an item (or set of items) are popular among the whole item set. Hence, the support value of an item (or set of items) is the ratio of the number of times the item appeared among all the transactions. For instance, if there are 10 transactions in the list and milk is purchased 4 times, the support value of milk is 0.4. On the other hand, if you need to calculate the support value of several items together, such as milk and bread, you need to find the number of transactions which contain both milk and bread together to find the support value of them.
Confidence value describes how likely item Y is purchased when item X is purchased, expressed as X-Y . This is calculated as:
confidence(X - Y) = support(X, Y) / support(X)
For an implication having multiple items, such as {X, Y} - {Z, W}, confidence is calculated as:
confidence({X, Y} - {Z, W}) = support(X, Y, Z, W) / support(X, Y)
For instance; for the given support values;
support(bread) = 0.5 support(milk) = 0.4 support(bread, milk) = 0.2
The confidence of {bread - milk} is calculated as:
confidence(bread - milk) = support(bread, milk) / support(bread) = 0.2 / 0.5 = 0.4
Whereas the confidence of {milk - bread} becomes,
confidence(milk-bread) = support(milk, bread) / support(milk) = 0.2 / 0.4 = 0.5
The Flow of the Program
The program has 2 main parts: in the first part, you need to find the support value of several item sets such as support({bread}) and support({bread, milk}) . Then, by using those support values, you need to calculate the confidence values and extract the rules.
Calculating Support Values
● Find all support values of items in the transaction such as support({bread}) and support({milk}).
● Store all the items whose support value is smaller than the given support threshold value.
● Find all possible item pairs of the remaining items (selected after the previous step).
● Find all support values of item pairs in the transaction such as support({bread, milk}) , support({apple,orange}).
● Store all the items pairs whose support value is smaller than the given support threshold value.
Rule Extraction Process
Up to this point, you have discovered all items and item pairs whose support values are greater than a given support threshold value. Assume that all of these items and item pairs are stored in the same hash table named as lookupTable .
Using lookupTable (a hash table) you need to enumerate all possible 2 way permutations of all the lookupTable elements. For example, assume that in the hash table, we have the following elements:
{A}, {B}, {C}, {A, B} and { B, C}
You need to find all possible 2 length permutations of these elements and construct the rules as below:
A - B
A - C
A - {B, C}
B - A
B - C
C - A
C - B
C - {A, B}
{A, B} - C
However note that, there can not be the same item on both sides of the implication. For example, the following rules are not valid .
A - {A, B}
{A, B} - {B, C}
{A, B} - A
After enumerating all possible rules, you need to calculate the confidence values of them. The output of your program will be the rules whose support values are greater the given confidence threshold value.
Sample Runs
in.txt
milk bread cake bread apple onion coke chips sandwich
bananas yoghurt chocolate grapes orange milk orange bread yoghurt bananas juice apple cucumber garlic tea egg melon
bread milk juice bananas orange milk tea bread bread milk chocolate milk coffee
broccoli bananas orange bread milk
Sample Run 1
For the given input file above, the output will be:
Please enter the transaction file name: in.txt
Please enter support and confidence values between 0 and 1: 0.20 0.60
14 rules are written to results.txt
results.txt
bananas;orange;0.75 orange;bananas;0.75 bread,orange;bananas;0.66 orange,milk;bananas;0.66 orange;bread;0.75 bread;milk;0.85 milk;bread;0.85 bananas,orange;bread;0.66 orange,milk;bread;1.00 orange;milk;0.75 orange;bread,milk;0.75 bananas,orange;milk;0.66 bread,orange;milk;1.00 bananas,orange;bread,milk;0.66
Sample Run 2
For the given same input file above, the output will be:
Please enter the transaction file name: in.txt
Please enter support and confidence values between 0 and 1: 0.60 0.45 There is no rule for the given support and confidence values.