Starting from:

$30

COL761-Data Mining Assignment2 Solved

1. This question is about getting yourself familiar with frequent subgraph mining tools.

Run it on the Yeast (https://bit.ly/3EVA5Cc) Dataset. This is a database of molecules. The format is the following: 

#graphID 

# of nodes 

Series of Node Labels 

# of edges 

Series of “Source node, Destination Node, Edge label” 

Run gSpan, FSG (also known as PAFI), and Gaston (you should be able to find it online) against frequency threshold in the dataset (you may need to write a script to change the format of the dataset) at minSup = 5%, 10%, 25%, 50% and 95%. Plot the running times and explain the trend observed in the running times. Specifically comment on the growth rates and why one technique is faster than the others. You are free to consult the respective papers. Libraries you can potentially use: 

•       gSpan : https://sites.cs.ucsb.edu/~xyan/software/gSpan.htm 

•       FSG : http://glaros.dtc.umn.edu/gkhome/pafi/download 

•       Gaston : https://liacs.leidenuniv.nl/~nijssensgr/gaston/download.html 

  

2. Dataset (https://bit.ly/3odTt7B) . [Marks: for correct algorithm. for correct implementation. for correct explanation.] 

a.       Reduce to dimensions 2,4,10,20 using PCA. 

b.       Index using KD-tree and M-tree for L2 distance. You can use off-the shelf libraries if you wish to. Maintain the dataset in memory. 

c.       Write an algorithm to perform k-NN query on M-tree and KD-tree. Needless to say your algorithm should try to use the index structure and maximize the pruning potential of the search space. 

d.       Choose 100 random points as query and plot the average running time of 5NN query with standard deviation as error bars for KD-tree, M-tree and Sequential scan against dimension. Explain the trends. 

More products