Data Structures
Overview
There are three datasets given with this project statement. You are required to complete the tasks given below for any one of the datasets. At most, you may attempt two of the three datasets to gain a deeper understanding of how graphs help in uncovering hidden patterns in data (you can read that as, “for bonus marks”). You should first complete all the tasks on the smallest dataset and then reuse the same code for the others. (170 marks)
You are also required to submit a one-page report on your understanding of the dataset. In the report, explain the patterns in the dataset as thoroughly as you can. (10 marks)
Note: Your program MUST present a cool menu to run the tasks one at a time or all at once. We will use this menu to test your program. If the menu is not cool/useful, we will deduct 20 marks from the overall obtained marks. If the output is not cool/readable, we will deduct a further 10 marks from the overall obtained marks.
Datasets
1. General Relativity and Quantum Cosmology collaboration network
Dataset Description
The Arxiv GR-QC (General Relativity and Quantum Cosmology) collaboration network is from the e-print arXiv (https://arxiv.org/archive/gr-qc) and covers scientific collaborations between authors of papers submitted to the General Relativity and Quantum Cosmology category. If author i co-authored a paper with author j, the graph contains an undirected edge between i and j. If a paper is co-authored by k authors, this generates a completely connected (sub)graph on those k nodes.
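The clique rule above means a single k-author paper contributes all k(k−1)/2 author pairs as edges. A minimal sketch of how such edges could be generated (the helper name `coauthor_edges` is hypothetical, not part of the dataset tooling):

```python
from itertools import combinations

def coauthor_edges(authors):
    """Given the author list of one paper, return the undirected edges
    of the complete (sub)graph induced by those k authors."""
    # Every unordered pair of co-authors becomes one undirected edge,
    # so a k-author paper contributes k*(k-1)/2 edges.  frozenset makes
    # the edge direction-free, so (i, j) and (j, i) are the same edge.
    return {frozenset(pair) for pair in combinations(set(authors), 2)}

# A 3-author paper yields the triangle {i,j}, {i,k}, {j,k}.
edges = coauthor_edges(["i", "j", "k"])
```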
The data covers papers in the period from January 1993 to April 2003 (124 months). It begins within a few months of the inception of the arXiv, and thus represents essentially the complete history of its GR-QC section.
2. Astro Physics collaboration network
Dataset Description
The Arxiv ASTRO-PH (Astro Physics) collaboration network is from the e-print arXiv (https://arxiv.org/archive/astro-ph) and covers scientific collaborations between authors of papers submitted to the Astro Physics category. If author i co-authored a paper with author j, the graph contains an undirected edge between i and j. If a paper is co-authored by k authors, this generates a completely connected (sub)graph on those k nodes.
The data covers papers in the period from January 1993 to April 2003 (124 months). It begins within a few months of the inception of the arXiv, and thus represents essentially the complete history of its ASTRO-PH section.
3. Amazon product co-purchasing network and ground-truth communities
Dataset Description
Data was collected by crawling the Amazon website. It is based on the “Customers Who Bought This Item Also Bought” feature (explained here: https://davidgaughran.com/also-boughts-amazon-recommendations/) of the Amazon website. If a product i is frequently co-purchased with product j, the graph contains an undirected edge between i and j. Each product category provided by Amazon defines a ground-truth community.
We regard each connected component in a product category as a separate ground-truth community, and we remove the ground-truth communities which have fewer than 3 nodes.
Tasks
First Set of Tasks – Graph Stats
1. Display the number of nodes (5 marks)
2. Display the number of edges (5 marks)
3. Display the number of source nodes (5 marks)
4. Display the number of sink nodes (5 marks)
5. Display the number of isolated nodes (5 marks)
6. Display the number of bridge edges (15 marks)
7. Display the number of articulation nodes (15 marks)
8. Display the shortest path length distribution (20 marks)
9. Display the diameter of the graph (5 marks)
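Tasks 1–5 reduce to degree counting once the graph is loaded. A minimal sketch, assuming the graph arrives as a node list plus a directed edge list (the function name `graph_stats` and this input shape are assumptions, not part of the handout):

```python
def graph_stats(nodes, edges):
    """Basic directed-graph statistics from a node list and an edge list.

    Counts follow the handout's definitions literally: a source has
    in-degree 0, a sink has out-degree 0, and an isolated vertex has
    both equal to 0 (so, under these definitions, an isolated vertex
    is also counted as a source and as a sink).
    """
    indeg = {v: 0 for v in nodes}
    outdeg = {v: 0 for v in nodes}
    for u, v in edges:
        outdeg[u] += 1
        indeg[v] += 1
    return {
        "nodes": len(nodes),
        "edges": len(edges),
        "sources": sum(1 for v in nodes if indeg[v] == 0),
        "sinks": sum(1 for v in nodes if outdeg[v] == 0),
        "isolated": sum(1 for v in nodes if indeg[v] == 0 and outdeg[v] == 0),
    }
```

Bridges, articulation points, the path-length distribution, and the diameter need traversal-based algorithms and are not covered by this counting pass.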
Second Set of Tasks – Degree Distributions
10. Display the in-degree distribution in the form of a table (10 marks)
11. Display the out-degree distribution in the form of a table (10 marks)
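One way to tabulate a degree distribution is to count degrees per vertex, then count how many vertices share each degree. A sketch for the in-degree case (the function name and the row format `(k, count, P(k))` are choices of this sketch, not requirements):

```python
from collections import Counter

def in_degree_table(nodes, edges):
    """In-degree distribution of a directed graph, returned as rows
    (k, number of vertices with in-degree k, P(k)) sorted by k."""
    indeg = {v: 0 for v in nodes}   # include vertices with in-degree 0
    for _, v in edges:
        indeg[v] += 1
    dist = Counter(indeg.values())  # degree k -> how many vertices have it
    n = len(nodes)
    return [(k, c, c / n) for k, c in sorted(dist.items())]
```

The out-degree table is identical except that it increments `outdeg[u]` for each edge (u, v).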
Third Set of Tasks – Connected Components
For the original graph:
12. Display the size of the largest strongly connected component (SCC) (25 marks)
13. Display the size distribution of all SCCs (10 marks)
Considering the graph as an undirected graph:
14. Display the size of the largest weakly connected component (WCC) (25 marks)
15. Display the size distribution of all WCCs (10 marks)
Algorithms
You will need the following algorithms for the third set of tasks.
Breadth-First Search (BFS) Algorithm
The BFS algorithm (Algorithm 1) takes as input a graph G(V, E) and a source vertex s. The algorithm returns the set of vertices that the source vertex has a path to.
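Algorithm 1 itself is not reproduced here, but the behaviour it describes — return every vertex that s has a path to — can be sketched as follows, assuming the graph is stored as an adjacency dict:

```python
from collections import deque

def bfs(adj, s):
    """Return the set of vertices reachable from s in a graph given as
    an adjacency dict {vertex: list of neighbours}."""
    seen = {s}            # s always reaches itself
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):   # vertices with no outgoing edges are fine
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen
```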
Strongly Connected Components (SCC) Algorithm
The SCC algorithm (Algorithm 4) takes a graph G(V, E) as input and returns a list of all the SCCs in this graph as output. The SCC of G is computed using Eq. 1.21. The function unique returns a list of all the distinct elements in the input list. For instance, unique([1,2,1,3,2,2]) will return [1,2,3].
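Eq. 1.21 is not reproduced in this excerpt; a common formulation, assumed here, defines the SCC of a vertex v as the intersection of the vertices reachable from v in G and those reachable from v in the transposed graph G^T. Under that assumption, the algorithm can be sketched as below (O(|V|·(|V|+|E|)), so adequate for a first pass but slow on the largest datasets):

```python
from collections import deque

def bfs(adj, s):
    """Vertices reachable from s in an adjacency-dict graph."""
    seen, queue = {s}, deque([s])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def strongly_connected_components(adj, nodes):
    """SCCs via the reachability definition: SCC(v) = BFS(G, v) ∩ BFS(G^T, v).
    Deduplicating the per-vertex results mirrors the handout's `unique` step."""
    radj = {}                       # transposed graph G^T
    for u in adj:
        for v in adj[u]:
            radj.setdefault(v, []).append(u)
    sccs = []
    for v in nodes:
        scc = frozenset(bfs(adj, v) & bfs(radj, v))
        if scc not in sccs:         # the `unique` step
            sccs.append(scc)
    return sccs
```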
Weakly Connected Components (WCC) Algorithm
The WCC algorithm (Algorithm 5) computes the list of all WCCs given a graph G(V, E) as input.
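Algorithm 5 is not reproduced in this excerpt; the standard approach, sketched here under that caveat, is to treat every edge as undirected and then repeatedly run BFS from any unvisited vertex — each BFS tree is one WCC:

```python
from collections import deque

def bfs(adj, s):
    """Vertices reachable from s in an adjacency-dict graph."""
    seen, queue = {s}, deque([s])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def weakly_connected_components(nodes, edges):
    """WCCs of a directed graph: symmetrise the edges, then one BFS per
    still-unvisited vertex; each BFS yields one component."""
    adj = {v: [] for v in nodes}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)            # ignore direction
    seen, comps = set(), []
    for s in nodes:
        if s not in seen:
            comp = bfs(adj, s)
            seen |= comp
            comps.append(comp)
    return comps
```

The size distribution asked for in task 15 then falls out of `len(c) for c in comps`.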
Definitions
Source Vertex: A vertex with zero in-degree is called a source vertex.
Sink Vertex: A vertex with zero out-degree is called a sink vertex.
Isolated Vertex: A vertex with in-degree and out-degree both equal to zero is called an isolated vertex.
Degree Distribution: The degree distribution of a graph G(V, E), denoted by P(k), is defined as the probability that a randomly chosen vertex has degree k. If |V|_k denotes the number of vertices with degree k, then P(k) = |V|_k / |V|.
Bridge Edge: A bridge edge is an edge whose removal disconnects the graph.
Articulation Vertex: An articulation vertex is a vertex whose deletion disconnects the graph.
Diameter: The diameter of a graph is the maximum of the shortest-path distances between all pairs of vertices in the graph. While computing the diameter, all infinite distances (i.e., unreachable pairs) are disregarded.
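The definition above can be computed directly by running a BFS from every vertex and keeping the largest finite distance seen; unreachable pairs simply never appear in the distance map. A sketch (O(|V|·(|V|+|E|)), which is the practical approach for unweighted graphs of this size):

```python
from collections import deque

def diameter(adj, nodes):
    """Largest finite shortest-path distance over all ordered pairs,
    for a graph given as an adjacency dict.  Unreachable pairs are
    skipped, matching the definition's treatment of infinite distances."""
    best = 0
    for s in nodes:
        dist = {s: 0}
        queue = deque([s])
        while queue:                      # one BFS, recording hop counts
            u = queue.popleft()
            for v in adj.get(u, ()):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        best = max(best, max(dist.values()))
    return best
```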