3.1) Erdős–Rényi random graphs
We begin by exploring numerically some of the properties of Erdős–Rényi (ER) graphs. For this we will create ER graphs with n = 5,000 nodes and increasing probability p.
a) Create a number of ER graphs with increasing probability p. Your probabilities should cover the range p = [10^−5, 10^−2] with ‘logspace’ (e.g., np.logspace in python). Plot the size of the largest connected component relative to the number of nodes n (i.e., if the giant component consists of the whole graph its size is 1), as a function of p. For each p you should create 20 graphs and plot the mean value plus/minus the standard deviation of the giant component size. (hint: consider using the functions ‘fast_gnp_random_graph’ and ‘connected_components’ in networkx).
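A minimal sketch of part a) in python, assuming networkx and numpy are available. The trial count is reduced from 20 to 5 here only to keep the run short; the plotting lines are left as comments:

```python
import numpy as np
import networkx as nx

def giant_component_fraction(n, p, trials, seed=0):
    """Mean and std of the largest-component fraction over several ER draws."""
    rng = np.random.default_rng(seed)
    fracs = []
    for _ in range(trials):
        G = nx.fast_gnp_random_graph(n, p, seed=int(rng.integers(2**31)))
        largest = max(nx.connected_components(G), key=len)
        fracs.append(len(largest) / n)
    return np.mean(fracs), np.std(fracs)

n = 5000
ps = np.logspace(-5, -2, 15)                 # p in [1e-5, 1e-2]
stats = [giant_component_fraction(n, p, trials=5) for p in ps]
means = np.array([m for m, s in stats])
stds = np.array([s for m, s in stats])

# Plot mean +/- std with a logarithmic x-axis, e.g.:
# import matplotlib.pyplot as plt
# plt.errorbar(ps, means, yerr=stds)
# plt.xscale("log"); plt.xlabel("p"); plt.ylabel("giant component fraction")
# plt.show()
```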
b) What do you observe for p ≈ 1/n? What happens for p ≈ ln(n)/n? Provide a brief description of these phenomena in terms of the threshold properties that we studied in class.
c) Let A denote the event that node 1 has at least l > 1 neighbors in an ER graph Gn,p. Consider the family of functions t(n) = r/n for r ∈ R+. Show that P(A) → 0 if p(n)/t(n) → 0 and P(A) → 1 if p(n)/t(n) → ∞ as n increases, for any r ∈ R+ (i.e., every t(n) in this family is a threshold function for A).
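The threshold behavior in part c) can also be illustrated numerically: node 1's degree in Gn,p is Binomial(n − 1, p), so P(A) can be evaluated exactly with a binomial survival function. A sketch, assuming scipy is available; the particular sequences p below and above the threshold are illustrative choices:

```python
import numpy as np
from scipy.stats import binom

l = 3                                    # event A: node 1 has >= l neighbors
pa_below, pa_above = [], []
for n in [10**3, 10**4, 10**5, 10**6]:
    p_lo = 1.0 / (n * np.log(n))         # p(n)/t(n) -> 0
    p_hi = np.log(n) / n                 # p(n)/t(n) -> infinity
    # degree of node 1 is Binomial(n - 1, p); sf(l-1, ...) gives P(deg >= l)
    pa_below.append(binom.sf(l - 1, n - 1, p_lo))
    pa_above.append(binom.sf(l - 1, n - 1, p_hi))
print(pa_below)   # shrinks toward 0 as n grows
print(pa_above)   # climbs toward 1 as n grows
```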
3.2) Fitting power-law exponents
[adapted from Kolaczyk 4.1, credit Gonzalo Mateos] Consider the problem of estimating the exponent α > 0 of a power-law distribution with probability density function (PDF) proportional to x−α.
a) A random variable X is said to follow a Pareto distribution if it has a PDF

f(x) = α x0^α / x^(α+1), for x ≥ x0 > 0.
Verify that f(x) is a valid PDF, and determine the complementary cumulative distribution function (CCDF) F̄(x) = P[X ≥ x].
b) Generate a synthetic dataset of n = 100,000 independent and identically distributed samples from a Pareto distribution, for α = 1 and x0 = 1. Samples may be drawn from a Pareto distribution by using the inverse transform method. Specifically, having generated random numbers u1, u2, ..., un in [0,1], the values xi = x0 ui^(−1/α) will follow the desired Pareto distribution (Can you argue why this is true?). Plot on a log-log scale the empirical estimate of P[X = x] from your data. To obtain this estimate, you may round each point to the nearest integer and compute a normalized histogram over integer values. Use ‘points’ instead of ‘bars’ for your histogram plot in log-log scale; and superimpose a plot of the actual PDF (i.e., a line of slope −α − 1) to verify that you generated the data correctly.
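A minimal python sketch of the sampling and histogram steps, assuming numpy; the inverse-transform formula follows from setting u = F̄(x) = (x0/x)^α and solving for x. Plotting lines are left as comments:

```python
import numpy as np

rng = np.random.default_rng(42)
n, alpha, x0 = 100_000, 1.0, 1.0

# Inverse transform: F̄(x) = (x0/x)^alpha, so x = x0 * u**(-1/alpha).
# Use 1 - random() so u lies in (0, 1] and we never divide by zero.
u = 1.0 - rng.random(n)
x = x0 * u ** (-1.0 / alpha)

# Empirical P[X = x]: round to nearest integer, normalized histogram
vals = np.rint(x).astype(np.int64)
ints, counts = np.unique(vals, return_counts=True)
pmf = counts / n

# Log-log plot with points, plus the reference line of slope -(alpha + 1):
# import matplotlib.pyplot as plt
# plt.loglog(ints, pmf, '.', label='empirical')
# plt.loglog(ints, ints.astype(float) ** (-(alpha + 1)), label='slope -(a+1)')
# plt.legend(); plt.show()
```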
c) Show that under the Pareto model the log-likelihood function is given by
ℓn(α) = Σi=1,...,n [log α + α log x0 − (α + 1) log xi]
and it is maximized by the Hill estimator that we discussed in class.
d) For n = 100,000, α = 1, and x0 = 1, compute estimates of the exponent α via (i) least-squares regression on the log-log empirical distribution (histogram) logP[X = x]; (ii) least-squares regression on the log-log empirical CCDF logF¯(x); and (iii) maximum likelihood as derived in c). Repeat the data-generation
process and estimation 100 times, and report the sample mean and standard deviation for each method (i)-(iii). Based on these values, which method gives the best estimate?
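A sketch of the three estimators from part d) on a single synthetic dataset, assuming numpy; `pareto_sample` and `estimate_alpha` are hypothetical helper names, and the 100-repetition loop with its bookkeeping is left out:

```python
import numpy as np

def pareto_sample(n, alpha, x0, rng):
    u = 1.0 - rng.random(n)                    # uniform on (0, 1]
    return x0 * u ** (-1.0 / alpha)            # inverse transform

def estimate_alpha(x, x0=1.0):
    n = len(x)
    # (i) LS regression on the log-log integer-rounded histogram of P[X = x]
    ints, counts = np.unique(np.rint(x).astype(np.int64), return_counts=True)
    slope_pdf = np.polyfit(np.log(ints), np.log(counts / n), 1)[0]
    a_hist = -slope_pdf - 1.0                  # PDF slope is -(alpha + 1)
    # (ii) LS regression on the log-log empirical CCDF
    xs = np.sort(x)
    ccdf = 1.0 - np.arange(n) / n              # empirical P[X >= xs[i]]
    slope_ccdf = np.polyfit(np.log(xs), np.log(ccdf), 1)[0]
    a_ccdf = -slope_ccdf                       # CCDF slope is -alpha
    # (iii) Hill / maximum-likelihood estimator from part c)
    a_mle = n / np.sum(np.log(x / x0))
    return a_hist, a_ccdf, a_mle

rng = np.random.default_rng(0)
x = pareto_sample(100_000, 1.0, 1.0, rng)
a_hist, a_ccdf, a_mle = estimate_alpha(x)
print(a_hist, a_ccdf, a_mle)
```

Wrapping the last three lines in a 100-iteration loop and collecting the three estimates per run gives the sample means and standard deviations the problem asks for.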
3.3) Small-world clustering coefficient
Consider the following variation on the small-world model studied in class. Just like in the original model, we start with a circular lattice with n nodes in which each node is connected to the 2k nodes that are k positions or less away in the lattice. For each edge in this lattice, with probability p we add an edge between two nodes chosen uniformly at random, but we do not delete the original edge. That is, instead of rewiring edges, we keep the original lattice graph and add a few shortcut edges on top of it.
a) Show that when p = 0, the overall clustering coefficient of this graph is given by

C(0) = 3(k − 1) / (2(2k − 1)).
b) Show that for small p > 0 (so that terms with p^2 can be ignored), the expected clustering coefficient when n → ∞ is given by

C(p) = 3(k − 1) / (2(2k − 1) + 8kp).
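As a numeric sanity check for part a): networkx's watts_strogatz_graph with rewiring probability 0 produces exactly the circular lattice described above, and its transitivity should match the lattice value 3(k − 1)/(2(2k − 1)). A small sketch, assuming networkx:

```python
import networkx as nx

# watts_strogatz_graph(n, 2k, 0) is the plain ring lattice in which each
# node links to its 2k nearest neighbors (k on each side) -- no rewiring.
n = 1000
results = {}
for k in [2, 3, 5]:
    G = nx.watts_strogatz_graph(n, 2 * k, 0)
    predicted = 3 * (k - 1) / (2 * (2 * k - 1))
    results[k] = (nx.transitivity(G), predicted)
    print(k, results[k])
```

Because the lattice is vertex-transitive, every node has the same local clustering, so transitivity (the "overall" coefficient) coincides with it.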
3.4) Uniform attachment model
[adapted from MIT 1.022] In this exercise, we will apply the mean-field approximation idea that we saw in class to a non-preferential attachment setting. Consider a system where nodes are born over time and form edges to existing nodes at the time of their birth. The network formation process is as follows:
• Nodes are born over time and indexed by their time of birth, i.e., node i is born at time instant i, for i = 1,2,....
• The network is initialized with m nodes (born at times 1,...,m), all connected to one another. The first newborn node is thus the one born at time m + 1.
• Each newborn node selects m nodes uniformly at random among the existing set of nodes and links to them.
Let ki(t) be the degree of node i at time t. We will use a continuous-time mean-field analysis to track the evolution of the expected degrees of nodes.
a) What is ki(i) for i > m?
b) Find an expression for dki(t)/dt in the mean-field approximation and verify that ki(t) = m + mlog(t/i) for t ≥ i is a solution to that differential equation.
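The mean-field prediction ki(t) = m + m log(t/i) from part b) can be checked against a direct simulation of the growth process. A rough sketch using only the standard library; the parameter values, run count, and node tracked are arbitrary illustrative choices, and agreement is only approximate because of simulation noise:

```python
import math
import random

def simulate_uniform_attachment(m, T, rng):
    """Grow the network to T nodes; return the degree of every node."""
    deg = [m - 1] * m                           # initial clique on m nodes
    for t in range(m + 1, T + 1):
        # newborn node t picks m distinct existing nodes uniformly at random
        for j in rng.sample(range(t - 1), m):
            deg[j] += 1
        deg.append(m)                           # newborn starts with degree m
    return deg

rng = random.Random(1)
m, T, runs = 4, 3000, 30
i = 20                                          # track the node born at time i
avg = sum(simulate_uniform_attachment(m, T, rng)[i - 1] for _ in range(runs)) / runs
predicted = m + m * math.log(T / i)
print(avg, predicted)                           # should be close
```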
c) Let p(d) for d ≥ m be the probability of a node having degree d in the uniform attachment model. Derive an expression for p(d) from the CCDF, as done for the preferential attachment model.