Starting from:

$30

ELEC573-Network Science & Analytics Solved

2.1) Classification of cancer types . Let us begin by analyzing a genetic network describing gene-to-gene interactions. The network was downloaded from the NCI Nature database and it is included in the data folder along with this homework. You might want to use the function scipy.io.loadmat() to load the data geneNetwork rawPCNCI.mat in Python. The network consists of 2458 genes and, loosely speaking, two genes are connected if the proteins encoded by them participate in the same metabolism process.

We are going to study the genetic profiles of 240 patients diagnosed with different subtypes of ovarian cancer. We will see that by interpreting these genetic profiles as graph signals defined on the genetic networks, we will be able to clearly distinguish patients from different subtypes. Load the file signal  mutation.mat. You will see the aggregated graph signals X ∈ R240×2458. The i-th row of this matrix represents the genetic profile xi for the i-th patient. The n-th gene for this patient is mutated if the n-th entry of xi is 1 and is not mutated (normal) if the n-th entry is 0. Patients diagnosed with the same disease may exhibit different phenotypes, i.e., different variations of the same disease. The most effective therapies for different phenotypes may differ a lot and, for this reason, it is very beneficial if we can distinguish phenotypes based on the genetic profiles. Load the file histology  subtype.mat and you will see a vector y ∈ R240 that describes the patients phenotypes. The i-th element is 1 if patient i has serous subtype ovarian cancer and 2 if the patient has endometrioid subtype ovarian cancer. Our goal is to better differentiate between patients with these two subtypes based on graph Fourier analysis.

a)Distinguishing power  Take the graph-shift operator to be the Laplacian S = L = VΛV>. We want to find the frequencies vk such that the corresponding Graph Fourier Transform coefficient x˜(k) differs the most between patients with serous subtype and endometrioid subtype. There are many ways to do this, and we consider the following simple heuristic. First compute the GFTs x˜i = V>xi for all patients. Then for each frequency k, define the distinguishing power of vk as

where 1{·} is the indicator function. In words, DP(vk) computes the normalized difference between the mean GFT coefficient for vk among patients with serous subtype and the mean GFT coefficient among patients with endometrioid subtype. Generate a plot of DP(vk) versus k for all frequency indices k. Also, generate a boxplot of all the values taken by DP(vk).

b)k-NN for classification [8 points]. Implement a k nearest neighbors (k-NN) classifier and perform leaveone-out cross validation. More precisely, implement the following steps:

1) Compute the pairwise Euclidean distance between all pairs of patients using the original graph signals X. 2) For each patient, compare the most common histology of its k nearest neighbors to its actual diagnosis obtained from y.

3) Compute the accuracy of the classifier by aggregating all the comparisons of the previous step. Report the accuracies for k ∈ {3,5,7}.

c) Filtering almost all frequencies [7 points]. Consider a graph filter whose frequency response is to only let pass the frequency with the highest distinguishing power, i.e.

                                                                   h˜ (k) =         1, ifk = argmaxk DP(vk),                                                                  (2.
10, otherwise.

2-1

2-2                                                                                                                                        Denote the filtered GFT as x˜fi and its inverse GFT as xfi . Form a filtered graph signal matrix Xf with each of its rows representing the filtered genetic profile of each patient. Run the k-NN classifier of the previous point but using Xf as your data. Report your accuracies for k ∈ {3,5,7}. What do you observe? d) Filtering most frequencies . Consider now the more general filter

                                     h˜p(k) =          1, if DP(vk) ≥ p-th percentile of the distribution of DP,                                     (2.3)

0, otherwise,

where p is any real number between 0 and 100. This family of graph filters keep the information conveyed in frequencies that are distinguishable for two subtypes to some extent. The p here is chosen to select the number of frequencies to keep. Run the k-NN classifier on the filtered signals for different values of p. Report your accuracies for k ∈ {3,5,7} and p ∈ {0.75,0.80,0.85,0.90,0.95}.

2.2) Epidemics across the world

. In this problem we will consider the issue of epidemic spreading on networks and how to prevent their outbreaks. We will consider theoretical and computational aspects of an epidemic model while focusing on a real-world network of global flight connections as our main dataset.

a)Construct a network from the flight data 

You have been provided with 2 csv files (named airport Nodes GC.csv and airport Edges  GC.csv) that contain all the information of all the airports (nodes), and flights (links)[1]. The edges are directed and weighted, and are proportional to the number of daily flights from one airport to another. You should construct a network G from this data as follows.

•    Read in the edge data and create an initial (directed) network.

•    We will work with undirected graphs, so form the undirected graph of this network by symmetrizing its adjacency matrix as follows

Asym = (A + A>)/2.

•    If the undirected graph obtained is not connected, keep its giant component as your new graph and work with it from here onwards.

Hint: Note that command networkx.DiGraph.toundirected returns an undirected graph with the same name and nodes and with edge (u,v,data) if either (u,v,data) or (v,u,data) is in the digraph. If both edges exist in the digraph and their edge data is different, only one edge is created with an arbitrary choice of which edge data to use. You must check and correct for this manually if desired, or use other method to obtain the symmetry operation required above.

b)Plot the airport network 

Plot the flight network, using some of the extra information for the node files: use the longitude and latitude coordinates provided when drawing the nodes, and scale the node size according to their eigenvector centrality. You should take into account the weights of the edges when computing this centrality. You may find it useful as well to adjust the transparency of the edges for visual clarity. Hint: As a sanity check, the two nodes with highest eigenvector centrality (eigenvector normalized to have unit norm) should have the values approximately 0.179 and 0.170.

We will now study an epidemic model on this network, where every node is in one of two states: it is either infected, or susceptible to being infected. We denote the set of the vertices by V = {v1,...,vn} and the adjacency matrix by A = [aij]. The state of node vi at time t ≥ 0 is a binary random variable Xi(t) ∈ {0,1}. The state Xi(t) = 0 (resp., Xi(t) = 1) indicates that node vi is in the susceptible (resp., infected) state. We define the vector of states as X(t) = [X1(t),...,Xn(t)]>. The state of a node can experience two possible stochastic transitions:

i)Assume node vi is in the susceptible state at time t. This node can switch to the infected state during the time interval [t,t+∆) with a probability that depends on: (a) an infection rate β, (b) the probability of Homework 2 – Due 10/19/21      2-3

contact with its neighboring nodes Ni = {j | aij 6= 0} and their states. We can write the probability of this transition as

Prob[Xi(t + ∆) = 1|Xi(t) = 0,X(t)] = 1 − Πj∈Ni∧Xj(t)=1(1 − βaij).                                          (2.4)

We usually have  

   1. Therefore, instead of (2.4), we use the following first-order approximation Prob[Xi(t + ∆) = 1|Xi(t) = 0,X(t)] = X βaijXj(t).       (2.5) j∈Ni

ii) Assuming that node vi is infected, the probability of vi recovering back to the susceptible state in the time interval [t,t + ∆) is given by

Prob[Xi(t + ∆) = 0|Xi(t) = 1] = γ,                  (2.6)

where 0 ≤ γ ≤ 1 is the curing rate.

c)Show that the first-order approximation in (2.5) is in fact an upper bound for the original transition probabilities in (2.4).

The spread model characterized by (2.5) and (2.6) may be hard to analyze for large-scale networks. One standard approach is to use a mean-field approximation of the model: define pi(t) = Prob[Xi(t) = 1] = E[Xi(t)], i.e., the marginal probability of node vi being infected at time t. We can use (2.5) and (2.6) to approximate the dynamics of pi(t):

 

This approximation is widely used in the field of epidemic analysis and control, since it performs numerically well for many realistic network topologies.

d) Simulating the spread dynamics using the mean-field approximation 

Suppose that the first 20 nodes (according to their ids in the dataset) are initially infected. Use (2.7) to simulate the evolution of the expected size of the infection defined as ¯         ) over time, assuming a recovery rate γ = 0.1 and an infection rate β = 0.1. (Hint: you can use   where ∆ = 0.05 and time horizon [0,5]). Repeat the experiment for β = 0.05 and β = 0.01. Plot three curves (one for each value of β) showing the evolution of the expected size of the infection over time.

It can be shown that a sufficient condition for virus extinction, that is to ensure that the virus will eventually die out, is to have                                                                 

 

 (2.8)
An immunization strategy is to choose a subset of nodes I ⊆ V and immunizing them against the virus. An immunized node can neither get infected nor pass the infection. We call an immunization strategy “effective” if it results in the eventual extinction of the virus (almost surely), no matter how widespread the initial infection is. A potential idea for designing an effective immunization strategy is to rank the nodes based on some centrality measure and immunize them in order of their centralities until the condition (2.8) is satisfied. Note that immunizing node vi is equivalent to removing the i-th row and column from A.

e)Assume that β = 0.01 and γ = 0.4. (i) What is the minimum number of immunizations required to satisfy condition (2.8) if you use (weighted) degree centrality to sort the nodes? (ii) What is this minimum number when you use (weighted) eigenvector centrality instead?

In solving this last point, compute a single ranking for the nodes for the complete network and delete nodes one by one until you satisfy the condition (2.8). No need to recompute the centralities after deleting nodes (although that could be another reasonable heuristic).

 

[1] The original dataset was made available by Tore Opsahl and is discussed in the blog postWhy Anchorage is not (that) important: Binary ties and Sample selection. Available at http://wp.me/poFcY-Vw.

More products