Who let the DAGs out?
Remember DAGs? Good. It’s now time to learn-how-to-learn their topology (at least in the Gaussian case) and then put them to work in a biological setup.
Our statistical recipe needs a lot of ingredients, namely:
1. The basics of the Likelihood Ratio Test (lrt) method.
2. The concept and ideas behind Universal Inference.
3. The notion of Gaussian DAGs and their (constrained) likelihood functions.
4. The incarnation of the lrt for testing directed connections in Gaussian DAGs.
...I guess we'd better start...
Ingredient (A): The Likelihood Ratio Test
The Wald test is useful for testing a scalar parameter. The Likelihood Ratio Test (lrt) is more general and can be used for testing a vector–valued parameter. More specifically:
The Likelihood Ratio Test
Within a parametric model F = {fθ(·) : θ ∈ Θ ⊆ R^p}, consider testing

H0 : θ ∈ Θ0   vs   H1 : θ ∉ Θ0.

Given an iid sample Dn = {X1,...,Xn} ∼ fθ(·) with the associated likelihood function L(θ | Dn) = ∏_i fθ(Xi), the likelihood ratio test rejects the null if Un > c for some suitable critical value c, with

U_n = \frac{\sup_{\theta \in \Theta} L(\theta \mid D_n)}{\sup_{\theta \in \Theta_0} L(\theta \mid D_n)} = \frac{L(\hat\theta \mid D_n)}{L(\hat\theta_0 \mid D_n)},

where θ̂0 denotes the mle constrained to Θ0 (i.e. computed assuming H0 is true), whereas θ̂ denotes the unconstrained mle.
Remarks:
• Replacing Θ0ᶜ with Θ (i.e. taking the supremum over the whole parameter space rather than over the alternative only) has little effect on the test statistic, and the unconstrained version simplifies the theoretical properties of the test statistic.
• The likelihood ratio test is most useful when Θ0 consists of all parameter values θ such that some coordinates of θ are fixed at particular values.
Let’s now look at a famous example, testing for the mean of a Normal population: one of the few cases where we have exact, finite-sample results.
Example: Student’s t–test
Let Dn = {X1,...,Xn} be iid from a N(µ,σ2) and we want to test
H0 : µ = µ0   vs   H1 : µ ≠ µ0,

and the likelihood ratio statistic is

U_n = \frac{L(\hat\mu, \hat\sigma \mid D_n)}{L(\mu_0, \hat\sigma_0 \mid D_n)},
where σb0 maximizes the likelihood under the null, that is, subject to µ = µ0.
After some simple but tedious algebra, it can be shown that Un > c ⇔ |Tn| > c′, where

T_n = \frac{\bar X_n - \mu_0}{S_n / \sqrt{n}} \;\overset{H_0}{\sim}\; t_{n-1}, \qquad \text{with} \quad S_n^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar X_n)^2.
So the final two–sided test on the mean µ of a Normal population is:
Reject H0 if |Tn| > tn−1,α/2 (Student’s t–test).
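Before moving on, here is a minimal R sketch (ours, not part of the original example) that checks numerically that Un is a monotone function of |Tn|, so the two rejection rules are indeed equivalent:

```r
set.seed(123)
n   <- 30
x   <- rnorm(n, mean = 0.3, sd = 1)   # data generated away from the null
mu0 <- 0

# Unconstrained mles and the H0-constrained mle of sigma^2
mu.hat  <- mean(x)
s2.hat  <- mean((x - mu.hat)^2)
s2.hat0 <- mean((x - mu0)^2)

# log(Un): unconstrained vs constrained Gaussian log-likelihoods
logUn <- sum(dnorm(x, mu.hat, sqrt(s2.hat),  log = TRUE)) -
         sum(dnorm(x, mu0,    sqrt(s2.hat0), log = TRUE))

# Student's t statistic and the closed-form link between the two
Tn <- (mu.hat - mu0) / (sd(x) / sqrt(n))           # sd() uses the (n - 1) divisor
c(logUn = logUn, via.Tn = n / 2 * log(1 + Tn^2 / (n - 1)))   # identical values

# Exact finite-sample decision at level alpha = 0.05
abs(Tn) > qt(0.975, df = n - 1)
```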
Similarly to the Wald test, in more general situations where we are dealing with non-Gaussian (but still regular!) populations, all we can do to appropriately tune the critical value c, so as to control the type-I error probability of our lrt, is to appeal to some suitable, broadly applicable, asymptotic result. More specifically, here are two classics:
Asymptotic approximation / scalar
Consider testing H0 : θ = θ0 versus H1 : θ ≠ θ0, where θ ∈ R. Then, under H0 (+ regularity conditions on the population model F),

T_n = 2 \log U_n \;\overset{d}{\longrightarrow}\; \chi^2_1.

Hence an asymptotic level-α test is: Reject H0 when Tn > χ²_{1,α}.
Asymptotic approximation / vector
Consider testing a null H0 : θ ∈ Θ0 ⊆ R^p obtained by fixing some of the parameters. Then, under H0 (+ regularity conditions on the population model F),

T_n = 2 \log U_n \;\overset{d}{\longrightarrow}\; \chi^2_\nu, \qquad \text{where} \quad \nu = \dim(\Theta) - \dim(\Theta_0).

Hence an asymptotic level-α test is: Reject H0 when Tn > χ²_{ν,α}.
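As a quick illustration (ours; the model and the numbers below are arbitrary), here is the scalar recipe at work on an Exponential population, testing H0 : λ = 1, together with a Monte Carlo check that the χ²₁ calibration keeps the type-I error close to α:

```r
set.seed(1)
n <- 50; lambda0 <- 1; alpha <- 0.05
loglik <- function(lambda, x) sum(dexp(x, rate = lambda, log = TRUE))

# One dataset generated under H0
x <- rexp(n, rate = lambda0)
lambda.hat <- 1 / mean(x)                              # unconstrained mle
Tn <- 2 * (loglik(lambda.hat, x) - loglik(lambda0, x)) # 2 log(Un)
Tn > qchisq(1 - alpha, df = 1)                         # asymptotic level-alpha decision

# Monte Carlo estimate of the type-I error of the asymptotic test
reject <- replicate(2000, {
  x.sim <- rexp(n, rate = lambda0)
  2 * (loglik(1 / mean(x.sim), x.sim) - loglik(lambda0, x.sim)) > qchisq(1 - alpha, df = 1)
})
mean(reject)   # should be close to alpha
```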
Ingredient (B): Universal Inference
As we all should know at this point, in classical frequentist statistics, confidence sets and tests are often obtained by starting from asymptotically Gaussian estimators or other large sample results.
As a consequence, their validity relies on large-sample asymptotic theory and requires that the statistical model satisfies certain regularity conditions. When these conditions do not hold, or the sample is not large “enough”, there is no general method for statistical inference, and these settings are typically handled in an ad-hoc manner.
Recently, a new universal method, based simply on sample splitting, has been introduced: it yields tests and confidence sets for any statistical model (regular or not) and also comes with finite-sample guarantees.
Focusing on hypothesis testing: historically, sample splitting was first analysed, computationally and theoretically, in a four-page 1975 paper by Sir David Cox, one of the greatest statisticians of all time, and then further discussed in his 1977 review (2, Section 3.2), where he describes the method as well known and refers to an American Statistician paper with a wide-ranging discussion of “snooping”, “fishing”, and “hunting” in data analysis. Honestly? An easy read, way more relevant now than then!
Let’s now describe this idea in the context of lrt:
Universal Hypothesis Test
Let D2n = {X1,...,X2n} be an iid sample from a population model F having density fθ(·) and consider testing
H0 : θ ∈ Θ0   vs   H1 : θ ∉ Θ0.
To this end, (randomly) split the data D2n in two groups having the same size n (just to simplify the notation), and build the two corresponding likelihood functions:
D_{2n} \;\longrightarrow\; \{D_n^{Tr}, D_n^{Te}\}, \qquad L(\theta \mid D_n^{Tr}) = \prod_{i \in D_n^{Tr}} f_\theta(X_i), \quad L(\theta \mid D_n^{Te}) = \prod_{i \in D_n^{Te}} f_\theta(X_i).
Consider now the following two mles for θ:

\hat\theta_0^{Tr} = \arg\max_{\theta \in \Theta_0} L(\theta \mid D_n^{Tr})   (the H0-constrained estimator based on the Training data),
\hat\theta^{Te} = \arg\max_{\theta} L(\theta \mid D_n^{Te})   (the unconstrained estimator based on the Test data).

At this point we are ready to define the following two universal test statistics:

U_n = \frac{L(\hat\theta^{Te} \mid D_n^{Tr})}{L(\hat\theta_0^{Tr} \mid D_n^{Tr})}  (Split Likelihood Ratio) \qquad \text{and} \qquad W_n = \frac{U_n + U_n^{swap}}{2}  (Cross-Fit Likelihood Ratio),   (1)

where U_n^{swap} is the same as U_n after swapping the roles of D_n^{Tr} and D_n^{Te}.
Based on Un and Wn, we get the following two universal testing procedures:

Reject H0 if U_n > 1/α  (Split LRT) \qquad \text{and} \qquad Reject H0 if W_n > 1/α  (Cross-Fit LRT).   (2)
By a simple application of Markov's inequality, and under no assumptions on the population model F, the Authors show (their Theorem 3) that the Split and Cross-Fit LRTs control the type-I error probability at level α in finite samples.
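To make the recipe concrete, here is a minimal R sketch (ours, under the simple N(µ, σ²) model of the t-test example) of both the Split and the Cross-Fit LRT for H0 : µ = µ0:

```r
set.seed(42)
split_lrt_gauss <- function(x, mu0, alpha = 0.05) {
  n2  <- length(x)
  idx <- sample(n2, floor(n2 / 2))            # random split into Train / Test
  tr  <- x[idx]; te <- x[-idx]

  log_u <- function(tr, te) {
    mu.te  <- mean(te); s2.te <- mean((te - mu.te)^2)  # unconstrained mle on Test
    s2.tr0 <- mean((tr - mu0)^2)                       # H0-constrained mle on Train
    # log(Un): both likelihoods are evaluated on the Training half
    sum(dnorm(tr, mu.te, sqrt(s2.te), log = TRUE)) -
      sum(dnorm(tr, mu0, sqrt(s2.tr0), log = TRUE))
  }
  Un <- exp(log_u(tr, te))
  Wn <- (Un + exp(log_u(te, tr))) / 2          # cross-fit: swap the two halves
  list(Un = Un, Wn = Wn,
       reject.split    = Un > 1 / alpha,
       reject.crossfit = Wn > 1 / alpha)
}

x <- rnorm(100, mean = 0.5, sd = 1)            # data generated away from the null
split_lrt_gauss(x, mu0 = 0)
```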
Ingredient (C): Gaussian DAGs and their (constrained) Likelihood
We (should) know that we use DAG models to encode the joint distribution f(x) of a random vector X = [X1,...,Xp]^T ∈ R^p: the nodes and the directed edges represent, respectively, the variables and the parent-child dependence relations between any two variables. We also know that the joint f(x) is Markov w.r.t. a graph G if it admits the following factorization
f(x) = \prod_{j=1}^{p} f\big(x_j \mid \mathrm{pa}(x_j)\big),
where pa(xj) denotes the set of variables with an arrow towards Xj in the (directed) graph G.
Our main goal here is to lay down a strategy to infer the pairwise relations imposed by the (local) Markov dependence. To get started, we first restrict the scope of our analysis to a Gaussian random vector X ∼ N_p.
Under this assumption, we can capture the directional effects induced by directed edges by using a linear structural
equation model[1]
X_j = \sum_{k:\, k \neq j} A[j,k] \, X_k + \epsilon_j, \qquad \epsilon_j \overset{iid}{\sim} N_1(0, \sigma^2),   (3)

and A is a (p × p) adjacency matrix in which a nonzero entry A[j,k] in position (j,k) corresponds to a directed edge from parent node k to child node j, with its value indicating the strength of the relation; A[j,k] = 0 when k ∉ pa(Xj).
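For intuition (and for the simulation study later on), here is a short R sketch, entirely ours, that generates data from the structural equation model in Equation 3 for a small hand-made DAG:

```r
set.seed(7)
p <- 4; n <- 200; sigma <- 1

# Adjacency of a toy DAG: 1 -> 2, 1 -> 3, 2 -> 4, 3 -> 4
# (recall: A[j, k] != 0 encodes an edge from parent k to child j)
A <- matrix(0, p, p)
A[2, 1] <- 0.8; A[3, 1] <- -0.5; A[4, 2] <- 0.7; A[4, 3] <- 0.6

# Equation 3 in matrix form: x = A x + eps  =>  x = (I - A)^{-1} eps
eps <- matrix(rnorm(n * p, sd = sigma), n, p)
X   <- t(solve(diag(p) - A, t(eps)))   # (n x p) data matrix, one row per observation
```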
Learning Goal
Given a random sample Dn = {X1,...,Xn} iid∼ f(·), where Xi = [Xi,1,...,Xi,p]^T ∈ R^p, infer the adjacency matrix A subject to the requirement that A defines a directed acyclic graph.
Denoting by θ = (A,σ2) the parameters of our model, to approach the problem from a maximum likelihood perspective, we need to write down:
1. The (log-)likelihood function ℓn(θ) = ln L(θ | Dn).
2. How to introduce acyclicity constraints to drive the learning process.
By collecting all the available data in an (n × p) matrix X = [x1 ···xi ···xn]T, and because of the Gaussian nature of the structural equation model in Equation 3, the log-likelihood is readily available:
\ell_n(A, \sigma^2) = -\sum_{j=1}^{p} \left( \frac{1}{2\sigma^2} \sum_{i=1}^{n} \Big( X[i,j] - \sum_{k:\, k \neq j} A[j,k]\, X[i,k] \Big)^{2} + \frac{n}{2} \ln \sigma^2 \right).   (4)
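Equation 4 is straightforward to code up; the following R sketch (ours) will also come in handy later when building the universal statistics. It takes the (n × p) data matrix X and a candidate pair (A, σ²):

```r
# Log-likelihood of (A, sigma^2) for the Gaussian SEM, a direct transcription of
# Equation 4 (additive constants in 2*pi are dropped)
loglik_sem <- function(A, sigma2, X) {
  n <- nrow(X)
  resid <- X - X %*% t(A)     # resid[i, j] = X[i, j] - sum_k A[j, k] * X[i, k]
  -sum(colSums(resid^2) / (2 * sigma2) + (n / 2) * log(sigma2))
}

# e.g., with A and X from the previous sketch:
loglik_sem(A, sigma2 = 1, X)
```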
The acyclicity constraints are anything but trivial. Fortunately, the hard work has already been done by Yuan et al. in 2019[2]. Based on their Theorem 1, we know that A satisfies the acyclicity constraints if and only if there exists a (p × p) matrix Λ s.t.:
\Lambda[r,k] + 1(j \neq k) - \Lambda[j,k] \;\geq\; 1\big(A[r,j] \neq 0\big) \qquad \text{for all } r,j,k \in \{1,\dots,p\} \text{ and } r \neq j,   (5)
where 1(·) denotes the indicator function. Although the matrix Λ may not be unique, Equation 5 has three clear pros:
1. It reduces the original super-exponentially many constraints on A to p³ − p² active constraints on (A, Λ).
2. It encodes acyclicity in a simple way: each constraint involves only one parameter in A and is linear in Λ[j,k] and 1(A[r,j] ≠ 0).
3. The non-convexity of the constraints in Equation 5, due to the presence of the indicator 1(z ≠ 0), can be easily handled by replacing it with a computational surrogate, the truncated L1-function Jτ(z) = min(|z|/τ, 1): a piecewise-linear function that converges to the indicator as τ ↓ 0. This yields the following approximated set of acyclicity constraints:
\Lambda[r,k] + 1(j \neq k) - \Lambda[j,k] \;\geq\; J_\tau\big(A[r,j]\big) \qquad \text{for all } r,j,k \in \{1,\dots,p\} \text{ and } r \neq j.   (6)
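The surrogate is a one-liner in R; the quick sketch below (ours) just visualises how Jτ approaches the indicator as τ shrinks:

```r
# Truncated L1 surrogate of the indicator 1(z != 0)
J_tau <- function(z, tau) pmin(abs(z) / tau, 1)

z <- seq(-1, 1, by = 0.01)
plot(z, J_tau(z, tau = 0.5), type = "l", ylab = "J_tau(z)", main = "Truncated L1 surrogate")
lines(z, J_tau(z, tau = 0.1), lty = 2)   # smaller tau: closer to the indicator
```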
Now, before putting everything together in a test statistic, one last addition is the introduction of a sparsity/complexity constraint on A, very useful to handle high-dimensional setups where p ≫ n:
\sum_{r,j:\, r \neq j} 1\big(A[r,j] \neq 0\big) \;\leq\; \kappa \qquad \leadsto \qquad \sum_{r,j:\, r \neq j} J_\tau\big(A[r,j]\big) \;\leq\; \kappa.   (7)
Here κ > 0 is a tuning parameter to be selected possibly via a suitable grid-search.
Learning Problem
Squeezing together Equation 4, Equation 6 and Equation 7, the directed acyclic graph learning problem can be reformulated from a constrained maximum likelihood perspective as follows:

(\hat A, \hat\sigma^2) = \arg\max_{(A, \sigma^2, \Lambda)} \ell_n(A, \sigma^2)

subject to \quad \sum_{r,j:\, r \neq j} J_\tau\big(A[r,j]\big) \leq \kappa,   (8)

\qquad\qquad \Lambda[r,k] + 1(j \neq k) - \Lambda[j,k] \geq J_\tau\big(A[r,j]\big) \quad \text{for all } r,j,k \in \{1,\dots,p\} \text{ and } r \neq j.
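In practice you will not solve Equation 8 by hand: the clrdag package does it through MLEdag(). Below is a hypothetical usage sketch; the only detail confirmed in these notes is that the sparsity parameter κ is called mu, so treat the remaining argument and output names as assumptions to be checked against ?MLEdag.

```r
# Hypothetical sketch: argument/output names other than `mu` are assumptions,
# please double-check them against ?MLEdag after installing clrdag.
library(clrdag)

fit <- MLEdag(X   = X,     # (n x p) data matrix (e.g., X simulated above)
              tau = 0.3,   # assumed name: truncation parameter of J_tau
              mu  = 1,     # sparsity parameter (our kappa)
              rho = 1.2)   # assumed name: ADMM penalty parameter
fit$A                      # assumed name: the estimated adjacency matrix
```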
Ingredient (D): Constrained Likelihood Ratio Test
Now that we know how to fit our Gaussian model enforcing acyclicity (and sparsity!), we are finally in a position to build some dedicated lrt to check for the presence of specific directed connections of interest. Along the lines of Li et al. (2019)[3], we have two possible types of tests concerning directional pairwise relations, encoded by an adjacency matrix A. More specifically:
1. Test of Graph Linkages
Let F be an index set where an index (j,k) ∈ F represents a directed connection. We are interested in testing:
H0 : A[j,k] = 0 for all (j,k) ∈ F \qquad vs \qquad H1 : not H0,

with constrained likelihood ratio statistic

U_n = \frac{L\big(\hat A, \hat\sigma^2 \mid D_n\big)}{L\big(\hat A_0, \hat\sigma^2_0 \mid D_n\big)},   (9)

where (\hat A, \hat\sigma^2) are the unconstrained mles solving the optimization problem in Equation 8, whereas (\hat A_0, \hat\sigma^2_0) are the H0-constrained mles solving the following H0-augmented optimization problem:

(\hat A_0, \hat\sigma^2_0) = \arg\max_{(A, \sigma^2, \Lambda)} \ell_n(A, \sigma^2)

subject to \quad A[F] = 0,

\qquad\qquad \sum_{r,j:\, r \neq j} J_\tau\big(A[r,j]\big) \leq \kappa,   (10)

\qquad\qquad \Lambda[r,k] + 1(j \neq k) - \Lambda[j,k] \geq J_\tau\big(A[r,j]\big) \quad \text{for all } r,j,k \in \{1,\dots,p\} \text{ and } r \neq j.
2. Test of Directed Pathway
A directed pathway is specified by an index set F of size |F| in which any two consecutive indices share a common node, like F = {(j1, j2), (j2, j3), ..., (j_{|F|−1}, j_{|F|})}. We are interested in testing:
H0 : A[j,k] = 0 for some (j,k) ∈ F \qquad vs \qquad H1 : A[j,k] ≠ 0 for all (j,k) ∈ F,

with constrained likelihood ratio statistic

U_n = \frac{L\big(\hat A, \hat\sigma^2 \mid D_n\big)}{\max_{1 \leq k \leq |F|} L\big(\hat A_{0,k}, \hat\sigma^2_{0,k} \mid D_n\big)} = \min_{1 \leq k \leq |F|} \frac{L\big(\hat A, \hat\sigma^2 \mid D_n\big)}{L\big(\hat A_{0,k}, \hat\sigma^2_{0,k} \mid D_n\big)},   (11)

where (\hat A, \hat\sigma^2) are the unconstrained mles solving the optimization problem in Equation 8, whereas (\hat A_{0,k}, \hat\sigma^2_{0,k}) are the H0-constrained mles solving, for each k ∈ {1,...,|F|}, the following H0-augmented optimization problem:

(\hat A_{0,k}, \hat\sigma^2_{0,k}) = \arg\max_{(A, \sigma^2, \Lambda)} \ell_n(A, \sigma^2)

subject to \quad A[j_k, j_{k+1}] = 0 \quad \text{(the k-th connection in F)},   (12)

\qquad\qquad \sum_{r,j:\, r \neq j} J_\tau\big(A[r,j]\big) \leq \kappa,

\qquad\qquad \Lambda[r,k] + 1(j \neq k) - \Lambda[j,k] \geq J_\tau\big(A[r,j]\big) \quad \text{for all } r,j,k \in \{1,\dots,p\} \text{ and } r \neq j.
Your Main Goal
Despite the fact that the Authors of the original papers derived, and implemented in MLEdag(), the χ²-asymptotics for the Likelihood Ratio statistics under both testing scenarios, your goal here is to push the finite-sample envelope and adapt the two universal procedures of Equation 2 to the current framework.
You will be using MLEdag() to get the mles you need and also to compare the results you get, so go install that package!
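As a starting point, here is a skeleton (ours, not the Authors' code) of the Split LRT for a linkage hypothesis: it assumes you wrap your two MLEdag() calls into functions fit_H0() and fit_H1() returning a list with components A and sigma2 (hypothetical names), and it reuses the loglik_sem() helper sketched above.

```r
# Sketch of the Split LRT of Equation 2 applied to a linkage hypothesis
# H0: A[F] = 0. `fit_H0` and `fit_H1` are user-supplied wrappers around MLEdag()
# (constrained and unconstrained fits); the names A / sigma2 are hypothetical.
split_lrt_dag <- function(X, fit_H0, fit_H1, alpha = 0.05) {
  n   <- nrow(X)
  idx <- sample(n, floor(n / 2))             # random split of the rows
  Xtr <- X[idx, , drop = FALSE]
  Xte <- X[-idx, , drop = FALSE]

  theta1 <- fit_H1(Xte)   # unconstrained (A, sigma2) estimated on the Test half
  theta0 <- fit_H0(Xtr)   # H0-constrained (A, sigma2) estimated on the Training half

  # log(Un): both likelihoods evaluated on the Training half, as in Equation 1
  logU <- loglik_sem(theta1$A, theta1$sigma2, Xtr) -
          loglik_sem(theta0$A, theta0$sigma2, Xtr)
  list(logU = logU, reject = logU > log(1 / alpha))
}
```

Averaging this statistic with the one obtained after swapping the two halves gives the Cross-Fit LRT of Equation 1.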
The Data: Cell Signalling
We will use all the machinery above to dig into the multivariate flow cytometry data of Sachs et al. (2005). Suppose we are studying the human immune system, in particular the so-called T helper cells. To properly understand their normal responses to some specific extracellular stimuli, we can perturb them with a series of chemical stimulatory/inhibitory interventions and then profile the effects of each condition by measuring the expression of relevant proteins and lipids via flow cytometry. The data collection process is described in Figure 1:
1. each cell is treated as an independent observation;
2. 9 different perturbations were applied to sets of individual cells, namely: cd3cd28, cd3cd28icam2, cd3cd28+aktinhib, cd3cd28+g0076, cd3cd28+psitect, cd3cd28+u0126, cd3cd28+ly, pma, b2camp. The known biological effects of the perturbations employed are described in Table 1 of Sachs et al. (2005);
3. on each cell in each condition, a multiparameter flow cytometer simultaneously records the levels of 11 proteins and lipids, namely: praf, pmek, plcg, PIP2, PIP3, p44/42, pakts473, PKA, PKC, P38, and pjnk. These simultaneous measurements allow researchers to infer causal influences in cellular signalling.
Figure 1: Source: Sachs et al. (2005)
The Exercise: ToDo List
1. Read (really) these notes, and install the clrdag package directly from git:
install.packages("devtools")
devtools::install_github("chunlinli/clrdag/pkg/clrdag")
Notice that our sparsity parameter κ is mu in the main function MLEdag(). To install this package on OSX, you need to “brew” gcc and then make it visible to R. Please read this and/or contact us.
2. By using the MLEdag() function to get constrained and unconstrained mles, adapt and implement in R at least one of the universal tests in Equation 2 to the problem of testing for graph linkages and directed pathway.
3. Starting from the code in the examples folder of the clrdag package and trying to find some inspiration from Example 1 and Example 2 in Section 5.1 of Li et al., design and run a decent simulation study to check size and power of your universal test(s) for linkage.
Let’s now move to cell signalling and flow cytometry. By stacking together the data coming from all 9 interventions, Sachs et al. (2005) estimated the DAG in Figure 2 via a data-discretization technique. The edges were categorized as: (i) expected, for connections well-established in the literature that have been demonstrated under numerous conditions in multiple model systems; (ii) reported, for connections that are not well known, but for which the authors were able to find at least one literature citation; and (iii) missing, which indicates an expected connection that their network failed to find. Of the 17 arcs in their model, 15 were expected, all 17 were either expected or reported, and 3 expected connections were missed.
Figure 2: Source: Sachs et al. (2005)
4. Looking at the estimated DAG in Figure 2, formalize at least 3 linkage-type hypotheses and 1 pathway-type hypothesis that you feel may be interesting to double-check with your own toolkit. Briefly explain why you find them interesting.
5. Select one specific intervention out of the 9 available and, based on those data only, test your set of hypotheses using both your universal procedures and the asymptotics implemented in the MLEdag() function.
Compare the results also as you let the sparsity parameter κ vary.
Achtung! Achtung!
Before running any test, keep in mind we are working under the assumption that the data are zero-mean Gaussian! Use basic tools like boxplot(), hist(), density(), qqplot() and pairs() to visually check what’s going on with each of your 11 variables, and possibly apply suitable transformations like scale() and MASS::boxcox() to “enforce” it (a minimal sketch is given right after this list).
6. Repeat the previous analysis augmenting the data by stacking together those coming from different conditions. Despite their heterogeneity, ideally we would like to pool together all the data available, as Sachs et al. (2005) did. Nevertheless, if you find handling the “über-dataset” computationally infeasible, simply keep stacking data till you can. Compare the results with those in 5. and draw some conclusions. Do you think we need to adjust for multiplicity here? Explain.
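Here is the promised sketch of those marginal checks (ours; sachs stands for whatever data frame holds the 11 measurements of the intervention you picked):

```r
# Quick marginal checks before any testing; `sachs` is a placeholder name for a
# data frame with the 11 protein/lipid measurements of the chosen intervention.
op <- par(mfrow = c(3, 4))
for (v in names(sachs)) {
  qqnorm(sachs[[v]], main = v)
  qqline(sachs[[v]])
}
par(op)

# Cytometry data are typically heavily right-skewed: a log transform (if all
# values are strictly positive, otherwise try MASS::boxcox()) followed by
# centering/scaling often gets you much closer to the zero-mean Gaussian ideal.
sachs_t <- scale(log(sachs))
```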
[1] The homoscedasticity of the errors {εj}j is not required but is useful to induce identifiability and avoid technicalities regarding equivalence classes.
In addition, individual means µj could be incorporated by adding intercepts to Equation 3. For simplicity, in what follows we set the means to zero.
[2] An implementation of their method is available at: https://github.com/chunlinli/clrdag. Notice that our sparsity parameter κ is mu in the main function MLEdag(). To install this package on OSX, you need to “brew” gcc and then make it visible to R. Please read this and/or contact us.
[3] An implementation of their method is available at: https://github.com/ciryiping/TLPDAG.