Computational Stats Lab 1: Reminder on Markov Chains and Stochastic Gradient Descent

Exercise 1: Box-Muller and Marsaglia-Bray algorithms
Let R be a random variable with the Rayleigh distribution with parameter 1 and Θ a random variable with the uniform distribution on [0, 2π]. We also assume that R and Θ are independent. The density of R is given by:

∀r ∈ ℝ,   fR(r) = r exp(−r²/2) 1_{ℝ+}(r).

Let X and Y be such that

X = R cos(Θ)   and   Y = R sin(Θ).

Prove that both X and Y have N(0,1) distribution and are independent.

Write an algorithm for sampling independent Gaussian N(0,1) random variables.
Algorithm 1: Marsaglia-Bray algorithm

1 while V1² + V2² > 1 do
2   Sample U1, U2 independent r.v. with distribution U([0,1]) ;
3   Set V1 = 2U1 − 1 and V2 = 2U2 − 1 ;
4 end
5 Set S = V1² + V2² ;
6 Set X = V1 √(−2 log(S)/S) and Y = V2 √(−2 log(S)/S) ;
7 return (X, Y).

Consider the algorithm given above.
What is the distribution of (V1,V2) at the end of the "while" loop?
Set

T = V1 / √(V1² + V2²)   and   V = V1² + V2².

Show that T and V are independent, V ∼ U([0,1]) and T has the same distribution as cos(Θ) with Θ ∼ U([0,2π]).

What is the distribution of the output (X,Y )?
What is the acceptance probability of the test in the "while" loop?
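As a sanity check on Exercise 1, here is a minimal NumPy sketch of both samplers. The function names, the seed, and the guard against S = 0 (to avoid log 0) are our own illustrative choices, not part of the statement.

```python
import numpy as np

rng = np.random.default_rng(0)

def box_muller(n, rng=rng):
    """Sample n pairs of i.i.d. N(0,1) variables: R = sqrt(-2 log U1) is
    Rayleigh(1) by inversion and Theta = 2*pi*U2 is U([0, 2*pi])."""
    u1, u2 = rng.uniform(size=n), rng.uniform(size=n)
    r, theta = np.sqrt(-2.0 * np.log(u1)), 2.0 * np.pi * u2
    return r * np.cos(theta), r * np.sin(theta)

def marsaglia_bray(rng=rng):
    """One run of Algorithm 1: rejection from the square onto the disc,
    then transformation of (V1, V2) into a pair of i.i.d. N(0,1)."""
    s = 2.0
    while s > 1.0 or s == 0.0:           # acceptance test of the while loop
        v1 = 2.0 * rng.uniform() - 1.0
        v2 = 2.0 * rng.uniform() - 1.0
        s = v1 * v1 + v2 * v2
    factor = np.sqrt(-2.0 * np.log(s) / s)
    return v1 * factor, v2 * factor

x, y = box_muller(100_000)
print(x.mean(), x.std(), y.mean(), y.std())   # approximately 0, 1, 0, 1
```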
Exercise 2: Invariant distribution
We define a Markov chain (Xn)n≥0 with values in [0,1] as follows: given the current value Xn (n ∈ ℕ) of the chain,

• if Xn = 1/m (for some positive integer m), we let:

Xn+1 = 1/(m+1)   with probability 1 − Xn²,
Xn+1 ∼ U([0,1])   with probability Xn² ;

• if not, Xn+1 ∼ U([0,1]).
Prove that the transition kernel of the chain (Xn)n≥0 is given by:

P(x, A) = x² ∫_{A∩[0,1]} dt + (1 − x²) δ_{1/(m+1)}(A)   if x = 1/m for some positive integer m,

P(x, A) = ∫_{A∩[0,1]} dt   otherwise,

where δα is the Dirac measure at α.

Prove that the uniform distribution on [0,1] is invariant for P. In the following, this invariant distribution will be denoted by π.
Let x ∉ {1/m : m ∈ ℕ*}. Compute Pf(x) for a bounded measurable function f. Deduce Pⁿf(x) for all n > 1. Compute lim_{n→+∞} Pⁿf(x) in terms of ∫ f(x) π(x) dx.

Let x = 1/m with m > 2. Let n ∈ ℕ*. Compute Pⁿ(x, {1/(m+n)}) in terms of m and n.
Do we have lim_{n→+∞} Pⁿ(x, A) = π(A) when A = ∪_{q∈ℕ} {1/(m+1+q)}?
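Before proving anything, one can simulate the chain. The sketch below (our own, with illustrative seed, m and horizon) tracks whether the chain is still on the grid {1/k : k ∈ ℕ*}, which is exactly the event whose probability Pⁿ(x, {1/(m+n)}) measures.

```python
import numpy as np

rng = np.random.default_rng(1)

def run_chain(m, n_steps, rng=rng):
    """Run the chain from x = 1/m, tracking the integer k while the chain
    stays on the grid {1/k}; once off the grid, it stays off a.s."""
    on_grid, k = True, m
    for _ in range(n_steps):
        if on_grid and rng.uniform() > (1.0 / k) ** 2:  # w.p. 1 - x^2: go to 1/(k+1)
            k += 1
        else:                                           # w.p. x^2, or off grid: U([0,1])
            on_grid = False
    return on_grid

# Monte Carlo estimate of P^n(1/m, {1/(m+n)}) for m = 2, n = 10:
m, n, trials = 2, 10, 100_000
est = np.mean([run_chain(m, n) for _ in range(trials)])
exact = np.prod([1.0 - 1.0 / (m + k) ** 2 for k in range(n)])
print(est, exact)   # both approximately 0.545 for m = 2, n = 10
```

Since this probability stays bounded away from 0 as n grows, while π gives measure 0 to the countable set ∪_{q∈ℕ} {1/(m+1+q)}, the simulation hints at the answer to the last question.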

Exercise 3: Stochastic Gradient Learning in Neural Networks [Bot91, BCN16]
In this exercise, we consider the problem of classifying patterns x into two classes y = ±1. We assume that there is a relationship between a pattern and its class, embodied by some probability distribution P(x,y). If we know this distribution, we know the conditional probabilities P(y|x) as well, and we can immediately solve the problem using the Bayes decision rule. Learning means “acquiring enough knowledge about P(x,y) from the examples to solve the classification problem”.

The statistical machine learning approach begins with the collection of a sizeable set of examples {(x1,y1),...,(xn,yn)}, where for each i ∈ ⟦1, n⟧ the vector xi represents the features and the scalar yi a label indicating whether xi belongs (yi = 1) or not (yi = −1) to a particular class. With such a set of examples, one can construct a classification program, defined by a prediction function h, and measure its performance by counting how often the program prediction h(xi) differs from the correct prediction yi. To avoid rote memorization, one should aim to find a prediction function that generalizes the concepts that may be learned from the examples. One way to achieve good generalization performance is to choose amongst a carefully selected class of prediction functions.

For high-dimensional, sparse representations such as those used for text documents, it has been deemed empirically sufficient to consider prediction functions of the form h(x; w, τ) = ᵗwx − τ. Here, ᵗwx is a linear discriminant parameterized by w ∈ ℝᵈ, and τ ∈ ℝ is a bias that provides a way to compromise between precision and recall, P[y = 1 | h(x) = 1] and P[h(x) = 1 | y = 1] respectively. The accuracy of the predictions could be determined by counting the number of times that sign(h(x; w, τ)) matches the correct label, i.e., 1 or −1. However, while such a prediction function may be appropriate for classifying new features, formulating an optimization problem around it to choose the parameters (w, τ) is impractical in large-scale settings due to the combinatorial structure introduced by the discontinuous sign function. Instead, one typically employs a continuous approximation through a loss function that measures a cost for predicting h when the true label is y.

An Adaline (Widrow and Hoff, 1960) learns by (i) considering a linear prediction function, h(x, w) = ᵗwx, and (ii) measuring the quality of the system through the mean squared error:

C_Adaline(w) = ∫ (y − h(x, w))² dP(x, y) = ∫ (y − ᵗwx)² dP(x, y).

Learning consists of finding the parameter w⋆ that minimizes the above, or a more general, cost. This framework is the basis of classical statistical inference theory, and hundreds of practical algorithms have been derived from it.

In the following, we will denote by z = (x, y) the observation and consider the cost, or expected risk, associated with a parameter vector w with respect to the probability P:

R(w) = E[J(w, z)] = ∫ (y − ᵗwx)² dP(z).

While it may be desirable to minimize the expected loss that would be incurred from any input-output pair, such a goal is untenable when one does not have complete information about P. Thus, in practice, one seeks the solution of a problem that involves an estimate of the expected risk R. In supervised learning, one has access (either all-at-once or incrementally) to a set of n ∈ ℕ independently drawn input-output samples {zi = (xi, yi)}, i = 1, …, n, and one may define the empirical risk function Rn : ℝᵈ → ℝ by

Rn(w) = (1/n) ∑_{i=1}^{n} (yi − ᵗwxi)².
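For the implementation asked below, it helps to write out the stochastic gradient explicitly; this short derivation is standard calculus and not part of the statement. With the per-sample loss J(w, z) = (y − ᵗwx)²,

∇_w J(w, z) = −2 (y − ᵗwx) x,

so a stochastic gradient step with step size γ_k on a randomly drawn sample z_{i_k} reads

w_{k+1} = w_k + 2 γ_k (y_{i_k} − ᵗw_k x_{i_k}) x_{i_k}.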

Describe the stochastic gradient descent algorithm for minimizing the empirical risk and implement it (a sketch is given after these questions).
Sample a set of observations {zi}, i = 1, …, n, by generating a collection of random points xi of ℝ², choosing w̄ ∈ ℝ² seen as the normal vector of a hyperplane (a straight line here), and assigning the label yi according to the side of the hyperplane on which the point xi lies.
Test the algorithm you wrote at the first question on these observations. What is the estimated vector w⋆? Is it far from w̄?
Perturb your observations {zi}, i = 1, …, n, with additive Gaussian noise and perform the optimisation again. Compare with the result of question three.
Test the algorithm on the Breast Cancer Wisconsin (Diagnostic) Data Set [WSM95].
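The following sketch addresses the first two questions under illustrative assumptions: a constant step size, a fixed seed, and w̄ = (1, −2), none of which are prescribed by the statement.

```python
import numpy as np

rng = np.random.default_rng(2)

def sgd_least_squares(X, y, gamma=0.01, n_passes=10, rng=rng):
    """Minimize Rn(w) = (1/n) sum_i (yi - w.xi)^2 by stochastic gradient
    descent: draw a sample index at random, apply the update derived above."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_passes * n):
        i = rng.integers(n)
        w += 2.0 * gamma * (y[i] - X[i] @ w) * X[i]
    return w

# Observations: random points of R^2 labelled by the side of the hyperplane
# (a straight line through the origin) with normal vector w_bar.
n, w_bar = 500, np.array([1.0, -2.0])
X = rng.uniform(-1.0, 1.0, size=(n, 2))
y = np.sign(X @ w_bar)

w_hat = sgd_least_squares(X, y)
print("w_bar:", w_bar, "w_hat:", w_hat)

# Noisy features (question four): perturb X and rerun the optimisation.
X_noisy = X + 0.1 * rng.standard_normal(X.shape)
print("noisy w_hat:", sgd_least_squares(X_noisy, y))
```

Note that the least-squares minimizer need not equal w̄ itself (the labels are ±1, not ᵗw̄x); only its direction, i.e. the separating line, should be close.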
References

[BCN16] Léon Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. arXiv:1606.04838, 2016.

[Bot91] Léon Bottou. Stochastic gradient learning in neural networks. In Neuro-Nîmes 91, 1991.

[WSM95] William H. Wolberg, W. Nick Street, and Olvi L. Mangasarian. Breast Cancer Wisconsin (Diagnostic) Data Set, UCI Machine Learning Repository, 1995.
