Exercise 1: Randomize this...
1. Background: Random Projections for Data Sketching (more info)
Imagine a network switch which must process dozens of gigabytes per second, and may only have a few kilobytes or megabytes of fast memory available. Imagine also that, from the huge amount of traffic passing by the switch, we wish to compute basic stats like the number of distinct traffic flows (source/destination pairs) traversing the switch, the variability of the packet sizes, etc. Lots of data to process, not a lot of space: the perfect recipe for an epic infrastructural failure.
Streaming model of computation to the rescue! The model aims to capture scenarios in which a computing device with a very limited amount of storage must process a huge amount of data, and must compute some aggregate summary statistics about that data.
Let us formalize this model using a frequency vector x.
The raw stream is a growing sequence of indices $D_n = (i_1, i_2, \dots, i_k, \dots, i_n)$, where each $i_k$ belongs to the “data-alphabet” $\{1,\dots,d\}$, whose size $d$ may be very large. For our developments, think of $d$ as way larger than $n$, possibly infinite!
At an abstract level, the goal is to initialize, store and update sequentially[1] the frequency vector $x = [x_1,\dots,x_d]^T$ with
$$x_j = \#\{\text{indices } i_k \text{ in } D_n \text{ equal to } j\} = \mathrm{card}\{k : i_k = j\}, \qquad j \in \{1,\dots,d\},$$
and then to output some properties of $x$ at the end of the stream, such as its length, or the number of non-zero entries, etc. If the algorithm were to maintain $x$ explicitly, it could initialize $x^{(0)} \leftarrow [0,0,\dots,0]^T$ and then, at each step $k$, receive the index $i_k$ and simply increment the $i_k$-th coordinate of $x^{(k-1)}$ by 1 to get the updated frequency vector $x^{(k)}$. Notice that, in terms of notation, $x \equiv x^{(n)}$. Given this sequential, explicit representation of $x$ at the end of the stream, one can easily compute the desired properties.
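For concreteness, here is a minimal R sketch of this explicit (non-streaming) bookkeeping; the values of d and n are toy choices for illustration only, not part of the exercise.

## Explicit maintenance of the frequency vector x, as described above.
set.seed(123)
d <- 1000                                  # alphabet size (toy value)
n <- 200                                   # stream length (toy value)
stream <- sample(1:d, n, replace = TRUE)   # a toy raw stream D_n

x <- rep(0, d)                             # x^(0) = [0,...,0]^T
for (ik in stream) x[ik] <- x[ik] + 1      # step k: increment the ik-th coordinate

sqrt(sum(x^2))                             # ||x||, the quantity targeted below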
So far the problem is trivial. The algorithm can explicitly store the frequency vector $x$, or even the entire sequence $(i_1,\dots,i_n)$, and compute any desired function of those objects. What makes the model interesting are the following desiderata:
1. The algorithm should never explicitly store the whole data stream $D_n = (i_1,\dots,i_n)$.
2. The algorithm should never explicitly build and maintain the frequency vector $x$.
3. The algorithm only sees one index $i_k$ per round...btw, that’s the very idea of a streaming algorithm!
4. The algorithm should use $O(\log(n))$ words of space to process the entire stream $D_n$.
This list rules out trivial solutions. Remarkably, numerous interesting summary statistics can still be computed under this restrictive observational model if we allow for randomized algorithms that output approximate answers.
2. The Algorithm: “length” matters...
Here we will focus on a simple randomized algorithm to evaluate the length (a.k.a. $\ell_2$ norm) of $x$, namely $\|x\| = \sqrt{\sum_{i=1}^{d} x_i^2}$.
The idea of the algorithm is very simple: instead of storing x explicitly, we will store a dimensionality reduced form of x.
Dimensionality reduction is the process of mapping a high-dimensional dataset to a lower-dimensional space, while preserving much of the important structure – in this exercise, its length $\|x\|$. In statistics and machine learning, this often refers to the process of finding a few directions in which a high-dimensional random vector has maximum variance. Principal component analysis is a standard technique of this kind, but with a different “preservation target”.
Now, how do we get this compressed version of $x$? Simple: random projection! Why? Because there’s a neat result known as the Johnson-Lindenstrauss lemma which assures that, with high probability (w.r.t. the randomization), a well designed random projection will (almost) preserve pairwise distances and lengths between data points. Of course random projections are central in many other applications, and compressed sensing was one of the big, fascinating things not too long ago.
Here are the basic steps:
1. Let’s start by selecting a tuning parameter $p$, the “size” of the projection, with $p \ll n \ll d$.
2. Define $L$ to be a $(p \times d)$ random (projection) matrix whose entries are drawn independently as $\mathrm{N}(0, 1/p)$.
3. The algorithm will maintain only the $p$-dimensional vector $y$ (after its obvious initialization $y^{(0)}$), defined at step $k$ as
$$y^{(k)} = L \cdot x^{(k)}.$$
At time step $k$, the algorithm receives the index $i_k = j$ with $j \in \{1,\dots,d\}$, so implicitly the $j$-th coordinate of $x^{(k-1)}$ increases by 1 to give $x^{(k)}$. The corresponding explicit change to obtain $y^{(k)}$ is to add the $j$-th column of $L$ to $y^{(k-1)}$, as in the sketch below.
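A minimal R sketch of these steps, with toy sizes and with $L$ materialized explicitly purely for illustration (all variable names are mine, not prescribed):

## Streaming sketch: store only y = L x, updated one incoming index at a time.
set.seed(123)
d <- 1000; n <- 200; p <- 20                # toy sizes, not carefully tuned
stream <- sample(1:d, n, replace = TRUE)

## Entries of L drawn independently as N(0, 1/p), i.e. standard deviation sqrt(1/p).
L <- matrix(rnorm(p * d, mean = 0, sd = sqrt(1 / p)), nrow = p, ncol = d)

y <- rep(0, p)                              # y^(0) = L x^(0) = 0
for (ik in stream) y <- y + L[, ik]         # receiving index ik adds the ik-th column of L

## Sanity check against the explicit (non-streaming) frequency vector:
x <- tabulate(stream, nbins = d)
c(norm_y = sqrt(sum(y^2)), norm_x = sqrt(sum(x^2)))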
The Johnson-Lindenstrauss lemma (JL) then says that, for every tolerance $\epsilon > 0$ we pick,
$$\Pr\Big( (1-\epsilon)\cdot\|x\| \;\leq\; \|y\| \;\leq\; (1+\epsilon)\cdot\|x\| \Big) \;\geq\; 1 - e^{-\epsilon^2 \cdot p}. \qquad (1)$$
So if we set $p$ to a value of the order $1/\epsilon^2$, then, at the end of the stream, $\|y\|$ gives a $(1 + \epsilon)$ approximation of $\|x\|$ with constant probability. Or, if we want $y^{(k)}$ to give an accurate estimate at each time step, we can take $p = \Theta(\log(n)/\epsilon^2)$. Remark: please notice the remarkable fact that $p$, the suggested dimension of the projection/embedding, completely ignores the (presumably huge or even infinite) alphabet size $d$!
3. The Exercise: Comment every line of code and result you get!
Your job
1. Show the validity of the update step (i.e., the one you need to code): increasing by 1 the $j$-th coordinate of $x^{(k-1)}$ corresponds to adding the $j$-th column of $L$ to $y^{(k-1)}$.
2. Use R to generate the matrix $L$ many times (say $M = 1000$) and set up a suitable simulation study to double-check the JL lemma. You must play around with different values of $d$, $n$, and $p$ (just a few, well chosen values will be enough). The raw stream $D_n = (i_1,\dots,i_n)$ can be fixed or randomized too, but notice that the probability appearing in Equation (1) only accounts for the uncertainty implied by the randomness of $L$ – that’s why I stressed that you must generate $L$ “many times”... A starting-point skeleton appears right after this list.
3. The main and only object being updated by the algorithm is the vector y. This vector consumes p words of space, so...have we achieved our goal? Explain.
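As anticipated in point 2, here is one possible starting-point skeleton for the simulation study. All parameter values below are placeholders to be varied, and the reported lower bound is taken verbatim from Equation (1):

## JL check skeleton: regenerate L many times and record how often ||y||
## lands inside [(1 - eps) ||x||, (1 + eps) ||x||].  Placeholder parameters.
set.seed(123)
d <- 2000; n <- 500; p <- 64; eps <- 0.25; M <- 1000

stream <- sample(1:d, n, replace = TRUE)        # one fixed raw stream D_n
x      <- tabulate(stream, nbins = d)
norm_x <- sqrt(sum(x^2))

inside <- replicate(M, {
  L      <- matrix(rnorm(p * d, sd = sqrt(1 / p)), nrow = p)
  norm_y <- sqrt(sum((L %*% x)^2))              # ||y^(n)||, computed in one shot
  (1 - eps) * norm_x <= norm_y && norm_y <= (1 + eps) * norm_x
})
c(empirical = mean(inside), bound_from_eq_1 = 1 - exp(-eps^2 * p))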
Exercise 2: Faraway, So Close!
1. Background: Statistical Distances and Losses (more info)
Suppose that $X$ is a random vector in $\mathbb{R}^m$ with distribution $P_X$ and $Y$ is another random vector in $\mathbb{R}^m$ with distribution $P_Y$; then, for $p \geq 1$, the $p$-Wasserstein distance based on the ground distance $d$ is defined as
$$W_{d,p}(P_X, P_Y) \equiv W_{d,p}(X, Y) = \inf_{J} \left( \int_{\mathbb{R}^m \times \mathbb{R}^m} d(x,y)^p \, \mathrm{d}J(x,y) \right)^{1/p},$$
where the infimum is over all joint distributions $J$ having $P_X$ and $P_Y$ as given marginals...sounds complicated, doesn’t it? Anyway, these optimal transport techniques are extremely useful and widespread nowadays, with tons of “...truly marvelous Machine Learning applications, which this homework is too narrow to contain...”.
Nevertheless, when $m = 1$, that is for univariate distributions, things simplify quite a bit and it can be shown that, by picking the $L_1$ metric (i.e. the absolute distance) as ground distance, we obtain
$$W_{L_1,p}(P_X, P_Y) = \left( \int_0^1 \big| F_X^{-1}(z) - F_Y^{-1}(z) \big|^p \, \mathrm{d}z \right)^{1/p},$$
where $F_X(\cdot)$ and $F_Y(\cdot)$ are the cumulative distribution functions of $P_X$ and $P_Y$, respectively, and, consequently, $F_X^{-1}(\cdot)$ and $F_Y^{-1}(\cdot)$ are essentially their associated quantile functions.
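As a quick numerical illustration of this univariate formula, here is a minimal R sketch that evaluates it directly from the two quantile functions; the two Beta distributions (and the function name) are arbitrary choices of mine:

## W_{L1,p} between two univariate distributions, via their quantile functions.
wasserstein_1d <- function(qF, qG, p = 1) {
  integrand <- function(z) abs(qF(z) - qG(z))^p
  integrate(integrand, lower = 0, upper = 1)$value^(1 / p)
}

## Example with two (arbitrary) Beta distributions:
wasserstein_1d(function(z) qbeta(z, 2, 5),
               function(z) qbeta(z, 3, 3))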
2. The Learning Framework
1. Think about $F_X(\cdot)$ as the true population model and assume that $X \sim \mathrm{Beta}(\alpha,\beta)$ for some value of $(\alpha,\beta)$ that you will choose (read about this flexible parametric family with bounded support on pages 54-56 of our Gallery).
Let $f(\cdot)$ denote the associated true density and $F^{-1}(\cdot)$ its quantile function.
2. Think about $F_Y(\cdot) = F_Y(\cdot\,;\theta)$ as an approximation to $F_X(\cdot)$ that you need to tune/optimize with respect to the parameter vector $\theta$ in order to achieve some specific goal.
For this exercise we will consider a stepwise function (just like a histogram), whose density can be described as follows:
• Divide the interval $[0,1]$, the support of $f(\cdot)$, into bins of length $h$: there are $N \approx 1/h$ bins. Denote the bins by $\{B_1,\dots,B_N\}$.
• The approximating density is then
$$\hat f(x;\theta) = \sum_{j=1}^{N} \frac{\pi_j}{h}\, \mathbb{1}(x \in B_j),$$
where $\theta = (h, \pi_1, \dots, \pi_N)$ and $\mathbb{1}(x \in B_j)$ is the indicator function of the $j$-th bin.
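A minimal R sketch of this stepwise density, assuming equal-width bins on [0,1]; the names fhat, h and pi_vec are illustrative, not prescribed by the text:

## Stepwise (histogram-like) density: fhat(x) = sum_j (pi_j / h) * 1(x in B_j).
fhat <- function(x, h, pi_vec) {
  N <- length(pi_vec)                       # N ~ 1/h bins B_1, ..., B_N
  j <- pmin(pmax(ceiling(x / h), 1), N)     # index of the bin containing x
  (pi_vec[j] / h) * (x >= 0 & x <= 1)       # zero outside the support [0,1]
}

## Quick look at the total mass for one particular choice of pi_vec:
h <- 0.1
integrate(fhat, 0, 1, h = h, pi_vec = rep(1 / 10, 10))$value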
3. Assume that we are working under perfect information, meaning that we can access/query the true model.
3. The Exercise: Comment every line of code and result you get!
Your job
1. For a fixed $h$, what kind of constraints do we need to impose on the parameter vector $(\pi_1,\dots,\pi_N)$ for $\hat f(x;\theta)$ to be a legit density?
2. Implement $\hat f(x;\theta)$ and its corresponding quantile function, say $\hat F^{-1}(x;\theta)$, in R.
3. Pick a specific $(\alpha,\beta)$ pair and, for any $h > 0$, fix $\pi_j = \int_{B_j} f(x)\,\mathrm{d}x$. Notice that now $\hat f(\cdot;\theta)$ depends on a single tuning parameter, the binwidth $h$. For this reason, let’s simplify the notation and use $\hat f_h(\cdot)$ and $\hat F_h^{-1}(\cdot)$ to denote the approximant density and its quantile function, respectively.
Let $\epsilon > 0$ be an approximation level you are targeting in terms of the $p = 1$ Wasserstein distance, meaning that you’re looking for the largest binwidth $h$ (i.e. the coarsest approximation) such that
$$W_{L_1,1}\big(f, \hat f_h\big) = \int_0^1 \big| F^{-1}(z) - \hat F_h^{-1}(z) \big| \,\mathrm{d}z \;\leq\; \epsilon.$$
Use any (reasonable) technique you like to study how $h$ varies with $\epsilon$ (...and properly comment the results). A possible starting point is sketched right after this list.
4. (Bonus). Repeat the previous point leaving $(\pi_1,\dots,\pi_N)$ free to vary (quite a bit harder!). Compare the results.
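As anticipated in point 3, here is one possible starting point, under assumed values of $(\alpha,\beta)$ and $\epsilon$ and a plain grid search over $h$; every name and numeric choice below is mine and should be replaced by your own, better-motivated ones:

## Point 3 starting point: pi_j = integral of the Beta density over B_j, then
## search for the largest h whose W_{L1,1} distance stays below eps.
alpha <- 2; beta <- 5; eps <- 0.01          # assumed, not prescribed

bin_probs <- function(h) {                  # pi_j = integral of f over B_j
  diff(pbeta(seq(0, 1, by = h), alpha, beta))
}

Fhat_inv <- function(z, h) {                # quantile function of fhat_h (piecewise linear)
  pi_vec <- bin_probs(h)
  Ccum   <- c(0, cumsum(pi_vec))            # CDF values at the bin edges 0, h, 2h, ..., 1
  j      <- pmin(pmax(findInterval(z, Ccum, rightmost.closed = TRUE), 1), length(pi_vec))
  (j - 1) * h + (z - Ccum[j]) / pi_vec[j] * h
}

w1 <- function(h) {                         # W_{L1,1}(f, fhat_h) via the quantile formula
  integrate(function(z) abs(qbeta(z, alpha, beta) - Fhat_inv(z, h)),
            lower = 0, upper = 1)$value
}

hs <- 1 / (2:200)                           # h = 1/N, so the bins tile [0,1] exactly
max(hs[sapply(hs, w1) <= eps])              # coarsest h on the grid meeting the tolerance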
[1] That is, on-the-fly, just by “looking” at each $i_k$ as it appears, without storing the raw data in $D_n$ at any step.