Exercise 1: Estimating the Bayes Error (10+10+10 P)
The Bayes decision rule for the two-class classification problem results in the Bayes error
$$P(\text{error}) = \int P(\text{error}|x)\, p(x)\, dx,$$
where $P(\text{error}|x) = \min[P(\omega_1|x), P(\omega_2|x)]$ is the probability of error for a particular input $x$. Interestingly, while the class posteriors $P(\omega_1|x)$ and $P(\omega_2|x)$ can often be expressed analytically and are integrable, the error function involves a pointwise minimum that generally cannot be integrated analytically, which prevents a direct computation of the Bayes error.
(a) Show that the full error can be upper-bounded as follows:
$$P(\text{error}) \;\leq\; \int \frac{2\, P(\omega_1|x)\, P(\omega_2|x)}{P(\omega_1|x) + P(\omega_2|x)}\; p(x)\, dx.$$
Note that the integrand is now smooth and corresponds to the harmonic mean of the class posteriors, weighted by $p(x)$.
(b) Show using this result that for the univariate probability distributions
and ,
the Bayes error can be upper-bounded by:
P(error)
(Hint: you can use the identity
(c) Explain how you would estimate the Bayes error if there were no upper bounds that are both tight and analytically integrable. Discuss the following two cases: (1) the data is low-dimensional, and (2) the data is high-dimensional.
Exercise 2: Bayes Decision Boundaries (15+15 P)
One might speculate that, in some cases, the data-generating distributions $p(x|\omega_1)$ and $p(x|\omega_2)$ are of no use for improving the accuracy of a classifier, in which case one should rely only on the prior class probabilities $P(\omega_1)$ and $P(\omega_2)$, assumed here to be strictly positive.
For the first part of this exercise, we assume that the data for each class is generated by the univariate Laplacian probability distributions:
and .
where $\mu, \sigma > 0$.
(a) Determine for which values of $P(\omega_1)$, $P(\omega_2)$, $\mu$, $\sigma$ the optimal decision is to always predict the first class (i.e. under which conditions $P(\text{error}|x) = P(\omega_2|x) \ \forall x \in \mathbb{R}$).
(b) Repeat the exercise for the case where the data for each class is generated by the univariate Gaussian probability distributions:
and .
where $\mu, \sigma > 0$.
Exercise 3: Programming (40 P)
Download the programming files on ISIS and follow the instructions.
Programming Sheet 1: Bayes Decision Theory (40 P)
In this exercise sheet, we will apply Bayes decision theory in the context of small two-dimensional problems. For this, we will make use of 3D plotting. We introduce below the basics for constructing these plots in Python/Matplotlib.
The function numpy.meshgrid
To plot two-dimensional functions, we first need to discretize the two-dimensional input space. One basic function for this purpose is numpy.meshgrid. The following code creates a discrete grid of the rectangular surface $[0,4] \times [0,3]$. The function numpy.meshgrid takes the discretized intervals as input and returns two arrays, each of the size of the discretized surface (i.e. the grid), containing the X- and Y-coordinates respectively.
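A minimal sketch of such a cell, producing the output shown below:

```python
import numpy as np

# Discretize the rectangle [0,4] x [0,3] with unit steps
X, Y = np.meshgrid(np.arange(0, 5), np.arange(0, 4))

print(X)  # X-coordinates of each grid point, shape (4, 5)
print(Y)  # Y-coordinates of each grid point, shape (4, 5)
```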
[[0 1 2 3 4]
 [0 1 2 3 4]
 [0 1 2 3 4]
 [0 1 2 3 4]]
[[0 0 0 0 0]
 [1 1 1 1 1]
 [2 2 2 2 2]
 [3 3 3 3 3]]
Note that we can iterate over the elements of the grid by zipping the two arrays X and Y containing the coordinates. The method numpy.ndarray.flatten converts the 2D arrays into one-dimensional arrays, which can then be iterated over element-wise.
In [2]:
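A sketch of this cell, assuming the grid arrays X and Y defined above:

```python
# Flatten both coordinate arrays and zip them to obtain (x, y) pairs
print(list(zip(X.flatten(), Y.flatten())))
```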
[(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (0, 2), (1, 2), (2, 2), (3, 2), (4, 2), (0, 3), (1, 3), (2, 3), (3, 3), (4, 3)]
3D-Plotting
To enable 3D-plotting, we first need to load some modules in addition to matplotlib:
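A typical set of imports for this purpose (a sketch; the exact modules loaded in the original cell may differ):

```python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the '3d' projection (needed on older matplotlib versions)
```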
As an example, we consider a function $f$ of the two inputs that we would like to plot over the square $[-4,4] \times [-4,4]$. For this, we first need to build a meshgrid of appropriate size:
In [4]:
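A sketch of this cell, assuming the grid covers $[-4,4] \times [-4,4]$ with step size 0.1 (the same discretization is reused in Exercise 1 below):

```python
# Discretize [-4, 4] in steps of 0.1, giving 81 points per axis
R = np.arange(-4, 4.01, 0.1)
X, Y = np.meshgrid(R, R)

print(X.shape, Y.shape)
```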
(81, 81) (81, 81)
Here, we have used a discretization with small increments of 0.1 in order to produce a plot with better resolution. The resulting meshgrid has size (81×81), that is, 6561 points. The function $f$ needs to be evaluated at each of these points. This is achieved by applying element-wise operations on the arrays of the meshgrid. The norm at each point of the grid is therefore computed as:
In [5]:
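A sketch of this cell: the norm is computed exactly as described above, while the particular function $f$ applied to it is only an illustrative assumption, since its definition is not repeated here:

```python
# Norm at each grid point, using element-wise operations on the meshgrid arrays
N = np.sqrt(X**2 + Y**2)

# Evaluate f at each grid point (the choice of f below is only an example)
F = np.exp(-N)

print(F.shape)
```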
(81, 81)
The resulting function values are of the same size as the meshgrid. Taking X, Y, F jointly results in a list of 6561 triplets representing the x-, y-, and z-coordinates at which the function should be plotted in three-dimensional space. The 3D plot can now be constructed easily by means of the scatter function of matplotlib's 3D axes.
In [6]:
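A sketch of this cell, using the scatter function of a 3D axes object (the exact plotting calls are an assumption):

```python
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')  # create 3D axes

# Scatter the 6561 (x, y, f(x,y)) triplets; s sets the point size, alpha the transparency
ax.scatter(X.flatten(), Y.flatten(), F.flatten(), s=1, alpha=0.5)
```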
Out[6]:
<mpl_toolkits.mplot3d.art3d.Path3DCollection at 0x7f0ac146c2b0>
The parameters s and alpha control the size and the transparency of each data point. Other 3D plotting variants exist (e.g. surface plots); however, the scatter plot is conceptually the simplest approach. Having introduced how to easily plot 3D functions in Python, we can now analyze two-dimensional probability distributions with this same tool.
Exercise 1: Gaussian distributions (5+5+5 P)
Using the technique introduced above, we would like to plot a normal Gaussian probability distribution with mean vector $\mu = (0,0)$ and covariance matrix $\Sigma = I$, also known as the standard normal distribution. We consider the same discretization as above (i.e. a grid from -4 to 4 using step size 0.1). For two-dimensional input spaces, the standard normal distribution is given by:
$$ p(x,y) = \frac{1}{2\pi}e^{-0.5 (x^2+y^2)}. $$
This distribution integrates to $1$ over $\mathbb{R}^2$. However, it does not sum to $1$ when summed over the discretized space (i.e. the grid). Instead, we can work with a discretized Gaussian-like distribution:
$$ P(x,y) = \frac1Z e^{-0.5 (x^2+y^2)} \qquad \text{with} \quad Z = \sum_{x,y} e^{-0.5 (x^2+y^2)} $$
where the sum runs over the whole discretized space.
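A sketch of how such a discretized distribution can be computed and plotted, assuming the grid arrays X, Y and the imports from above:

```python
# Unnormalized Gaussian values at each grid point
G = np.exp(-0.5 * (X**2 + Y**2))

# Normalize over the discretized space so that the values sum to 1
P = G / G.sum()

# 3D scatter plot of the discretized distribution
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X.flatten(), Y.flatten(), P.flatten(), s=1, alpha=0.5)
```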
Compute the distribution $P(x,y)$, and plot it.
Compute the conditional distribution $Q(x,y) = P((x,y) | \sqrt{x^2+y^2} \geq 1)$, and plot it.
Marginalize the conditioned distribution $Q(x,y)$ over $y$, and plot the resulting distribution $Q(x)$.
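A sketch of the conditioning and marginalization steps, assuming the distribution P computed as in the previous sketch:

```python
# Condition on the event sqrt(x^2 + y^2) >= 1: keep only the grid points
# satisfying the condition and renormalize the remaining probability mass
mask = np.sqrt(X**2 + Y**2) >= 1
Q = P * mask
Q = Q / Q.sum()

# Marginalize over y: sum over the grid axis along which y varies
Qx = Q.sum(axis=0)   # one probability value per x-coordinate
x_values = X[0]      # the corresponding x-coordinates
```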
In [7]:
solutions.s1a()
###
In [9]:
solutions.s1c()
###
Exercise 2: Bayesian Classification (5+5+5 P)
Let the two coordinates x and y now be represented as a two-dimensional vector $\boldsymbol{x}$. We consider two classes $\omega_1$ and
$\omega_2$ with data-generating Gaussian distributions $p(\boldsymbol{x}|\omega_1)$ and $p(\boldsymbol{x}|\omega_2)$ of mean vectors
$$\boldsymbol{\mu}_1 = (-0.5,-0.5) \quad \text{and} \quad \boldsymbol{\mu}_2 = (0.5,0.5)$$
respectively, and the same covariance matrix
$$\Sigma = \begin{pmatrix}1.0&0\\0&0.5\end{pmatrix}.$$
Classes occur with probabilities $P(\omega_1) = 0.9$ and $P(\omega_2) = 0.1$. Analysis tells us that in such a scenario, the optimal decision boundary between the two classes should be linear. We would like to verify this computationally by applying Bayes decision theory on grid-like discretized distributions.
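The linearity of the optimal boundary follows from the log-ratio of the class posteriors: with a shared covariance matrix, the quadratic terms in $\boldsymbol{x}$ cancel, leaving
$$
\log\frac{P(\omega_1|\boldsymbol{x})}{P(\omega_2|\boldsymbol{x})}
= (\boldsymbol{\mu}_1-\boldsymbol{\mu}_2)^\top\Sigma^{-1}\boldsymbol{x}
- \tfrac{1}{2}\left(\boldsymbol{\mu}_1^\top\Sigma^{-1}\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2^\top\Sigma^{-1}\boldsymbol{\mu}_2\right)
+ \log\frac{P(\omega_1)}{P(\omega_2)},
$$
which is linear in $\boldsymbol{x}$; the decision boundary is the set of points where this expression equals zero.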
Using the same grid as in Exercise 1, discretize the two data-generating distributions $p(\boldsymbol{x}|\omega_1)$ and
$p(\boldsymbol{x}|\omega_2)$ (i.e. create discrete distributions $P(\boldsymbol{x}|\omega_1)$ and $P(\boldsymbol{x}|\omega_2)$ on the grid), and plot them with different colors.
From these distributions, compute the total probability distribution $P(\boldsymbol{x}) = \sum_{c \in \{1,2\}} P(\boldsymbol{x} | \omega_c) \cdot P(\omega_c)$, and plot it.
Compute and plot the class posterior probabilities $P(\omega_1|\boldsymbol{x})$ and $P(\omega_2|\boldsymbol{x})$, and print the Bayes error rate for the discretized case.
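A sketch of these computations. It assumes the grid X, Y from above, writes the class-conditionals as products of univariate Gaussians (valid because the covariance is diagonal), and uses illustrative variable names:

```python
# Class parameters
mu1, mu2 = np.array([-0.5, -0.5]), np.array([0.5, 0.5])
var = np.array([1.0, 0.5])      # diagonal of Sigma
p1, p2 = 0.9, 0.1               # prior class probabilities

def discretized_conditional(mu, var):
    # Gaussian with diagonal covariance, renormalized to sum to 1 over the grid
    G = np.exp(-0.5 * ((X - mu[0])**2 / var[0] + (Y - mu[1])**2 / var[1]))
    return G / G.sum()

Px_w1 = discretized_conditional(mu1, var)
Px_w2 = discretized_conditional(mu2, var)

# Total probability and class posteriors (Bayes rule on the grid)
Px = p1 * Px_w1 + p2 * Px_w2
Pw1_x = p1 * Px_w1 / Px
Pw2_x = p2 * Px_w2 / Px

# Discretized Bayes error rate: expectation of the smaller posterior
bayes_error = (np.minimum(Pw1_x, Pw2_x) * Px).sum()
print('Bayes error rate: %.3f' % bayes_error)
```

Plotting the individual distributions works exactly as in the previous sketches, using 3D scatter plots of the flattened arrays.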
In [11]:
solutions.s2b()
###
Exercise 3: Reducing the Variance (5+5 P)
Suppose that the data generating distribution for the second class changes to produce samples much closer to the mean. This variance reduction for the second class is implemented by keeping the first covariance the same (i.e. $\Sigma_1 = \Sigma$) and dividing the second covariance matrix by 4 (i.e.
$\Sigma_2 = \Sigma/4$). For this new set of parameters, we can perform the same analysis as in Exercise 2.
Plot the new class posterior probabilities $P(\omega_1|\boldsymbol{x})$ and $P(\omega_2|\boldsymbol{x})$ associated to the new covariance matrices, and print the new Bayes error rate.
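A sketch, reusing the illustrative helper from the previous sketch with the second covariance divided by 4; the printed value may differ slightly from the reference output below depending on discretization details:

```python
# Same analysis as before, but with the variance of class 2 divided by 4
Px_w1 = discretized_conditional(mu1, var)
Px_w2 = discretized_conditional(mu2, var / 4)

Px = p1 * Px_w1 + p2 * Px_w2
Pw1_x = p1 * Px_w1 / Px
Pw2_x = p2 * Px_w2 / Px

bayes_error = (np.minimum(Pw1_x, Pw2_x) * Px).sum()
print('Bayes error rate: %.3f' % bayes_error)
```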
In [13]:
solutions.s3a()
###
Bayes error rate: 0.073
Intuition tells us that by reducing the variance and thereby concentrating the data generated for class 2 in a smaller region of the input space, it should become easier to predict class 2 with certainty at that location. Paradoxically, in this new "dense" setting, we observe that class 2 does not reach full certainty anywhere in the input space, whereas it did in the previous exercise.
Explain this paradox.
In [14]: