Problem 1: Optimization and probability
We will cast a lot of AI problems as optimization problems, that is, finding the best solution in a rigorous mathematical sense. At the same time, we must be adroit at coping with uncertainty in the world, and for that, we appeal to tools from probability.
Let $x_1, \dots, x_n$ be real numbers representing positions on a number line. Let $w_1, \dots, w_n$ be positive real numbers representing the importance of each of these positions. Consider the quadratic function $f(\theta) = \frac{1}{2} \sum_{i=1}^n w_i (\theta - x_i)^2$. What value of $\theta$ minimizes $f(\theta)$? You can think about this problem as trying to find the point $\theta$ that's not too far away from the $x_i$'s. Over time, hopefully you'll appreciate how nice quadratic functions are to minimize. What problematic issues could arise if some of the $w_i$'s are negative?
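One way to sanity-check a derived minimizer is to evaluate the objective on a fine grid and compare. The sketch below is only an illustration: the helper names `weighted_quadratic` and `grid_minimizer` are made up here, and the search range assumes all weights are positive, so that the minimizer lies between the smallest and largest position.

```python
def weighted_quadratic(theta, xs, ws):
    """Evaluate f(theta) = 1/2 * sum_i w_i * (theta - x_i)^2."""
    return 0.5 * sum(w * (theta - x) ** 2 for x, w in zip(xs, ws))


def grid_minimizer(xs, ws, steps=10001):
    """Return the grid point in [min(xs), max(xs)] with the smallest f value.

    Assumption: every weight is positive. With negative weights the
    minimum may not lie in this range, or may not exist at all.
    """
    lo, hi = min(xs), max(xs)
    if lo == hi:
        return lo
    grid = [lo + (hi - lo) * k / (steps - 1) for k in range(steps)]
    return min(grid, key=lambda t: weighted_quadratic(t, xs, ws))
```

Calling `grid_minimizer(xs, ws)` on a few hand-picked inputs and comparing against your closed-form answer is a quick check, not a proof.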
In this class, there will be a lot of sums and maxes. Let's see what happens if we switch the order. Let $f(x) = \sum_{i=1}^d \max_{s \in \{1, -1\}} s x_i$ and $g(x) = \max_{s \in \{1, -1\}} \sum_{i=1}^d s x_i$, where $x = (x_1, \dots, x_d) \in \mathbb{R}^d$ is a real vector. Does $f(x) \le g(x)$, $f(x) = g(x)$, or $f(x) \ge g(x)$ hold for all $x$? Prove it.
Hint: You may find it helpful to refactor the expressions so they are maximizing the same quantity over different sized sets.
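Before attempting the proof, it can help to evaluate both expressions on a handful of vectors and form a conjecture. The following is a minimal sketch; `compare_on_random_inputs` and its parameters are hypothetical helpers introduced here, and numerical agreement on samples does not replace the proof.

```python
import random

SIGNS = (1, -1)


def f(x):
    """Sum over coordinates of the per-coordinate max over signs."""
    return sum(max(s * xi for s in SIGNS) for xi in x)


def g(x):
    """Max over a single shared sign of the summed coordinates."""
    return max(sum(s * xi for xi in x) for s in SIGNS)


def compare_on_random_inputs(trials=1000, d=5, seed=0):
    """Return (x, f(x), g(x)) triples for randomly drawn vectors x."""
    rng = random.Random(seed)
    rows = []
    for _ in range(trials):
        x = [rng.uniform(-10, 10) for _ in range(d)]
        rows.append((x, f(x), g(x)))
    return rows
```

Inspecting the returned triples shows which of the three relations is never violated on the sampled inputs.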
Suppose you repeatedly roll a fair six-sided die until you roll a 1 (and then you stop). Every time you roll a 2, you lose a points, and every time you roll a 6, you win b points. What is the expected number of points (as a function of a and b) you will have when you stop?
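A simulation can be used to check a derived expectation against an empirical average. This is a sketch under stated assumptions: `play_once` and `estimate_expected_points` are illustrative names, and `roll` is any zero-argument callable returning a face from 1 to 6, which makes it easy to script a fixed sequence of rolls.

```python
import random


def play_once(a, b, roll):
    """Play one game: roll until a 1 appears, losing a per 2 and winning b per 6."""
    points = 0
    while True:
        face = roll()
        if face == 1:
            return points
        if face == 2:
            points -= a
        elif face == 6:
            points += b


def estimate_expected_points(a, b, trials=100_000, seed=0):
    """Average the final score over many simulated games with a fair die."""
    rng = random.Random(seed)

    def roll():
        return rng.randint(1, 6)  # inclusive on both ends

    total = sum(play_once(a, b, roll) for _ in range(trials))
    return total / trials
```

The estimate is noisy, so compare it with your formula for several values of a and b rather than relying on a single run.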
Suppose the probability of a coin turning up heads is $0 < p < 1$, and that we flip it 7 times and get $\{H, H, T, H, T, T, H\}$. We know the probability (likelihood) of obtaining this sequence is $L(p) = p \cdot p \cdot (1-p) \cdot p \cdot (1-p) \cdot (1-p) \cdot p = p^4 (1-p)^3$. Now let's go back and ask the question: what value of $p$ maximizes $L(p)$? What is an intuitive interpretation of this value of $p$?
Hint: Consider taking the derivative of $\log L(p)$. You can also directly take the derivative of $L(p)$, but it is cleaner and more natural to differentiate $\log L(p)$. You can verify for yourself that the value of $p$ which maximizes $\log L(p)$ must also maximize $L(p)$ (you are not required to prove this in your solution).
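To see the shape of the log-likelihood before differentiating it, you can evaluate it on a grid of candidate values. The sketch below assumes the counts from the sequence above (4 heads, 3 tails) as defaults; the function names are illustrative only.

```python
import math


def log_likelihood(p, heads=4, tails=3):
    """log L(p) = heads * log(p) + tails * log(1 - p), for 0 < p < 1."""
    return heads * math.log(p) + tails * math.log(1 - p)


def grid_argmax(heads=4, tails=3, steps=999):
    """Return the grid point strictly inside (0, 1) with the largest log-likelihood."""
    grid = [k / (steps + 1) for k in range(1, steps + 1)]
    return max(grid, key=lambda p: log_likelihood(p, heads, tails))
```

The grid result only approximates the maximizer to within the grid spacing, so treat it as a check on the analytic answer.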
Let's practice taking gradients, which is a key operation for being able to optimize continuous functions. For $w \in \mathbb{R}^d$ (represented as a column vector), constants $a_i, b_j \in \mathbb{R}^d$ (also represented as column vectors), and $\lambda \in \mathbb{R}$, define the scalar-valued function
$$f(w) = \sum_{i=1}^n \sum_{j=1}^n (a_i^\top w - b_j^\top w)^2 + \lambda \|w\|_2^2,$$
where the vector is $w = (w_1, \dots, w_d)^\top$ and $\|w\|_2 = \sqrt{\sum_{k=1}^d w_k^2}$ is known as the $L_2$ norm. Compute the gradient $\nabla f(w)$.
Recall: the gradient is a $d$-dimensional vector of the partial derivatives with respect to each $w_i$: $\nabla f(w) = \left(\frac{\partial f(w)}{\partial w_1}, \dots, \frac{\partial f(w)}{\partial w_d}\right)^\top$.
If you're not comfortable with vector calculus, first warm up by working out this problem using scalars in place of vectors and derivatives in place of gradients. Not everything for scalars goes through for vectors, but the two should at least be consistent with each other (when d=1). Do not write out summation over dimensions, because that gets tedious.
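A finite-difference check is a common way to test a hand-derived gradient. The sketch below uses central differences on a plain Python list; `numerical_gradient` and the example function `squared_norm` are hypothetical names chosen for illustration and are not part of the assignment code.

```python
def numerical_gradient(f, w, eps=1e-6):
    """Approximate the gradient of f at w with central differences."""
    grad = []
    for k in range(len(w)):
        w_plus = list(w)
        w_minus = list(w)
        w_plus[k] += eps
        w_minus[k] -= eps
        grad.append((f(w_plus) - f(w_minus)) / (2 * eps))
    return grad


def squared_norm(w):
    """Example function: the squared L2 norm of w."""
    return sum(wk * wk for wk in w)
```

For instance, `numerical_gradient(squared_norm, [1.0, -2.0, 3.0])` can be compared entry by entry with the gradient you compute by hand for the squared norm; the same comparison works for any scalar-valued function of a list.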
Problem 2: Complexity
When designing algorithms, it's useful to be able to do quick back-of-the-envelope calculations to see how much time or space an algorithm needs. Hopefully, you'll start to get more intuition for this by being exposed to more types of problems.
Suppose we have an image of a human face consisting of $n \times n$ pixels. In our simplified setting, a face consists of two eyes, two ears, one nose, and one mouth, each represented as an arbitrary axis-aligned rectangle (i.e., the axes of the rectangle are aligned with the axes of the image). As we'd like to handle Picasso portraits too, there are no constraints on the location or size of the rectangles. How many possible faces (choice of its component rectangles) are there? In general, we only care about asymptotic complexity, so give your answer in the form of $O(n^c)$ or $O(c^n)$ for some integer $c$.
Suppose we have an n×n grid. We start in the upper-left corner (coordinate (1,1)), and we would like to reach the lower-right corner (coordinate (n,n)) by taking single steps down and right. Define a function c(i,j) to be the cost of touching square (i,j), and assume it takes constant time to compute. Note that c(i,j) can be negative. Give an algorithm for computing the minimum cost in the most efficient way. What is the runtime (just give the big-O)?
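When developing a faster algorithm, a brute-force reference for tiny grids is handy for testing. The sketch below enumerates every down/right path, so its running time grows exponentially in n and it is not the efficient algorithm the question asks for. It assumes that i indexes rows, j indexes columns, and that the path cost includes both the start and end squares; `brute_force_min_cost` is an illustrative name.

```python
from itertools import combinations


def brute_force_min_cost(n, c):
    """Minimum total cost over all down/right paths from (1, 1) to (n, n).

    c is a function c(i, j) giving the cost of touching square (i, j).
    """
    total_moves = 2 * (n - 1)
    best = None
    # Choose which of the 2(n-1) moves go down; the rest go right.
    for down_positions in combinations(range(total_moves), n - 1):
        downs = set(down_positions)
        i, j = 1, 1
        cost = c(i, j)
        for step in range(total_moves):
            if step in downs:
                i += 1
            else:
                j += 1
            cost += c(i, j)
        if best is None or cost < best:
            best = cost
    return best
```

Comparing this reference against your own implementation on small grids with random (including negative) costs can catch indexing mistakes.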
Suppose we have a staircase with $n$ steps (we start on the ground, so we need $n$ total steps to reach the top). We can take as many steps forward at a time as we like, but we will never step backwards. How many ways are there to reach the top? Give your answer as a function of $n$. For example, if $n = 3$, then the answer is 4. The four options are the following: (1) take one step, take one step, take one step; (2) take two steps, take one step; (3) take one step, take two steps; (4) take three steps.
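Listing the options explicitly for small n, as in the example above, can help you spot the pattern. The following sketch enumerates every sequence of forward moves recursively; `enumerate_ways` is a hypothetical helper, and it is deliberately brute force rather than a closed form.

```python
def enumerate_ways(n):
    """Return every sequence of positive step sizes that sums to n."""
    if n == 0:
        return [[]]  # exactly one way to climb zero steps: do nothing
    ways = []
    for first in range(1, n + 1):
        for rest in enumerate_ways(n - first):
            ways.append([first] + rest)
    return ways
```

For n = 3 this returns the four options listed in the problem statement, and `len(enumerate_ways(n))` for a few more values of n gives data to test a conjectured formula.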
Consider the scalar-valued function $f(w)$ from Problem 1e (the gradient exercise). Devise a strategy that first does preprocessing in $O(nd^2)$ time, and then, for any given vector $w$, takes only $O(d^2)$ time to compute $f(w)$.
Hint: Refactor the algebraic expression; this is a classic trick used in machine learning. Again, you may find it helpful to work out the scalar case first.
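A direct evaluation of the double sum gives a baseline to test a refactored version against. The sketch below is that naive baseline only, costing on the order of n^2 d operations per call; it does not perform the preprocessing the problem asks for. Vectors are plain Python lists, and `dot` and `naive_f` are illustrative names.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(ui * vi for ui, vi in zip(u, v))


def naive_f(a, b, lam, w):
    """Evaluate sum_i sum_j (a_i.w - b_j.w)^2 + lam * ||w||_2^2 directly.

    a and b are lists of d-dimensional vectors; w is a d-dimensional vector.
    """
    total = 0.0
    for a_i in a:
        for b_j in b:
            diff = dot(a_i, w) - dot(b_j, w)
            total += diff * diff
    return total + lam * dot(w, w)
```

Once you have a preprocessed version, checking that it agrees with `naive_f` on random inputs (up to floating-point error) is a reasonable test.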
Problem 3: Programming
In this problem, you will implement a bunch of short functions. The main purpose of this exercise is to familiarize yourself with Python, but as a bonus, the functions that you will implement will come in handy in subsequent homeworks.
If you're new to Python, the following provide pointers to various tutorials and examples for the language:
Python for Programmers
Example programs of increasing complexity
Implement findAlphabeticallyLastWord in submission.py.
Implement euclideanDistance in submission.py.
Implement mutateSentences in submission.py.
Implement sparseVectorDotProduct in submission.py.
Implement incrementSparseVector in submission.py.
Implement findSingletonWords in submission.py.
Implement computeLongestPalindromeLength in submission.py.