Welcome to your first CS221 assignment!
The goal of this assignment is to sharpen your math and programming skills
needed for this class. If you meet the prerequisites, you should find these
problems relatively innocuous. Some of these problems will occur again
as subproblems of later homeworks, so make sure you know how to do them.
Problem 1: Optimization and probability
In this class, we will cast a lot of AI problems as optimization problems, that is, finding the best
solution in a rigorous mathematical sense.
At the same time, we must be adroit at coping with uncertainty in the world,
and for that, we appeal to tools from probability.
-
Let $x_1, \dots, x_n$ be real numbers representing positions on a number line.
Let $w_1, \dots, w_n$ be positive real numbers representing the importance of each of these positions.
Consider the quadratic function: $f(\theta) = \frac{1}{2} \sum_{i=1}^n w_i (\theta - x_i)^2$.
What value of $\theta$ minimizes $f(\theta)$? What problematic issues could arise if some of the $w_i$'s are negative?
Note: You can think about this problem as trying to find the point $\theta$ that's not too far
away from the $x_i$'s. Over time, hopefully you'll appreciate how nice quadratic functions are to minimize.
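If you want to sanity-check your closed-form answer, a minimal numerical sketch (with made-up values of $x_i$ and $w_i$; this is not part of the required solution) could look like this:

import numpy as np

# Made-up example data, purely for checking intuition.
x = np.array([1.0, 3.0, 7.0])   # positions x_i
w = np.array([2.0, 1.0, 0.5])   # positive importance weights w_i

def f(theta):
    # f(theta) = 1/2 * sum_i w_i * (theta - x_i)^2
    return 0.5 * np.sum(w * (theta - x) ** 2)

# Evaluate f on a fine grid and report the approximate minimizer,
# which you can compare against your closed-form expression.
grid = np.linspace(-10.0, 10.0, 100001)
print("approximate minimizer:", grid[np.argmin([f(t) for t in grid])])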
-
In this class, there will be a lot of sums and maxes.
Let's see what happens if we switch the order.
Let $f(\mathbf x) = \sum_{i=1}^d \max_{s \in \{1,-1\}} s x_i$
and $g(\mathbf x) = \max_{s \in \{1,-1\}} \sum_{i=1}^d s x_i$,
where $\mathbf x = (x_1, \dots, x_d) \in \mathbb{R}^d$ is a real vector.
Does $f(\mathbf x) \le g(\mathbf x)$, $f(\mathbf x) = g(\mathbf x)$, or $f(\mathbf x) \ge g(\mathbf x)$ hold for all $\mathbf x$?
Prove it.
Hint: You may find it helpful to refactor the expressions so that they
are maximizing the same quantity over different-sized sets.
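Before writing the proof, you might run a quick numerical experiment (not a proof, and not required) to form a conjecture about which inequality holds:

import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # sum over i of max over s in {1, -1} of s * x_i
    return sum(max(s * xi for s in (1, -1)) for xi in x)

def g(x):
    # max over s in {1, -1} of sum over i of s * x_i
    return max(sum(s * xi for xi in x) for s in (1, -1))

for _ in range(5):
    x = rng.normal(size=4)
    print(round(f(x), 3), round(g(x), 3), f(x) >= g(x))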
-
Suppose you repeatedly roll a fair six-sided die until you roll a $1$ (and then you stop).
Every time you roll a $2$, you lose $a$ points, and every time you roll a $6$, you win $b$ points. You do not win or lose any points if you roll a $3$, $4$, or $5$.
What is the expected number of points (as a function of $a$ and $b$) you will have when you stop?
Hint: think about defining a recurrence.
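A Monte Carlo simulation (with arbitrary values of $a$ and $b$; again, only a sanity check and not part of the required derivation) can be used to verify your closed-form answer:

import random

def simulate(a, b, trials=200000):
    # Estimate the expected number of points at the time we stop.
    total = 0.0
    for _ in range(trials):
        points = 0.0
        while True:
            roll = random.randint(1, 6)
            if roll == 1:          # stop
                break
            elif roll == 2:        # lose a points
                points -= a
            elif roll == 6:        # win b points
                points += b
        total += points
    return total / trials

print(simulate(a=2.0, b=3.0))   # compare against your formula in a and b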
-
Suppose the probability of a coin turning up heads is $p$, where $0 \lt p \lt 1$,
and that we flip it 7 times and get $\{ \text{H}, \text{H}, \text{T}, \text{H}, \text{T} , \text{T}, \text{H} \}$.
We know the probability (likelihood) of obtaining this
sequence is $L(p) = p \cdot p \cdot (1-p) \cdot p \cdot (1-p) \cdot (1-p) \cdot p = p^4(1-p)^3$.
What value of $p$ maximizes $L(p)$?
What is an intuitive interpretation of this value of $p$?
Hint: Consider taking the derivative of $\log L(p)$. You can also directly take the derivative of $L(p)$, but it is cleaner and more natural to differentiate $\log L(p)$. You can verify for yourself
that the value of $p$ which maximizes $\log L(p)$ must also maximize $L(p)$ (you are not required to prove this in your solution).
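As with the earlier parts, you can check your answer numerically by evaluating $L(p)$ on a grid of values in $(0, 1)$; the sketch below is only a sanity check:

import numpy as np

def L(p):
    # Likelihood of the observed sequence: 4 heads and 3 tails.
    return p**4 * (1 - p)**3

grid = np.linspace(1e-6, 1 - 1e-6, 100001)
print("approximate maximizer:", grid[np.argmax(L(grid))])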
-
Let's practice taking gradients,
which is a key operation for being able to optimize continuous functions.
For $\mathbf w \in \mathbb R^d$ (represented as a column vector) and constants $\mathbf a_i, \mathbf b_j \in \mathbb R^d$ (also represented as column vectors) and $\lambda \in \mathbb R$, define
the scalar-valued function
$$f(\mathbf w) = \sum_{i=1}^n \sum_{j=1}^n (\mathbf a_i^\top \mathbf w - \mathbf b_j^\top \mathbf w)^2 + \lambda \|\mathbf w\|_2^2,$$
where the vector is $\mathbf w = (w_1, \dots, w_d)^\top$ and $\|\mathbf w\|_2 = \sqrt{\sum_{k=1}^d w_k^2}$ is known as the $L_2$ norm.
Compute the gradient $\nabla f(\mathbf w)$.
Recall: the gradient is a $d$-dimensional vector of the partial derivatives with respect to each $w_i$:
$$\nabla f(\mathbf w) = \left(\frac{\partial f(\mathbf w)}{\partial w_1}, \dots, \frac{\partial f(\mathbf w)}{\partial w_d}\right)^\top.$$
If you're not comfortable with vector calculus, first warm up by working out this problem using scalars in
place of vectors and derivatives in place of gradients.
Not everything for scalars goes through for vectors, but the two should at least be consistent with each other (when $d=1$).
Do not write out the summation over dimensions; keep your answer in vector notation, because writing out components gets tedious.
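One practical way to check an analytic gradient is to compare it against finite differences on made-up data; the sketch below (with arbitrary $n$, $d$, $\lambda$, $\mathbf a_i$, $\mathbf b_j$) is not required, but it can catch algebra mistakes:

import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 5, 3, 0.1
A = rng.normal(size=(n, d))   # row i is a_i
B = rng.normal(size=(n, d))   # row j is b_j

def f(w):
    aw, bw = A @ w, B @ w     # a_i^T w and b_j^T w for all i, j
    return np.sum((aw[:, None] - bw[None, :]) ** 2) + lam * np.dot(w, w)

def numerical_gradient(w, eps=1e-6):
    # Central finite differences, one coordinate at a time.
    grad = np.zeros_like(w)
    for k in range(d):
        e = np.zeros(d); e[k] = eps
        grad[k] = (f(w + e) - f(w - e)) / (2 * eps)
    return grad

w = rng.normal(size=d)
print(numerical_gradient(w))   # compare against your analytic gradient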
Problem 2: Complexity
When designing algorithms, it's useful to be able to do quick back-of-the-envelope
envelope calculations to see how much time or space an algorithm needs.
Hopefully, you'll start to get more intuition for this by being exposed
to different types of problems.
-
Suppose we have an image of a human face consisting of $n \times n$ pixels.
In our simplified setting, a face consists of two eyes, two ears, one nose, and one mouth,
each represented as an arbitrary axis-aligned rectangle (i.e. the axes of the
rectangle are aligned with the axes of the image). As we'd like
to handle Picasso portraits too, there are no constraints on the location or
size of the rectangles.
How many possible faces (that is, choices of the six component rectangles) are there?
In general, we only care about asymptotic complexity,
so give your answer in the form of $O(n^c)$ or $O(c^n)$ for some integer $c$.
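If you want to build intuition for the exponent, you can brute-force count the basic building block, axis-aligned rectangles in an $n \times n$ image, for tiny $n$; the convention below (a rectangle is a choice of top-left and bottom-right pixels, so single-pixel rectangles count) is an arbitrary choice and does not affect the asymptotics:

def count_rectangles(n):
    # Enumerate all axis-aligned rectangles by choosing their corner
    # coordinates; only feasible for tiny n.
    count = 0
    for x1 in range(n):
        for x2 in range(x1, n):
            for y1 in range(n):
                for y2 in range(y1, n):
                    count += 1
    return count

for n in range(1, 6):
    print(n, count_rectangles(n))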
-
Suppose we have an $n\times n$ grid.
We start in the upper-left corner (position $(1,1)$), and we would like to reach the lower-right corner (position $(n,n)$) by taking single steps down and right.
Define a function $c(i, j)$ to be the cost of touching position $(i, j)$, and assume it takes constant time to compute.
Note that $c(i, j)$ can be negative.
Give an algorithm for computing the minimum total cost of such a path (i.e., the sum of $c(i, j)$ over all positions touched) in the most efficient way.
What is the runtime (just give the big-O)?
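To test whatever algorithm you come up with, it can help to have an exponential-time reference that tries every down/right path; the sketch below uses a made-up cost function and is only usable for tiny $n$:

def min_cost_brute_force(n, c):
    # Try every sequence of down/right moves from (1, 1) to (n, n).
    # Exponential time; only for checking answers on tiny n.
    def best_from(i, j):
        if i == n and j == n:
            return c(i, j)
        options = []
        if i < n:
            options.append(best_from(i + 1, j))
        if j < n:
            options.append(best_from(i, j + 1))
        return c(i, j) + min(options)
    return best_from(1, 1)

print(min_cost_brute_force(4, lambda i, j: (i * j) % 3 - 1))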
-
Suppose we have a staircase with $n$ steps (we start on the ground, so we need $n$ total steps to reach the top).
We can take any number of steps forward at a time, but we never step backwards.
How many ways are there to reach the top?
Give your answer as a function of $n$.
For example: if $n = 3$, then the answer is $4$.
The four options are the following:
(1) take one step, take one step, take one step
(2) take two steps, take one step
(3) take one step, take two steps
(4) take three steps.
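A direct (exponential-time) recursion makes a handy cross-check for small $n$ before you commit to a closed-form answer:

def count_ways(n):
    # Number of ways to climb n remaining steps, taking any positive
    # number of steps at a time. Exponential time; small n only.
    if n == 0:
        return 1   # exactly one way: stop
    return sum(count_ways(n - k) for k in range(1, n + 1))

for n in range(1, 7):
    print(n, count_ways(n))   # n = 3 should print 4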
-
Consider the scalar-valued function $f(\mathbf w)$ from Problem 1e.
Devise a strategy that first does preprocessing in $O(n d^2)$ time,
and then for any given vector $\mathbf w$,
takes only $O(d^2)$ time to compute $f(\mathbf w)$.
Hint: Refactor the algebraic expression; this is a classic trick used in machine learning.
Again, you may find it helpful to work out the scalar case first.
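Once you have a candidate preprocessing scheme, it is worth checking it against a direct evaluation of $f(\mathbf w)$ on random data; the names preprocess and f_fast below are hypothetical placeholders for whatever you write:

import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 500, 50, 0.1
A = rng.normal(size=(n, d))   # row i is a_i
B = rng.normal(size=(n, d))   # row j is b_j

def f_direct(w):
    # Straightforward evaluation that touches every (i, j) pair.
    aw, bw = A @ w, B @ w
    return np.sum((aw[:, None] - bw[None, :]) ** 2) + lam * np.dot(w, w)

w = rng.normal(size=d)
print(f_direct(w))

# Hypothetical interface for your refactored solution:
#   stats = preprocess(A, B)        # O(n d^2) work, done once
#   f_fast(w, stats)                # O(d^2) work per query w
# A correctness check would then be:
#   assert np.isclose(f_direct(w), f_fast(w, preprocess(A, B)))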
Problem 3: Programming
In this problem, you will implement a bunch of short functions. The main
purpose of this exercise is to familiarize yourself with Python,
but as a bonus, the functions that you will implement will come in handy in
subsequent homeworks.
If you're new to Python, the following provide pointers to various
tutorials and examples for the language:
-
Implement findAlphabeticallyLastWord in submission.py.
-
Implement euclideanDistance in submission.py.
-
Implement mutateSentences in submission.py.
-
Implement sparseVectorDotProduct in submission.py.
-
Implement incrementSparseVector in submission.py.
-
Implement findSingletonWords in submission.py.
-
Implement computeLongestPalindromeLength in submission.py.
Problem 4: Societal impact
AI can no longer be viewed as a neutral technology, for its impact on society is increasing [1, 2].
While most of this course focuses on the technology itself,
we would like you to pause for a moment to reflect on how
this technology can influence people's lives.
-
Describe one positive use of AI to improve society that you are most excited about.
What would be a concrete grand challenge or milestone in this area?
-
Describe one negative use or risk of AI that you are most worried about.
How would you go about preventing this?
Problem 5: Background survey
As a big class, CS221 has a wide range of people coming from different backgrounds.
To get a better sense of who you are, we would like you to tell us what you
know so far.
-
Which of the following courses have you taken?
- Basic programming (CS106 or equivalent)
- Systems programming (CS107 or equivalent)
- Probability (CS109 or equivalent)
- Math class that requires proofs (CS103, etc.)
- Algorithms (CS161 or equivalent)
- Linear algebra (MATH104 or equivalent)
- Machine learning (CS229 or equivalent)
- NLP, vision, or robotics
-
For each of the following concepts, on a scale of 0-6, please rate your expertise.
Ratings:
- 0: I have never heard of it
- 1: I don't know what it is
- 2: I have a vague idea
- 3: I have a basic understanding
- 4: I would be comfortable teaching someone
- 5: I have worked extensively with it in my research or work
- 6: I invented it
Concepts:
- Anonymous functions (lambdas)
- Gradient of a vector-valued function
- Conditional independence
- Dynamic programming
- Logistic regression
- Cross validation
- Beam search
- Existential quantification
- Convexity