Welcome to your first CS221 assignment! The goal of this assignment is to sharpen your math and programming skills needed for this class. If you meet the prerequisites, you should find these problems relatively innocuous. Some of these problems will occur again as subproblems of later homeworks, so make sure you know how to do them. If you're unsure about them or need a refresher, we recommend going through our prerequisites module or other resources on the Internet, or coming to office hours.

Before you get started, please read the Homeworks section on the course website thoroughly.

## Problem 1: Optimization and probability

In this class, we will cast a lot of AI problems as optimization problems, that is, finding the best solution in a rigorous mathematical sense. At the same time, we must be adroit at coping with uncertainty in the world, and for that, we appeal to tools from probability.

1. Let $x_1, \dots, x_n$ be real numbers representing positions on a number line. Let $w_1, \dots, w_n$ be positive real numbers representing the importance of each of these positions. Consider the quadratic function: $f(\theta) = \frac{1}{2} \sum_{i=1}^n w_i (\theta - x_i)^2$. Note that $\theta$ here is a scalar. What value of $\theta$ minimizes $f(\theta)$? Show that the optimum you find is indeed a minimum. What problematic issues could arise if some of the $w_i$'s are negative?

Note: You can think about this problem as trying to find the point $\theta$ that's not too far away from the $x_i$'s. Over time, hopefully you'll appreciate how nice quadratic functions are to minimize.
An expression for the value of $\theta$ that minimizes $f(\theta)$ and how you got it. A short calculation/argument to show that it is a minimum. 1-2 sentences describing a problem that could arise if some of the $w_i$'s are negative.
2. In this class, there will be a lot of sums and maxes. Let's see what happens if we switch the order. Let $f(\mathbf x) = \sum_{i=1}^d \max_{s_i \in \{1,-1\}} s_i x_i$ and $g(\mathbf x) = \max_{s \in \{1,-1\}} \sum_{i=1}^d s x_i$, where $\mathbf x = (x_1, \dots, x_d) \in \mathbb{R}^d$ is a real vector. Which of $f(\mathbf x) \le g(\mathbf x)$, $f(\mathbf x) = g(\mathbf x)$, or $f(\mathbf x) \ge g(\mathbf x)$ is true for all $\mathbf x$? Prove it.

Hint: You may find it helpful to refactor the expressions so that they are maximizing the same quantity over different sized sets.
A short (3-5) line/sentence proof. You should use mathematical notation in your proof, but can also make your argument in words.
3. Suppose you repeatedly roll a fair six-sided die until you roll a $1$ (and then you stop). Every time you roll a $2$, you lose $a$ points, and every time you roll a 6, you win $b$ points. You do not win or lose any points if you roll a 3, 4, or a 5. What is the expected number of points (as a function of $a$ and $b$) you will have when you stop?

Hint: You will find it helpful to define a recurrence. If you define $V$ as the expected number of points you get from playing the game, what happens if you roll a 2? You lose $a$ points and then get to play again. What about the other cases? Can you write this as a recurrence?
A recurrence to represent the problem and the resulting expression from solving the recurrence (no more than 1-2 lines).
4. Suppose the probability of a coin turning up heads is $0 \lt p \lt 1$, and we flip it 7 times and get $\{ \text{H}, \text{H}, \text{T}, \text{H}, \text{T} , \text{T}, \text{H} \}$. We know the probability (likelihood) of obtaining this sequence is $L(p) = p p (1-p) p (1-p) (1-p) p = p^4(1-p)^3$. What value of $p$ maximizes $L(p)$? Prove/Show that this value of $p$ maximizes $L(p)$. What is an intuitive interpretation of this value of $p$?

Hint: Consider taking the derivative of $\log L(p)$. You can also directly take the derivative of $L(p)$, but it is cleaner and more natural to differentiate $\log L(p)$. You can verify for yourself that the value of $p$ which maximizes $\log L(p)$ must also maximize $L(p)$ (you are not required to prove this in your solution).
The value of $p$ that maximizes $L(p)$ and the work/calculation used to solve for it. Note that you must prove/show that it is a maximum. A 1-sentence intuitive interpretation of the value of $p$.
5. Now for a little bit of practice manipulating conditional probabilities. Suppose that $A$ and $B$ are two events such that $P(A|B) = P(B|A)$. We also know that and $P(A \cup B) = 1$ and $P(A \cap B) > 0$. Prove that $P(A) > \frac{1}{2}$.

Hint: Note that $A$ and $B$ are not necessarily mutually exclusive. Consider how we can relate $P(A \cup B)$ and $P(A \cap B)$.
A short (~5 line) proof/derivation.
6. Let's practice taking gradients, which is a key operation for being able to optimize continuous functions. For $\mathbf w \in \mathbb R^d$ (represented as a column vector) and constants $\mathbf a_i, \mathbf b_j \in \mathbb R^d$ (also represented as column vectors) and $\lambda \in \mathbb R$, define the scalar-valued function $$f(\mathbf w) = \Bigg( \sum_{i=1}^n \sum_{j=1}^n (\mathbf a_i^\top \mathbf w - \mathbf b_j^\top \mathbf w)^2 \Bigg) + \lambda \|\mathbf w\|_2^2,$$ where the vector is $\mathbf w = (w_1, \dots, w_d)^\top$ and $\|\mathbf w\|_2 = \sqrt{\sum_{k=1}^d w_k^2} = \sqrt{{\mathbf w}^T {\mathbf w}}$ is known as the $L_2$ norm. Compute the gradient $\nabla f(\mathbf w)$.

Recall: the gradient is a $d$-dimensional vector of the partial derivatives with respect to each $w_i$: $$\nabla f(\mathbf w) = \left(\frac{\partial f(\mathbf w)}{\partial w_1}, \dots \frac{\partial f(\mathbf w)}{\partial w_d}\right)^\top.$$ If you're not comfortable with vector calculus, first warm up by working out this problem using scalars in place of vectors and derivatives in place of gradients. Not everything for scalars goes through for vectors, but the two should at least be consistent with each other (when $d=1$). Do not write out summations over dimensions, because that gets tedious.
An expression for the gradient and the work used to derive it. (~5 lines). No need to expand out terms unnecessarily; try to write the final answer compactly.

## Problem 2: Complexity

When designing algorithms, it's useful to be able to do quick back-of-the-envelope calculations to see how much time or space an algorithm needs. Hopefully, you'll start to get more intuition for this by being exposed to different types of problems.

1. Suppose we have an $n \times n$ grid, where we'd like to place 6 arbitrary axis-aligned rectangles (i.e., the sides of the rectangle are parallel to the axes). There are no constraints on the location or size of the rectangles. For example, it is possible for all four corners of a single rectangle to be the same point (resulting in a rectangle of size 0) or for all 6 rectangles to be on top of each other. How many possible ways are there to place 6 rectangles on the grid? In general, we only care about asymptotic complexity, so give your answer in the form of $O(n^c)$ or $O(c^n)$ for some integer $c$.

Note: It is unnecessary to consider whether order matters in this problem, since we are asking for asymptotic complexity. You are free to assume either in your solution, as it doesn’t change the final answer.
A big-O bound for the number of possible ways to place 6 rectangles and some simple explanation/reasoning for the answer (~2 sentences).
2. Suppose we have an $n\times n$ grid. We start in the upper-left corner (position $(1,1)$), and we would like to reach the lower-right corner (position $(n,n)$) by taking single steps down or to the right. Suppose we are provided with a function $c(i, j)$ that outputs the cost of touching position $(i, j)$, and assume it takes constant time to compute for each position. Note that $c(i, j)$ can be negative. Give an algorithm for computing the cost of the minimum-cost path from $(1,1)$ to $(n,n)$ in the most efficient way (with smallest big-O time complexity). What is the runtime (just give the big-O)?
A description of the algorithm for computing the cost of the minimum-cost path as efficiently as possible (~5 sentences). The big-O runtime and a short explanation of how it arises from the algorithm.

## Problem 3: Programming

In this problem, you will implement a bunch of short functions. The main purpose of this exercise is to familiarize yourself with Python, but as a bonus, the functions that you will implement will come in handy in subsequent homeworks.

Do not import any outside libraries (e.g. numpy). Only standard Python libraries and/or the libraries imported in the starter code are allowed.

If you're new to Python, the following provide pointers to various tutorials and examples for the language:

Python code implementing the functions provided in submission.py. Try to make your code as clean and simple as possible and be sure to write your answers between the begin answer and end answer comments.
1. Implement find_alphabetically_last_word in submission.py.
2. Implement euclidean_distance in submission.py.
3. Implement mutate_sentences in submission.py.
4. Implement sparse_vector_dot_product in submission.py.
5. Implement increment_sparse_vector in submission.py.
6. Implement find_singleton_words in submission.py.

## Problem 4: Background Survey

As a big class and the first AI/ML course taken by many students, CS221 has a wide range of people coming from different backgrounds. On top of that, the transition to remote learning has created new challenges for all of us. We know that these challenges affect students unevenly and to different degrees depending on their individual circumstances. As a result, we've put together a quick survey to get a little more information about the background of everyone in the class and figure out what we can do to best support each one of you this quarter.

This survey is optional and will not be graded, so your answers will not affect your grade whatsoever. However, it will be a huge help to us in making CS221 a great experience for everyone!

Just fill out the survey!