Welcome to your first CS221 assignment!
The goal of this assignment is to sharpen your math and programming skills
needed for this class. If you meet the prerequisites, you should find these
problems relatively innocuous. Some of these problems will occur again
as subproblems of later homeworks, so make sure you know how to do them.
If you're unsure about them or need a refresher,
we recommend going through our prerequisites module or other resources on the Internet,
or coming to office hours.
Before you get started, please read the Homeworks section on the course website thoroughly.
Problem 1: Optimization and probability
In this class, we will cast a lot of AI problems as optimization problems, that is, finding the best
solution in a rigorous mathematical sense.
At the same time, we must be adroit at coping with uncertainty in the world,
and for that, we appeal to tools from probability.

Let $x_1, \dots, x_n$ be real numbers representing positions on a number line.
Let $w_1, \dots, w_n$ be positive real numbers representing the importance of each of these positions.
Consider the quadratic function: $f(\theta) = \frac{1}{2} \sum_{i=1}^n w_i (\theta  x_i)^2$. Note that $\theta$ here is a scalar.
What value of $\theta$ minimizes $f(\theta)$? Show that the optimum you find is indeed a minimum. What
problematic issues could arise if some of the $w_i$'s are negative?
Note: You can think about this problem as trying to find the point $\theta$ that's not too far
away from the $x_i$'s. Over time, hopefully you'll appreciate how nice quadratic functions are to minimize.
An expression for the value of $\theta$ that minimizes $f(\theta)$ and how you got it. A short calculation/argument to show that it is a minimum. 12 sentences describing a problem that could arise if some of the $w_i$'s are negative.

In this class, there will be a lot of sums and maxes.
Let's see what happens if we switch the order.
Let $f(\mathbf x) = \sum_{i=1}^d \max_{s_i \in \{1,1\}} s_i x_i$
and $g(\mathbf x) = \max_{s \in \{1,1\}} \sum_{i=1}^d s x_i$,
where $\mathbf x = (x_1, \dots, x_d) \in \mathbb{R}^d$ is a real vector.
Which of $f(\mathbf x) \le g(\mathbf x)$, $f(\mathbf x) = g(\mathbf x)$, or $f(\mathbf x) \ge g(\mathbf x)$ is true for
all $\mathbf x$?
Prove it.
Hint: You may find it helpful to refactor the expressions so that they
are maximizing the same quantity over different sized sets.
A short (35) line/sentence proof. You should use mathematical notation in your proof, but can also make your argument in words.

Suppose you repeatedly roll a fair sixsided die until you roll a $1$ (and then you stop).
Every time you roll a $2$, you lose $a$ points, and every time you roll a 6, you win $b$ points. You do not win
or lose any points if you roll a 3, 4, or a 5.
What is the expected number of points (as a function of $a$ and $b$) you will have when you stop?
Hint: You will find it helpful to define a recurrence. If you define $V$ as the expected number of points you get from playing the game, what happens if you roll a 2? You lose $a$ points and then get to play again. What about the other cases? Can you write this as a recurrence?
A recurrence to represent the problem and the resulting expression from solving the recurrence (no more than 12 lines).

Suppose the probability of a coin turning up heads is $0 \lt p \lt 1$,
and we flip it 7 times and get $\{ \text{H}, \text{H}, \text{T}, \text{H}, \text{T} , \text{T}, \text{H}
\}$.
We know the probability (likelihood) of obtaining this
sequence is $L(p) = p p (1p) p (1p) (1p) p = p^4(1p)^3$.
What value of $p$ maximizes $L(p)$?
Prove/Show that this value of $p$ maximizes $L(p)$.
What is an intuitive interpretation of this value of $p$?
Hint: Consider taking the derivative of $\log L(p)$. You can also directly take the derivative of $L(p)$,
but it is cleaner and more natural to differentiate $\log L(p)$. You can verify for yourself
that the value of $p$ which maximizes $\log L(p)$ must also maximize $L(p)$ (you are not required to prove this
in your solution).
The value of $p$ that maximizes $L(p)$ and the work/calculation used to solve for it. Note that you must prove/show that it is a maximum. A 1sentence intuitive interpretation of the value of $p$.

Now for a little bit of practice manipulating conditional probabilities. Suppose that $A$ and $B$ are two events such that $P(AB) = P(BA)$. We also know that and $P(A \cup B) = 1$ and $P(A \cap B) > 0$. Prove that $P(A) > \frac{1}{2}$.
Hint: Note that $A$ and $B$ are not necessarily mutually exclusive. Consider how we can relate $P(A \cup B)$ and $P(A \cap B)$.
A short (~5 line) proof/derivation.

Let's practice taking gradients,
which is a key operation for being able to optimize continuous functions.
For $\mathbf w \in \mathbb R^d$ (represented as a column vector) and constants $\mathbf a_i, \mathbf b_j \in
\mathbb R^d$ (also represented as column vectors) and $\lambda \in \mathbb R$, define
the scalarvalued function
$$f(\mathbf w) = \Bigg( \sum_{i=1}^n \sum_{j=1}^n (\mathbf a_i^\top \mathbf w  \mathbf b_j^\top \mathbf w)^2 \Bigg) + \lambda
\\mathbf w\_2^2,$$
where the vector is $\mathbf w = (w_1, \dots, w_d)^\top$ and $\\mathbf w\_2 = \sqrt{\sum_{k=1}^d w_k^2} = \sqrt{{\mathbf w}^T {\mathbf w}}$ is
known as the $L_2$ norm.
Compute the gradient $\nabla f(\mathbf w)$.
Recall: the gradient is a $d$dimensional vector of the partial derivatives with respect to each $w_i$:
$$\nabla f(\mathbf w) = \left(\frac{\partial f(\mathbf w)}{\partial w_1}, \dots \frac{\partial f(\mathbf
w)}{\partial w_d}\right)^\top.$$
If you're not comfortable with vector calculus, first warm up by working out this problem using scalars in
place of vectors and derivatives in place of gradients.
Not everything for scalars goes through for vectors, but the two should at least be consistent with each other
(when $d=1$).
Do not write out summations over dimensions, because that gets tedious.
An expression for the gradient and the work used to derive it. (~5 lines). No need to expand out terms unnecessarily; try to write the final answer compactly.
Problem 2: Complexity
When designing algorithms, it's useful to be able to do quick backoftheenvelope
calculations to see how much time or space an algorithm needs.
Hopefully, you'll start to get more intuition for this by being exposed
to different types of problems.

Suppose we have an $n \times n$ grid, where we'd like to place 6 arbitrary axisaligned rectangles (i.e., the sides of the rectangle are parallel to the axes).
There are no constraints on the location or size of the rectangles. For example, it is possible for all four corners of a single rectangle to be the same point (resulting in a rectangle of size 0) or for all 6 rectangles to be on top of each other.
How many possible ways are there to place 6 rectangles on the grid?
In general, we only care about asymptotic complexity,
so give your answer in the form of $O(n^c)$ or $O(c^n)$ for some integer $c$.
Note: It is unnecessary to consider whether order matters in this problem, since we are asking for asymptotic complexity. You are free to assume either in your solution, as it doesnâ€™t change the final answer.
A bigO bound for the number of possible ways to place 6 rectangles and some simple explanation/reasoning for the answer (~2 sentences).

Suppose we have an $n\times n$ grid.
We start in the upperleft corner (position $(1,1)$), and we would like to reach the lowerright corner
(position $(n,n)$) by taking single steps down or to the right.
Suppose we are provided with a function $c(i, j)$ that outputs the cost of touching position $(i, j)$, and assume it takes constant time to
compute for each position.
Note that $c(i, j)$ can be negative.
Give an algorithm for computing the cost of the minimumcost path from $(1,1)$ to $(n,n)$ in the most efficient way (with smallest bigO time complexity).
What is the runtime (just give the bigO)?
A description of the algorithm for computing the cost of the minimumcost path as efficiently as possible (~5 sentences). The bigO runtime and a short explanation of how it arises from the algorithm.
Problem 3: Programming
In this problem, you will implement a bunch of short functions. The main
purpose of this exercise is to familiarize yourself with Python,
but as a bonus, the functions that you will implement will come in handy in
subsequent homeworks.
Do not import any outside libraries (e.g. numpy). Only standard Python
libraries and/or the libraries imported in the starter code are allowed.
If you're new to Python, the following provide pointers to various
tutorials and examples for the language:
Python code implementing the functions provided in submission.py. Try to make your code as clean and simple as possible and be sure to write your answers between the begin answer and end answer comments.

Implement
find_alphabetically_last_word
in submission.py
.

Implement
euclidean_distance
in submission.py
.

Implement
mutate_sentences
in submission.py
.

Implement
sparse_vector_dot_product
in submission.py
.

Implement
increment_sparse_vector
in submission.py
.

Implement
find_singleton_words
in submission.py
.
Problem 4: Background Survey
As a big class and the first AI/ML course taken by many students, CS221 has a wide range of people coming from different backgrounds.
On top of that, the transition to remote learning has created new challenges for all of us. We know that these challenges affect students unevenly and to different degrees depending on their individual circumstances.
As a result, we've put together a quick survey to get a little more information about the background of everyone in the class and figure out what we can do to best support each one of you this quarter.
Please fill out the Google form here.
This survey is optional and will not be graded, so your answers will not affect your grade whatsoever.
However, it will be a huge help to us in making CS221 a great experience for everyone!
Just fill out the survey!
Submission
Submission is done on Gradescope.
Written: When submitting the written parts, make sure to select all the pages
that contain part of your answer for that problem, or else you will not get credit.
To double check after submission, you can click on each problem link on the right side, and it should show
the pages that are selected for that problem.
Programming: After you submit, the autograder will take a few minutes to run. Check back after
it runs to make sure that your submission succeeded. If your autograder crashes, you will receive a 0 on the
programming part of the assignment.
More details can be found in the Submission section on the course website.