1. Words are simply strings separated by whitespace. Note that words which only differ in capitalization are considered separate (e.g. great and Great are considered different words).
2. You might find some useful functions in util.py. Have a look around in there before you start coding.
3. We also recommend you check the Piazza release post for this homework here for any corrections/clarifications made post-release.
Problem 1: Building intuition

Here are two reviews of Frozen, courtesy of Rotten Tomatoes (no spoilers!):

Rotten Tomatoes has classified these reviews as "positive" and "negative,", respectively, as indicated by the intact tomato on the left and the splattered tomato on the right. In this assignment, you will create a simple text classification system that can perform this task automatically.

We'll warm up with the following set of four mini-reviews, each labeled positive ($+1$) or negative ($-1$):
1. ($-1$) pretty bad
2. ($+1$) good plot
3. ($-1$) not good
4. ($+1$) pretty scenery
Each review $x$ is mapped onto a feature vector $\phi(x)$, which maps each word to the number of occurrences of that word in the review. For example, the first review maps to the (sparse) feature vector $\phi(x) = \{\text{pretty}:1, \text{bad}:1\}$. Recall the definition of the hinge loss: $$\text{Loss}_{\text{hinge}}(x, y, \mathbf{w}) = \max \{0, 1 - \mathbf{w} \cdot \phi(x) y\},$$ where $y$ is the correct label.
1. Suppose we run stochastic gradient descent, updating the weights according to $$\mathbf{w} \leftarrow \mathbf{w} - \eta \nabla_\mathbf{w} \text{Loss}_{\text{hinge}}(x, y, \mathbf{w}),$$ once for each of the four examples in the order given above. After the classifier is trained on the given four data points, what are the weights of the six words ("pretty", "good", "bad", "plot", "not", "scenery") that appear in the above reviews? Please label your weight vector, indicating which word corresponds to each index. Use $\eta = .5$ as the step size and initialize $\mathbf{w} = [0, ..., 0]$. Assume that $\nabla_\mathbf{w} \text{Loss}_{\text{hinge}}(x, y, \mathbf{w}) = 0$ when the margin is exactly 1.
2. A weight vector that contains a numerical value for each of the tokens in the reviews ("pretty", "good", "bad", "plot", "not", "scenery"). Please label your weight vector, indicating which word corresponds to each index.
3. Create a small labeled dataset of four mini-reviews using the words "not", "good", and "bad", where the labels make intuitive sense. Each review should contain one or two words, and no repeated words. No two reviews in the dataset should be the same.

Prove that no linear classifier using word features can get zero error on your dataset. Remember that this is a question about classifiers, not optimization algorithms; your proof should be true for any linear classifier, regardless of how the weights are learned.

Propose a single additional feature for your dataset that we could augment the feature vector with that would fix this problem. What is the resulting weight vector for your dataset when you introduce this new feature? (Hint: think about the linear effect that each feature has on the classification score.)
4. Your answer should include: 1. a small dataset of reviews (the dataset should only contain examples using the words "not", "good", and "bad"), 2. a short written proof (~3-5 sentences), 3. a viable feature that would allow a linear classifier to have zero error on the dataset (classify all examples correctly), and 4. the weight vector that results from adding this new feature.
Problem 2: Predicting Movie Ratings

Suppose that we are now interested in predicting a numeric rating for movie reviews. We will use a non-linear predictor that takes a movie review $x$ and returns $\sigma(\mathbf w \cdot \phi(x))$, where $\sigma(z) = (1 + e^{-z})^{-1}$ is the logistic function that squashes a real number to the range $(0, 1)$. Suppose that we wish to use the squared loss. For this problem, assume that the movie rating $y$ is a real-valued variable in the range $[0, 1]$.
Note: Do not use Wolfram Alpha or other math software to solve this problem.

1. Write out the expression for $\text{Loss}(x, y, \mathbf w)$ for a single datapoint $(x,y)$.
2. A mathematical expression for the loss. Feel free to use $\sigma$ in the expression.
3. Compute the gradient of the loss with respect to w.
Hint: you can write the answer in terms of the predicted value $p = \sigma(\mathbf w \cdot \phi(x))$.
4. A mathematical expression for the gradient of the loss.
5. Suppose there is one datapoint (x, y) with some arbitrary $\phi(x)$ and y = 1. Can you specify conditions for $\mathbf w$ to make the magnitude of the gradient of the loss with respect to $\mathbf w$ arbitrarily small (minimize the magnitude of the gradient)? If so, how small? Can the magnitude of the gradient with respect to $\mathbf w$ ever be exactly zero? You are allowed to make the magnitude of $\mathbf w$ arbitrarily large but not infinity.

Hint: try to understand intuitively what is going on and what each part of the expression contributes. If you find yourself doing too much algebra, you're probably doing something suboptimal.

Motivation: the reason why we're interested in the magnitude of the gradients is because it governs how far gradient descent will step. For example, if the gradient is close to zero when $\mathbf w$ is very far from the optimum, then it could take a long time for gradient descent to reach the optimum (if at all). This is known as the vanishing gradient problem when training neural networks.

6. A paragraph containing 1 to 2 sentence answers for each of the subquestions in the problem.
7. Assuming the same datapoint (x, y) as above, what is the largest magnitude that the gradient can take? Leave your answer in terms of $\|\phi(x)\|$. Prove that your chosen value is indeed the largest magnitude that the gradient can take.
8. A value for the largest magnitude of the gradient, followed by a proof. You should use mathematical notation in your proof, but can also make your argument in words.
Problem 3: Sentiment Classification

In this problem, we will build a binary linear classifier that reads movie reviews and guesses whether they are "positive" or "negative."

Do not import any outside libraries (e.g. numpy) for any of the coding parts. Only standard python libraries and/or the libraries imported in the starter code are allowed. In this problem, you must implement the functions without using libraries like Scikit-learn.
1. Implement the function extractWordFeatures, which takes a review (string) as input and returns a feature vector $\phi(x)$ (you should represent the vector $\phi(x)$ as a dict in Python).
2. Implement the function learnPredictor using stochastic gradient descent and minimize the hinge loss. Print the training error and validation error after each epoch to make sure your code is working. You must get less than 4% error rate on the training set and less than 30% error rate on the validation set to get full credit.
3. Create an artificial dataset for your learnPredictor function by writing the generateExample function (nested in the generateDataset function). Use this to double check that your learnPredictor works! You can do this by using generateDataset() to generate training and validation examples. You can then pass in these examples as trainExamples and validationExamples respectively to learnPredictor with the identity function (lambda x: x) as a featureExtractor.
4. When you run the grader.py on test case 3b-2, it should output a weights file and a error-analysis file. Look through some example incorrect predictions and for five of them, give a one-sentence explanation of why the classification was incorrect. What information would the classifier need to get these correct? In some sense, there's not one correct answer, so don't overthink this problem. The main point is to convey intuition about the problem. Note: you do not need to pick five different types of errors and explain each. It suffices to show five instances of the same type of error, and for each explain why the classification was incorrect.
5. 1) Five sample incorrect predictions, each with one sentence explaining why the classifications for these sentences was incorrect and 2) A single separate paragraph (3-5 sentences) outlining what information the classifier would need to get these predictions correct.
6. Now we will try a crazier feature extractor. Some languages are written without spaces between words. So is splitting the words really necessary or can we just naively consider strings of characters that stretch across words? Implement the function extractCharacterFeatures (by filling in the extract function), which maps each string of $n$ characters to the number of times it occurs, ignoring whitespace (spaces and tabs).
7. Run your linear predictor with feature extractor extractCharacterFeatures. Experiment with different values of $n$ to see which one produces the smallest validation error. You should observe that this error is nearly as small as that produced by word features. How do you explain this?

Construct a review (one sentence max) in which character $n$-grams probably outperform word features, and briefly explain why this is so.

Note: There is code in submission.py that will help you test different values of n. Remember to write your final written solution in sentiment.pdf.

A short paragraph (~4-6) sentences. In the paragraph state which value of $n$ produces the smallest validation error, why this is likely the value that produces the smallest error, and provide a review and explanation for when character $n$-grams probably outperform word features.
Problem 4: K-means clustering
Suppose we have a feature extractor $\phi$ that produces 2-dimensional feature vectors, and a toy dataset $\mathcal D_\text{train} = \{x_1, x_2, x_3, x_4\}$ with
1. $\phi(x_1) = [1, 0]$
2. $\phi(x_2) = [1, 2]$
3. $\phi(x_3) = [3, 0]$
4. $\phi(x_4) = [2, 2]$
1. Run 2-means on this dataset until convergence. Please show your work. What are the final cluster assignments $z$ and cluster centers $\mu$? Run this algorithm twice with the following initial centers:
1. $\mu_1 = [2, 3]$ and $\mu_2 = [2, -1]$
2. $\mu_1 = [0, 1]$ and $\mu_2 = [3, 2]$
Show the cluster centers and assignments for each step.
2. Implement the kmeans function. You should initialize your $k$ cluster centers to random elements of examples. After a few iterations of k-means, your centers will be very dense vectors. In order for your code to run efficiently and to obtain full credit, you will need to precompute certain quantities. As a reference, our code runs in under a second on cardinal, on all test cases. You might find generateClusteringExamples in util.py useful for testing your code. In this problem, you are not allowed to use libraries like Scikit-learn.
3. Sometimes, we have prior knowledge about which points should belong in the same cluster. Suppose we are given a set $S$ of example pairs $(i,j)$ which must be assigned to the same cluster. For example, suppose we have 5 examples; then $S = \{ (1, 5), (2, 3), (3, 4) \}$ says that examples 2, 3, and 4 must be in the same cluster and that examples 1 and 5 must be in the same cluster. Provide the modified k-means algorithm that performs alternating minimization on the reconstruction loss: $$\sum \limits_{i=1}^n \| \mu_{z_i} - \phi(x_i) \|^2$$ where $\mu_{z_i}$ is the assigned centroid for the feature vector $\phi(x_i)$.

Recall that alternating minimization is when we are optimizing two variables jointly by alternating which variable we keep constant. We recommend starting by first keeping $z$ fixed and optimizing over $\mu$ and then keeping $\mu$ fixed and optimizing over $z$.
4. A written explanation for the modified k-means algorithm, which includes a mathematical expression representing the modified cluster assignment update rule for the k-means steps.
5. What is the advantage of running K-means multiple times on the same dataset with the same K, but different random initializations?
6. A short (~1-3 sentences) written explanation.
7. If we scale all dimensions in our initial centroids and data points by some factor, are we guaranteed to retrieve the same clusters after running K-means (i.e. will the same data points belong to the same cluster before and after scaling)? What if we scale only certain dimensions? If your answer is yes, provide a short explanation. If it is no, provide a counterexample.
8. This response should have two parts. The first should be a yes/no response and explanation or counterexample for the first subquestion (scaling all dimensions). The first should be a yes/no response and explanation or counterexample for the first subquestion (scaling only certain dimensions).