Installation Guide for Homework Environment

Prerequisites:

Ensure that you're using Python version 3.12. Check your Python version by running:

    python --version
    

or

    python3 --version
    

Installing uv (Recommended Python Package Manager):

We recommend using uv as it's much faster than pip and conda for managing Python environments and packages.

What is uv?

uv is a modern, Rust-based package + project manager for Python. It keeps the familiar pip workflow but re-implements the engine for speed and reliability. Concretely: it creates a venv, resolves and installs dependencies with its own fast installer, and deduplicates files via a global cache (copy-on-write on macOS, hardlinks on Linux/Windows). It can also manage Python versions per project (e.g., pin 3.12) so each assignment uses a clean, reproducible interpreter. Think "pip + virtualenv + pip-tools + pyenv/pipx".

Installing uv:

Please refer to the official uv installation documentation for the most up-to-date installation instructions for your platform.

Setting Up the Homework Environment with uv:

Create a virtual environment with the required dependencies (with uv, you don't need to activate it manually; `uv run` executes commands inside the project's environment):

macOS/Linux:

# Install uv once
curl -LsSf https://astral.sh/uv/install.sh | sh

# Optional: the `uv` binary is installed to `$HOME/.local/bin` on Linux/macOS by default,
# so you may need to add it to your PATH (the installer may have already done this for you):
export PATH="$HOME/.local/bin:$PATH"

Windows:

# Install uv once
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"

All platforms:

# Download the homework zip and unzip into `hw2_sentiment/`
# In your hw directory
uv init .                     # Initialize project (creates pyproject.toml)
uv python pin 3.12            # Pin Python version
uv add numpy einops torch     # Add dependencies
uv run python grader.py       # Run the local grader
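
To double-check that everything installed correctly, you can run a quick sanity check like the one below. This is only a suggestion; `check_env.py` is a hypothetical file name, and the versions printed on your machine will differ.

    # check_env.py -- hypothetical sanity-check script (run it with: uv run python check_env.py)
    import sys
    from importlib.metadata import version

    # These imports fail loudly if a dependency is missing from the environment.
    import einops
    import numpy
    import torch

    print("Python:", sys.version.split()[0])   # should report a 3.12.x version
    for package in ("numpy", "einops", "torch"):
        print(package, version(package))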

Running on Stanford FarmShare (Optional)

If you cannot run the assignment on your laptop or need additional computing resources, Stanford provides FarmShare, a community computing environment for coursework and unsponsored research. Please follow the instructions at https://docs.farmshare.stanford.edu/ to get started with the computing environment.


Advice for this homework:
  1. Words are simply strings separated by whitespace.
  2. You might find some useful functions in util.py. Have a look around in there before you start coding.

We've created a LaTeX template here for you to use that contains the prompts for each question.

Problem 1: Building Intuition for Bag-of-Words Features and Linear Classification

Social media platforms like Twitter are rich sources of emotional expression. In this homework, you'll build classifiers to detect emotions in tweets using linear classification with cross-entropy loss and softmax activation. Consider the following dataset of 6 tweets, each labeled with exactly one emotion: joy (J), anger (A), or fear (F):

Tweet | Emotion | One-hot encoding
"amazing day" | Joy | [1, 0, 0]
"scared of spiders" | Fear | [0, 0, 1]
"love this" | Joy | [1, 0, 0]
"so angry" | Anger | [0, 1, 0]
"so so worried about tomorrow" | Fear | [0, 0, 1]
"hate waiting" | Anger | [0, 1, 0]

Each tweet $x$ is mapped to a feature vector $f(x)$ using a bag-of-words representation (you can think of it as word counts). Machine learning models don't directly take a string as input; instead, as you have learned, they work with tensors (e.g., expressed in NumPy or PyTorch). One way to convert strings to tensors is "bag-of-words" features, which treat the input text as a "bag of words" and represent it using a vector with the same length as our vocabulary. The resulting representation is a vector where each position corresponds to a word in the vocabulary, and each value is how many times that word appears in the input text. If we have a very large vocabulary, we end up with a very sparse vector containing many 0's.
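
As a tiny, self-contained illustration of the idea (using a made-up three-word vocabulary, not the vocabulary defined for this problem), counting word occurrences in Python might look like this:

    # Toy bag-of-words example with a hypothetical three-word vocabulary.
    from collections import Counter

    vocab = ["cat", "dog", "the"]                 # hypothetical vocabulary, in a fixed order
    text = "the dog chased the cat"

    counts = Counter(text.split())                # words are simply strings separated by whitespace
    features = [counts[word] for word in vocab]   # "chased" is out-of-vocabulary, so it is ignored
    print(features)                               # [1, 1, 2] -> one count per vocabulary word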

For this problem, let's assume our vocabulary consists of all unique words in the tweets above: {about, amazing, angry, day, hate, love, of, scared, so, spiders, this, tomorrow, waiting, worried}.

To build our multi-class classifier, we use a weight matrix $\mathbf{W} \in \mathbb{R}^{d \times 3}$ where $d$ is the feature dimension (vocabulary size) and 3 is the number of classes. The model computes logits as $\mathbf{z} = \mathbf{W}^T f(x)$ and applies softmax to convert logits into probability values that sum to 1. Let $\mathbf{p}$ be the 3-dimensional vector of output probabilities, and $p_k$ be the probability that the input belongs to class $k$. The softmax probabilities are computed in the following way: $$p_k = \frac{e^{z_k}}{\sum_{j=1}^{3} e^{z_j}}$$
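
For intuition, here is a minimal NumPy sketch of the softmax formula above, evaluated on a made-up logit vector (not the one used in the questions below). Subtracting the maximum logit before exponentiating is a common numerical-stability trick and does not change the result.

    import numpy as np

    z = np.array([1.0, 0.0, -1.0])            # hypothetical logits, one per class
    shifted = z - np.max(z)                   # softmax is invariant to shifting all logits
    p = np.exp(shifted) / np.sum(np.exp(shifted))

    print(np.round(p, 3))                     # approximately [0.665, 0.245, 0.090]
    print(np.sum(p))                          # very close to 1.0, as probabilities should be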

Then, given the softmax probabilities, the classifier uses argmax to choose the predicted class, i.e., the class with the highest $p_k$.

The cross-entropy loss measures the difference between a model's predicted probabilities and the true probability distribution of the data. For a single example (with three output classes), it can be computed as follows: $$\text{Loss}_{\text{CE}}(x, \mathbf{y}, \mathbf{W}) = -\sum_{k=1}^{3} y_k \log p_k$$ where $\mathbf{y}$ is the one-hot encoded true label vector.
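
As a quick worked example with made-up numbers (different from the ones in the questions below): if the true label is Joy, so $\mathbf{y} = [1, 0, 0]$, and the model predicts $\mathbf{p} = [0.5, 0.3, 0.2]$, then only the term for the correct class survives: $$\text{Loss}_{\text{CE}} = -\left(1 \cdot \log 0.5 + 0 \cdot \log 0.3 + 0 \cdot \log 0.2\right) = -\log 0.5 \approx 0.693$$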

  1. Feature representation: Using the vocabulary given above, write out the bag-of-words feature vector $f(x)$ for the tweet "so so worried about tomorrow".
    A feature vector corresponding to the vocabulary: {about, amazing, angry, day, hate, love, of, scared, so, spiders, this, tomorrow, waiting, worried}.
  2. Softmax computation: Given logits $\mathbf{z} = [2.0, 1.0, -1.0]$ for the three classes [Joy, Anger, Fear], compute the softmax probabilities. Show your work.
    The softmax probability vector $[P(\text{Joy}), P(\text{Anger}), P(\text{Fear})]$ with values rounded to 3 decimal places.
  3. Cross-entropy Loss: For the tweet "so angry" with true label Anger (one-hot: [0, 1, 0]), suppose your model outputs probabilities $[0.2, 0.7, 0.1]$. Calculate the cross-entropy loss for this example. Then explain what happens to the loss as the predicted probability for the correct class approaches 1.
    1. The numerical cross-entropy loss value
    2. A brief explanation (2-3 sentences) of the loss behavior
  4. Gradient analysis: Derive $\frac{\partial \text{Loss}_{\text{CE}}}{\partial z_k}$, the gradient of the cross-entropy loss with respect to the logit $z_k$. Show the mathematical steps to find $\frac{\partial \text{Loss}_{\text{CE}}}{\partial z_k}$ and express your final answer in terms of $p_k$ and $y_k$ only. Then, explain intuitively why this expression makes sense for gradient-based learning.
    1. Mathematical derivation showing the steps to compute $\frac{\partial \text{Loss}_{\text{CE}}}{\partial z_k}$. (Hint: you may need to use the chain rule.)
    2. Final gradient expression in terms of $p_k$ and $y_k$
    3. A brief intuitive explanation of why this gradient expression makes sense for learning (2-3 sentences)
Problem 2: Building Intuition for Embeddings and Multilayer Perceptron

In the previous problem, we used bag-of-words features for sentiment classification. Bag-of-words is now considered an outdated way of representing text in machine learning because of its many limitations. For example, it doesn't capture word relationships (e.g., "amazing" and "wonderful" are similar but treated as completely different features), and it creates very sparse, high-dimensional vectors. In this problem, we'll explore how neural networks with word embeddings can address these issues.

We'll use the same set of tweets from Problem 1, but now represent each word with a dense 2-dimensional embedding vector. For simplicity, assume we have the following pre-trained word embeddings:

Word | Embedding
amazing | [0.8, 0.6]
day | [0.2, 0.1]
scared | [-0.5, -0.7]
of | [0.0, 0.0]
spiders | [-0.3, -0.4]
love | [0.9, 0.4]
this | [0.0, 0.0]
so | [0.1, 0.0]
angry | [-0.6, -0.8]
worried | [-0.3, -0.6]
about | [0.0, -0.1]
tomorrow | [0.2, -0.2]
hate | [-0.8, -0.5]
waiting | [-0.1, -0.3]

For simplicity, we will focus on binary sentiment classification (positive or negative) instead of three-class classification for this problem only.

  1. Embeddings via averaging: One simple approach is to represent each tweet as the average of its word embeddings. Compute the averaged embedding representation for "so angry". Comment on one advantage and one disadvantage of this approach.
    1. A 2-dimensional averaged embedding vector
    2. Discuss one benefit and one limitation of creating an embedding for a text by averaging the embeddings of its component words (2-3 sentences)
  2. Forward pass: Using the averaged representation for "so angry" from part (a), perform a forward pass step-by-step to find the predicted label $\hat{y}$. Show your intermediate calculations by filling in the table below.

    For the sake of simplicity, we'll use a 2-input, 2-hidden neuron, 1-output architecture consisting of:

    1. Input layer: Word embeddings $\mathbf{x}$
    2. Hidden layer: 2 neurons with ReLU activation, where $\text{ReLU}(\mathbf{x}) = \max(0, \mathbf{x})$
    3. Output layer: 1 neuron (binary sentiment classification) with sigmoid activation

    The forward pass equations are: $$\mathbf{h} = \text{ReLU}(\mathbf{W}^{(1)} \mathbf{x} + \mathbf{b}^{(1)})$$ $$z = \mathbf{W}^{(2)} \mathbf{h} + b^{(2)}$$ $$\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}$$ where $\mathbf{x} = [x_1, x_2]^T$ is the 2D input representation, $\hat{y}$ is the predicted label, $\mathbf{W}^{(1)} \in \mathbb{R}^{2 \times 2}$, $\mathbf{b}^{(1)} \in \mathbb{R}^{2}$, $\mathbf{W}^{(2)} \in \mathbb{R}^{1 \times 2}$, and $b^{(2)} \in \mathbb{R}$.

    Network parameters:
    $$\mathbf{W}^{(1)} = \begin{bmatrix} 1.0 & 0.5 \\ -0.5 & 1.0 \end{bmatrix}, \quad \mathbf{b}^{(1)} = \begin{bmatrix} 0.1 \\ -0.2 \end{bmatrix}$$ $$\mathbf{W}^{(2)} = \begin{bmatrix} 0.8 & -0.6 \end{bmatrix}, \quad b^{(2)} = 0.3$$

    You should write your answers by copying and pasting the table below and filling in its cells, or by using bullet points to clearly state the value of each cell.

    Node/Variable | Formula/Computation | Value
    $\mathbf{x}$ | NA |
    $\mathbf{h}$ | |
    $z$ | |
    $\hat{y}$ | |
    1. Compute forward pass values at each node. Show your calculations for each step by filling in the table above
  3. Backpropagation: Using the same weights, bias and input vector as in part (b), first compute the loss value $L$. Then, perform backpropagation to compute gradients $\frac{\partial L}{\partial \mathbf{W}^{(1)}}$, $\frac{\partial L}{\partial \mathbf{b}^{(1)}}$, $\frac{\partial L}{\partial \mathbf{W}^{(2)}}$, and $\frac{\partial L}{\partial b^{(2)}}$. Assume the true label is $y_{\text{true}} = 0$ (negative sentiment) and we're using binary cross-entropy loss: $$L = -[y_{\text{true}} \log(\hat{y}) + (1 - y_{\text{true}}) \log(1 - \hat{y})]$$

    You should write your answers by copying and pasting the table below and filling in its cells, or by using bullet points to clearly state the value of each cell.

    Variable | Formula/Computation | Value
    $L$ | |

    Gradient | Formula (Chain Rule) | Value
    $\frac{\partial L}{\partial \hat{y}}$ | |
    $\frac{\partial L}{\partial z}$ | |
    $\frac{\partial L}{\partial \mathbf{h}}$ | |
    $\frac{\partial L}{\partial \mathbf{W}^{(2)}}$ | |
    $\frac{\partial L}{\partial b^{(2)}}$ | |
    $\frac{\partial L}{\partial \mathbf{W}^{(1)}}$ | |
    $\frac{\partial L}{\partial \mathbf{b}^{(1)}}$ | |
    1. Compute the loss value $L$
    2. Starting from $\frac{\partial L}{\partial \hat{y}}$, compute gradients for each node working backwards and fill out the table given above. For the formula column, you should express each partial in terms of partials above it in the table (recall backpropagation from lecture!)
    3. Feel free to reuse values you already computed in part b
  4. Embedding analysis: Looking at the provided word embeddings, identify patterns in how positive emotion words (amazing, love), negative emotion words (scared, hate, angry), and neutral words (this, about, of) are positioned in the 2D embedding space. You don't have to plot anything for this problem; simply observe the patterns. What does this suggest about how word embeddings capture semantic relationships? Explain why embeddings might be more meaningful than the bag-of-words approach for capturing semantic relationships.
    1. Describe the spatial clustering of different word types (1-2 sentences)
    2. Explain how embeddings capture semantic similarity and its advantage over the bag-of-words approach (1-2 sentences)
Problem 3: Linear Classifier

Now, we will implement the first classifier that can classify tweet sentiments. Recall that in our data, each tweet is labeled with one of three emotions (joy, anger, or fear) in one-hot encoding. For example:

text | one-hot label
i feel very happy and excited since i learned so many things | [1,0,0]
i feel angered and firey | [0,1,0]
i remember feeling acutely distressed for a few days | [0,0,1]
  1. Build vocabulary: Implement build_vocabulary(examples), which builds a vocabulary using words seen in the training examples. Your implementation should populate the Vocabulary class provided in util.py, which will be used in part b to build a sparse feature representation for any input text.
    Implement build_vocabulary(examples) using the Vocabulary class from util.py.
  2. Text to features: Implement text_to_features(text, vocab), which converts a text string into a sparse feature vector. Refer back to Problem 1(a) if you want to revisit the steps for this process.
  3. NumPy softmax: Implement numpy_softmax(logits) using einops and NumPy library functions only.
  4. NumPy cross-entropy loss: Implement numpy_cross_entropy_loss(predictions, targets, epsilon), which returns a floating point number measuring how far the predicted probabilities are from the true one-hot labels, on average.

  5. NumPy gradient computation: Implement numpy_compute_gradients(features, predictions, targets), which returns the gradients for the weights and the bias, respectively. These gradients are later used to update the model's weights and bias during training.

  6. Linear classifier predictor: Implement predict_linear_classifier(features, labels, weights, bias), which makes predictions on input features using the trained weights and bias, and computes the accuracy of those predictions. (A generic sketch of this computation, using made-up shapes and values, appears after this problem's questions.)

    Your implementation should:
    1. Compute logits by multiplying features with weights and adding bias
    2. Apply softmax to convert logits to probabilities
    3. Get predicted class labels by taking the argmax of probabilities
    4. Compare with true labels to compute accuracy
  7. Train linear classifier: Implement train_linear_classifier(...), which integrates all previous components into a complete training loop.

    Your implementation should:
    1. Initialize weights randomly and bias as a 0-vector
    2. For each epoch: compute forward pass, calculate loss, compute gradients, and update parameters correspondingly. At the end of the epoch, print out the training loss and validation accuracy
    3. Use your implemented softmax, cross-entropy loss, gradient, and predictor functions
    4. You should train the model on training data only; do not train the model on validation data.
    Finally, run python submission.py --model linear. The autograder requires that you achieve at least a 0.4 accuracy using default hyperparameters.
  8. If you adjust lr, the learning rate, how do you expect loss and validation accuracy to change during training? Run the terminal command python submission.py --model linear --lr [your_learning_rate] to experiment with three different learning rates. Report what they are and what training behaviors you observed. Why did the learning rate affect loss and accuracy this way?

    For example, you can run: python submission.py --model linear --lr 0.1. If you don't include an --lr argument, the default learning rate is 0.2.
    In 2-3 sentences, report three different lr's you experimented with, and describe how they affected training behaviors. Explain why the learning rate affects loss and accuracy this way.
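
As referenced in item 6 above, here is a minimal NumPy sketch of the four prediction steps, using random stand-in data. The shapes, values, and variable names are hypothetical; your implementation must follow the signatures in submission.py.

    # Generic sketch of predict-style logic: logits -> softmax -> argmax -> accuracy.
    # Everything here (data, shapes, names) is a made-up placeholder.
    import numpy as np

    rng = np.random.default_rng(0)
    num_examples, vocab_size, num_classes = 5, 14, 3
    features = rng.random((num_examples, vocab_size))                         # stand-in feature matrix
    labels = np.eye(num_classes)[rng.integers(0, num_classes, num_examples)]  # one-hot labels
    weights = rng.standard_normal((vocab_size, num_classes))
    bias = np.zeros(num_classes)

    logits = features @ weights + bias                              # 1. linear scores
    shifted = logits - logits.max(axis=1, keepdims=True)            # 2. softmax (numerically stable)
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
    predicted = probs.argmax(axis=1)                                # 3. predicted class ids
    accuracy = float((predicted == labels.argmax(axis=1)).mean())   # 4. compare with true labels
    print(f"accuracy on random data: {accuracy:.2f}")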
Problem 4: Multilayer Perceptron and Embeddings

Linear classification was a simple attempt at detecting tweet sentiments. In class, we learned that word embeddings and neural networks are powerful tools in natural language processing, and we will now implement them.

To build embeddings, we convert each word into a small, dense "embedding" vector (such as a 32-dimensional array of numbers) that captures the word's meaning. To obtain the embedding for a text, we simply average all of its word embeddings to get one fixed-size vector per document: a 500-word document and a 10-word document both become vectors of the same size, making them convenient inputs for your multilayer perceptron classifier while preserving much more semantic information than word counts alone.
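
As a rough illustration of this averaging step (the embedding table, word indices, and dimensions below are made up, and the helper you will implement in this problem takes a text string and a vocabulary rather than pre-computed indices):

    # Generic sketch: look up word embeddings and average them into one vector.
    import torch
    import torch.nn as nn
    from einops import reduce

    embedding_layer = nn.Embedding(num_embeddings=100, embedding_dim=32)  # toy embedding table
    word_indices = torch.tensor([4, 17, 17, 62], dtype=torch.long)        # pretend 4-word text

    word_vectors = embedding_layer(word_indices)                    # shape: (num_words, embedding_dim)
    text_vector = reduce(word_vectors, "words dim -> dim", "mean")  # average across the word dimension
    print(text_vector.shape)                                        # torch.Size([32])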

We will implement a simple 2-layer neural network consisting of:

  1. An input layer that takes averaged embeddings of size embedding_dim
  2. A hidden layer that uses ReLU activation
  3. An output layer that produces raw scores (logits) for each sentiment class
Note that this is different from the architecture described in problem 2.
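
For orientation only, a generic 2-layer network of this shape might look roughly like the sketch below in PyTorch. The layer sizes and the class name are made up, and the MLPClassifier in the starter code may be organized differently.

    # Generic 2-layer MLP sketch: averaged embedding in, raw class scores (logits) out.
    import torch
    import torch.nn as nn

    class TinyMLP(nn.Module):   # hypothetical class, not the starter code's MLPClassifier
        def __init__(self, embedding_dim: int = 32, hidden_dim: int = 16, num_classes: int = 3):
            super().__init__()
            self.hidden = nn.Linear(embedding_dim, hidden_dim)   # input layer -> hidden layer
            self.output = nn.Linear(hidden_dim, num_classes)     # hidden layer -> logits

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            h = torch.relu(self.hidden(x))   # ReLU activation on the hidden layer
            return self.output(h)            # no softmax here; these are raw scores

    batch = torch.randn(8, 32)               # a pretend batch of averaged embeddings
    print(TinyMLP()(batch).shape)            # torch.Size([8, 3])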

  1. Text to average embedding: Implement text_to_average_embedding(text, vocab, embedding_layer), which converts a text string into a single embedding vector by averaging word embeddings. This creates a fixed-size representation for variable-length tweets.

    Your implementation should:
    1. Create a list of indices, where each element is the vocabulary index of the corresponding word in the input text.
    2. Create a PyTorch tensor from the list of indices (hint: use dtype=torch.long).
    3. Pass the tensor of indices through the embedding layer to get word embeddings
    4. Average the embeddings across the word dimension to get a single vector (hint: use einops)
    5. Return the averaged embedding as a tensor
  2. Extract averaged features: Implement extract_averaged_features(texts, vocab, embedding_layer), which processes a list of text strings and converts them all into a single tensor of averaged embeddings. This function builds on your text_to_average_embedding implementation to handle entire datasets at once.

    Your implementation should:
    1. Initialize an empty list to collect the embedding vectors
    2. Loop through each text string in the input list
    3. For each text, call your text_to_average_embedding function to get its averaged embedding
    4. Append each averaged embedding to your collection list
    5. Combine all individual embeddings into a single tensor
    6. Return a tensor of shape (num_texts, embedding_dim) where each row is one text's embedding
    Hint: This function essentially applies your single-text embedding function to every text in a batch, then organizes the results into the matrix format that your MLP classifier expects.
  3. MLP classifier: Complete the MLPClassifier class.
  4. Utility functions: Implement the helper functions tailored to our three-class prediction task.
  5. MLP predictor function: Implement predict_mlp(texts, labels, classifier, embedding_layer, vocab), which evaluates a trained model on new text data.

    Your implementation should:
    1. Set models to evaluation mode (no training)
    2. Extract averaged embeddings for the input texts
    3. Pass embeddings through the classifier to get predictions
    4. Compare predictions with true labels to compute accuracy
  6. Train MLP classifier: Implement train_mlp_classifier(...).

    This is the main training function that integrates everything together. Your implementation should:
    1. Build vocabulary from training texts
    2. Create and initialize embedding layer and MLP classifier
    3. Set up training loop with batch processing
    4. For each epoch: extract features, forward pass, compute loss, backpropagation, update parameters. At the start of each epoch, you should shuffle training data before processing them in batches. At the end of the epoch, print out the training loss and validation accuracy
    5. Use stochastic gradient descent to update all parameters; do not use optimizers (if you don't know what they are, that's perfectly fine!). A short sketch of what a manual parameter update can look like appears after this problem.
    6. You should train the model on training data only; do not train the model on validation data.
    You can test your implementation with this terminal command: python submission.py --model mlp. The autograder requires that you achieve at least a 0.55 accuracy using default hyperparameters.
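
Since item 6 asks you to update parameters without an optimizer, here is a minimal, self-contained sketch of what one manual gradient-descent step can look like in PyTorch. The model, data, loss function, and learning rate are all hypothetical stand-ins; this shows the general pattern, not the training loop you are asked to write.

    # Minimal illustration of a manual SGD step in PyTorch (no optimizer object).
    import torch
    import torch.nn as nn

    model = nn.Linear(32, 3)                 # stand-in for your classifier
    inputs = torch.randn(8, 32)              # pretend batch of averaged embeddings
    labels = torch.randint(0, 3, (8,))       # pretend class indices
    lr = 0.1

    loss = nn.functional.cross_entropy(model(inputs), labels)
    loss.backward()                          # fills in .grad for every parameter

    with torch.no_grad():                    # update parameters without tracking gradients
        for param in model.parameters():
            param -= lr * param.grad         # plain gradient-descent step
            param.grad.zero_()               # clear gradients before the next step

    print(f"loss before the update: {loss.item():.3f}")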

Submission

Submission is done on Gradescope.

Written: When submitting the written parts, make sure to select all the pages that contain part of your answer for that problem, or else you will not get credit. To double check after submission, you can click on each problem link on the right side, and it should show the pages that are selected for that problem.

Programming: After you submit, the autograder will take a few minutes to run. Check back after it runs to make sure that your submission succeeded. If the autograder crashes on your submission, you will receive a 0 on the programming part of the assignment. Note: the only file to be submitted to Gradescope is submission.py.

More details can be found in the Submission section on the course website.