Ensure that you're using Python version 3.12. Check your Python version by running:
python --version
or
python3 --version
We recommend using uv, as it's much faster than pip and conda for managing Python environments and packages. uv is a modern, Rust-based package and project manager for Python. It keeps the familiar pip workflow but re-implements the engine for speed and reliability. Concretely: it creates a venv, resolves and installs dependencies with its own fast installer, and deduplicates files via a global cache (copy-on-write on macOS, hardlinks on Linux/Windows). It can also manage Python versions per project (e.g., pin 3.12) so each assignment uses a clean, reproducible interpreter. Think "pip + virtualenv + pip-tools + pyenv/pipx".
Please refer to the official uv installation documentation for the most up-to-date installation instructions for your platform.
Create and activate a virtual environment with the required dependencies:
# Install uv once (Linux/macOS)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Optional: the `uv` binary goes to `$HOME/.local/bin` by default on Linux/macOS,
# so you may need to add it to your PATH (uv may have done this for you):
export PATH="$HOME/.local/bin:$PATH"
# Install uv once (Windows)
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
# Download the homework zip and unzip into `hw2_sentiment/`
# In your hw directory:
uv init .                   # Initialize project (creates pyproject.toml)
uv python pin 3.12          # Pin Python version
uv add numpy einops torch   # Add dependencies
uv run python grader.py     # Run the local grader
If you cannot run the assignment on your laptop or need additional computing resources, Stanford provides FarmShare, a community computing environment for coursework and unsponsored research. Please follow the instructions at https://docs.farmshare.stanford.edu/ to get started with the computing environment.
We've provided helper code in `util.py`. Have a look around in there before you start coding.
We've created a LaTeX template here for you to use; it contains the prompts for each question.
Social media platforms like Twitter are rich sources of emotional expression. In this homework, you'll build classifiers to detect emotions in tweets using linear classification with cross-entropy loss and softmax activation. Consider the following dataset of 6 tweets, each labeled with exactly one emotion: joy (J), anger (A), or fear (F):
Tweet | Emotion | One-hot encoding |
---|---|---|
"amazing day" | Joy | [1, 0, 0] |
"scared of spiders" | Fear | [0, 0, 1] |
"love this" | Joy | [1, 0, 0] |
"so angry" | Anger | [0, 1, 0] |
"so so worried about tomorrow" | Fear | [0, 0, 1] |
"hate waiting" | Anger | [0, 1, 0] |
Each tweet $x$ is mapped to a feature vector $f(x)$ using a bag-of-words representation (you can think of it as word counts). Machine learning models don't take a string directly as input; instead, as you have learned, they work with tensors (e.g., expressed in NumPy or PyTorch). One way to convert strings to tensors is "bag-of-words" features, which treat the input text as a bag of words and represent it as a vector with the same length as our vocabulary. Each position in the vector corresponds to a word in the vocabulary, and each value is the number of times that word appears in the input text. With a very large vocabulary, we end up with a very sparse vector containing many 0's.
For this problem, let's assume our vocabulary consists of all unique words in the tweets above: {about, amazing, angry, day, hate, love, of, scared, so, spiders, this, tomorrow, waiting, worried}.
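As a small illustration of this featurization (hypothetical helper code, not the graded `build_vocabulary`/`text_to_features` implementations, which must use the `Vocabulary` class from `util.py`):

```python
import numpy as np

# Hypothetical sketch of a bag-of-words feature vector over the 14-word vocabulary above.
vocab = ["about", "amazing", "angry", "day", "hate", "love", "of",
         "scared", "so", "spiders", "this", "tomorrow", "waiting", "worried"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def bag_of_words(text: str) -> np.ndarray:
    """Count how many times each vocabulary word appears in `text`."""
    counts = np.zeros(len(vocab))
    for word in text.lower().split():
        if word in word_to_index:
            counts[word_to_index[word]] += 1
    return counts

# "so" appears twice, so its position in the vector gets the value 2.
print(bag_of_words("so so worried about tomorrow"))
```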
To build our multi-class classifier, we use a weight matrix $\mathbf{W} \in \mathbb{R}^{d \times 3}$ where $d$ is the feature dimension (vocabulary size) and 3 is the number of classes. The model computes logits as $\mathbf{z} = \mathbf{W}^T f(x)$ and applies softmax to convert logits into probability values that sum to 1. Let $\mathbf{p}$ be the 3-dimensional vector of output probabilities, and $p_k$ be the probability that the input belongs to class $k$. The softmax probabilities are computed in the following way: $$p_k = \frac{e^{z_k}}{\sum_{j=1}^{3} e^{z_j}}$$
Then, given the softmax probabilities, the classifier uses argmax to choose the class with the highest $p_k$ as its prediction.
The cross-entropy loss measures the difference between a model's predicted probabilities and the true probability distribution of the data. For a single example (with three output classes), it can be computed as follows: $$\text{Loss}_{\text{CE}}(x, \mathbf{y}, \mathbf{W}) = -\sum_{k=1}^{3} y_k \log p_k$$ where $\mathbf{y}$ is the one-hot encoded true label vector.
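As a rough numerical sketch of these two formulas (this is not the graded `numpy_softmax`/`numpy_cross_entropy_loss` code, and the logits below are made up):

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    """p_k = exp(z_k) / sum_j exp(z_j); subtracting max(z) keeps exp() numerically stable."""
    exps = np.exp(z - np.max(z))
    return exps / np.sum(exps)

def cross_entropy(y: np.ndarray, p: np.ndarray) -> float:
    """Loss_CE = -sum_k y_k * log(p_k), with y a one-hot label vector."""
    return float(-np.sum(y * np.log(p)))

z = np.array([2.0, 0.5, -1.0])   # hypothetical logits z = W^T f(x)
y = np.array([1.0, 0.0, 0.0])    # true label: joy
p = softmax(z)
print(p, p.sum())                # probabilities, which sum to 1
print(cross_entropy(y, p))       # equals -log(p_joy) because y is one-hot
```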
In the previous problem, we used bag-of-words features for sentiment classification. Bag-of-words has significant limitations: for example, it doesn't capture word relationships (e.g., "amazing" and "wonderful" are similar but treated as completely different features), and it creates very sparse, high-dimensional vectors. In this problem, we'll explore how neural networks with word embeddings can address these issues.
We'll use the same set of tweets from Problem 1, but now represent each word with a dense 2-dimensional embedding vector. For simplicity, assume we have the following pre-trained word embeddings:
Word | Embedding |
---|---|
amazing | [0.8, 0.6] |
day | [0.2, 0.1] |
scared | [-0.5, -0.7] |
of | [0.0, 0.0] |
spiders | [-0.3, -0.4] |
love | [0.9, 0.4] |
this | [0.0, 0.0] |
so | [0.1, 0.0] |
angry | [-0.6, -0.8] |
worried | [-0.3, -0.6] |
about | [0.0, -0.1] |
tomorrow | [0.2, -0.2] |
hate | [-0.8, -0.5] |
waiting | [-0.1, -0.3] |
For simplicity, we will focus on binary sentiment classification (positive or negative) instead of three-class classification for this problem only.
For the sake of simplicity, we'll use a 2-input, 2-hidden-neuron, 1-output architecture consisting of:
- an input layer that takes a 2-dimensional vector $\mathbf{x}$,
- a hidden layer with 2 ReLU units, and
- an output layer with a single sigmoid unit.
The forward pass equations are: $$\mathbf{h} = \text{ReLU}(\mathbf{W}^{(1)} \mathbf{x} + \mathbf{b}^{(1)})$$ $$z = \mathbf{W}^{(2)} \mathbf{h} + b^{(2)}$$ $$\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}$$ where $\mathbf{x} = [x_1, x_2]^T$ is the 2D input representation, $\hat{y}$ is the predicted label, $\mathbf{W}^{(1)} \in \mathbb{R}^{2 \times 2}$, $\mathbf{b}^{(1)} \in \mathbb{R}^{2}$, $\mathbf{W}^{(2)} \in \mathbb{R}^{1 \times 2}$, and $b^{(2)} \in \mathbb{R}$.
Network parameters:
$$\mathbf{W}^{(1)} = \begin{bmatrix} 1.0 & 0.5 \\ -0.5 & 1.0 \end{bmatrix}, \quad \mathbf{b}^{(1)} = \begin{bmatrix} 0.1 \\ -0.2 \end{bmatrix}$$
$$\mathbf{W}^{(2)} = \begin{bmatrix} 0.8 & -0.6 \end{bmatrix}, \quad b^{(2)} = 0.3$$
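Below is a minimal sketch of these forward-pass equations with the parameters above; the input `x` is made up for illustration only, so substitute the actual input from the problem when you fill in the table.

```python
import numpy as np

W1 = np.array([[1.0, 0.5],
               [-0.5, 1.0]])
b1 = np.array([0.1, -0.2])
W2 = np.array([[0.8, -0.6]])
b2 = 0.3

def forward(x: np.ndarray):
    """h = ReLU(W1 x + b1), z = W2 h + b2, y_hat = sigmoid(z)."""
    h = np.maximum(0.0, W1 @ x + b1)
    z = (W2 @ h + b2).item()
    y_hat = 1.0 / (1.0 + np.exp(-z))
    return h, z, y_hat

x = np.array([0.5, -0.2])   # hypothetical 2-D input (e.g., a word embedding)
print(forward(x))
```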
You should write your answers by copying and pasting the table below and filling in its cells, or by using bullet points to clearly state the value of each cell.
Node/Variable | Formula/Computation | Value |
---|---|---|
$x$ | NA | |
$\mathbf{h}$ | ||
$z$ | ||
$\hat{y}$ |
You should write your answers by copying and pasting the table below and filling in its cells, or by using bullet points to clearly state the value of each cell.
Variable | Formula/Computation | Value |
---|---|---|
$L$ |
Gradient | Formula (Chain Rule) | Value |
---|---|---|
$\frac{\partial L}{\partial \hat{y}}$ | ||
$\frac{\partial L}{\partial z}$ | ||
$\frac{\partial L}{\partial \mathbf{h}}$ | ||
$\frac{\partial L}{\partial \mathbf{W}^{(2)}}$ | ||
$\frac{\partial L}{\partial b^{(2)}}$ | ||
$\frac{\partial L}{\partial \mathbf{W}^{(1)}}$ | ||
$\frac{\partial L}{\partial \mathbf{b}^{(1)}}$ |
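Optionally, one way to sanity-check your hand-derived gradients is to rebuild the same forward pass in PyTorch and compare against autograd. This sketch assumes a binary cross-entropy loss $L = -[y\log\hat{y} + (1-y)\log(1-\hat{y})]$ and a made-up input and label; if the assignment defines $L$ or the input differently, adjust accordingly.

```python
import torch

# Same parameters as in the problem, wrapped as tensors that track gradients.
W1 = torch.tensor([[1.0, 0.5], [-0.5, 1.0]], requires_grad=True)
b1 = torch.tensor([0.1, -0.2], requires_grad=True)
W2 = torch.tensor([[0.8, -0.6]], requires_grad=True)
b2 = torch.tensor(0.3, requires_grad=True)

x = torch.tensor([0.5, -0.2])   # hypothetical input
y = torch.tensor(1.0)           # hypothetical true label (positive sentiment)

h = torch.relu(W1 @ x + b1)
z = (W2 @ h + b2).squeeze()
y_hat = torch.sigmoid(z)
loss = -(y * torch.log(y_hat) + (1 - y) * torch.log(1 - y_hat))  # assumed binary CE loss

loss.backward()
# Compare these with your chain-rule results for the same input, label, and loss.
print(W1.grad, b1.grad, W2.grad, b2.grad)
```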
Now, we will implement the first classifier that can classify tweet sentiments. Recall that in our data, each tweet is labeled with one of three emotions (joy, anger, or fear) in one-hot encoding. For example:
text | one-hot label |
---|---|
i feel very happy and excited since i learned so many things | [1,0,0] |
i feel angered and firey | [0,1,0] |
i remember feeling acutely distressed for a few days | [0,0,1] |
Implement `build_vocabulary(examples)`, which builds a vocabulary using words seen in the training examples. Your implementation should populate the `Vocabulary` class provided in `util.py`, which will be used in part b to build a sparse feature representation for any input text.
Implement `text_to_features(text, vocab)`, which converts a text string into a sparse feature vector. Refer back to part 1a if you want to revisit the steps for this process.
Implement `numpy_softmax(logits)` using einops and NumPy library functions only.
Implement `numpy_cross_entropy_loss(predictions, targets, epsilon)`, which returns a floating-point number measuring how far the predicted probabilities are from the true one-hot labels, on average.
Implement `numpy_compute_gradients(features, predictions, targets)`, which returns the gradients for the weights and the bias, respectively. These are later used to update the model weights via backpropagation.
Implement `predict_linear_classifier(features, labels, weights, bias)`, which makes predictions on input features using the trained weights and bias, and computes the accuracy of those predictions.
Implement `train_linear_classifier(...)`, which integrates all previous components into a complete training loop. You can test it by running `python submission.py --model linear`. The autograder requires that you achieve at least a 0.4 accuracy using the default hyperparameters.
As you change `lr`, the learning rate, how do you expect loss and validation accuracy to change during training? Run the terminal command `python submission.py --model linear --lr [your_learning_rate]` to experiment with three different learning rates. Report what they are and what training behaviors you observed. Why did the learning rate affect loss and accuracy this way?

For example: `python submission.py --model linear --lr 0.1`. If you don't include an `--lr` argument, the default learning rate is 0.2.

List the three `lr`'s you experimented with and describe how they affected training behaviors. Explain why the learning rate affects loss and accuracy this way.
Linear classification was a simple attempt at detecting tweet sentiments. In class, we learned that word embeddings and neural networks are powerful tools in natural language processing, and we will now implement them.
To build embeddings, we convert each word into a small, dense "embedding" vector (such as a 32-dimensional array of numbers) that captures the word's meaning. To obtain the embedding for a text, we simply average all of its word embeddings to get one fixed-size vector per document: a 500-word document and a 10-word document both become vectors of the same size, making them convenient inputs for your multilayer perceptron classifier while preserving much more semantic information than word counts alone.
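As a hedged sketch of this averaging step (the tiny vocabulary, `word_to_index` mapping, and randomly initialized `nn.Embedding` below are illustrative only; the graded `text_to_average_embedding` must follow the signature in the starter code and use the `Vocabulary` class):

```python
import torch
import torch.nn as nn

# Illustrative only: a tiny vocabulary and a randomly initialized 32-dimensional embedding table.
word_to_index = {"love": 0, "this": 1, "so": 2, "angry": 3}
embedding_layer = nn.Embedding(num_embeddings=len(word_to_index), embedding_dim=32)

def average_embedding(text: str) -> torch.Tensor:
    """Look up each known word's embedding and average them into one fixed-size vector."""
    indices = [word_to_index[w] for w in text.lower().split() if w in word_to_index]
    index_tensor = torch.tensor(indices, dtype=torch.long)  # embedding lookups require long indices
    vectors = embedding_layer(index_tensor)                 # shape: (num_words, embedding_dim)
    return vectors.mean(dim=0)                              # shape: (embedding_dim,)

print(average_embedding("love this").shape)  # torch.Size([32]), regardless of text length
```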
We will implement a simple 2-layer neural network (an MLP classifier) that maps the averaged embedding of each tweet to the three emotion classes.
Implement `text_to_average_embedding(text, vocab, embedding_layer)`, which converts a text string into a single embedding vector by averaging word embeddings. This creates a fixed-size representation for variable-length tweets. (Note that the embedding lookup expects word indices as a tensor with `dtype=torch.long`; you may find `einops` useful here.)
Implement `extract_averaged_features(texts, vocab, embedding_layer)`, which processes a list of text strings and converts them all into a single tensor of averaged embeddings. This function builds on your `text_to_average_embedding` implementation to handle entire datasets at once: call your `text_to_average_embedding` function on each text to get its averaged embedding, and stack the results into a tensor of shape `(num_texts, embedding_dim)` where each row is one text's embedding.

Implement the `MLPClassifier` class, together with the following functions:
- `torch_softmax(logits)`: converts logits to probabilities.
- `torch_cross_entropy_loss(predictions, targets)`: a custom function that computes the average cross-entropy loss over each training batch.
- `update_parameter(param, grad, lr)`: a manual gradient descent update.

Implement `predict_mlp(texts, labels, classifier, embedding_layer, vocab)`, which evaluates a trained model on new text data.
Implement `train_mlp_classifier(...)`. You can test your implementation by running `python submission.py --model mlp`. The autograder requires that you achieve at least a 0.55 accuracy using the default hyperparameters.
Submission is done on Gradescope.
Written: When submitting the written parts, make sure to select all the pages
that contain part of your answer for that problem, or else you will not get credit.
To double check after submission, you can click on each problem link on the right side, and it should show
the pages that are selected for that problem.
Programming: After you submit, the autograder will take a few minutes to run. Check back after it runs to make sure that your submission succeeded. If your code crashes the autograder, you will receive a 0 on the programming part of the assignment. Note: the only file to be submitted to Gradescope is `submission.py`.
More details can be found in the Submission section on the course website.