CS221 Final Project Guidelines
In the final project, you will work in groups of up to four to apply the techniques that you've learned in
CS221 to a new setting that you're interested in. Note that regardless of the group size,
all groups must submit the work detailed in each milestone
and will be graded on the same criteria. Additionally, we expect each team to submit a completed project. We
encourage teams of 3-4 students. All projects require students to spend time gathering data and setting up the
infrastructure to reach an end result, and thus 3 or 4 person teams can share these tasks much better. This allows
the team to focus more on the interesting results and discussion in the project. Each member of the team should
contribute in both technical and non-technical components of the project.
You will build a system to solve a well-defined task.
Which task you choose is completely open-ended, but the methods you
use should draw on the ones from the course.
If you are interested in doing the class project, submit the project interest form any time before Oct 7.
We encourage all students who might be interested in a project to fill this form out. There is no commitment (you
can fill it out and then not follow through with doing a project). You can also refer to this Ed Post to find
project group members.
After you submit the form, you will be assigned one of the CAs as your
official mentor. They will grade all your work and really get to know your project.
You are required to have a 15-minute check-in meeting with your mentor in the week before or after the progress
report deadline, and are encouraged to drop by your mentor's office hours often to discuss your project.
Note that it will take several iterations to find the right project, so be
patient; this exploration is an essential part of research, so learn from it.
Have fun and don't wait until the last minute!
Throughout the quarter, there will be several milestones so that you can get
adequate feedback on the project. While the project is ungraded (except for potential extra credit), we can only
give feedback on milestones that are submitted on time.
- Proposal (2 pages max). The proposal is due Oct 21. It should include the following items.
- Problem Statement and Task Definition: What does your system do (what is its input and output)?
What real-world problem does this system try to solve? Make sure that the scope of the project is
not too narrow or broad. For example, building a system to answer any natural language question is too
broad, whereas answering short factoid questions about movies is more reasonable.
- Input/Output Behavior with concrete examples of both the inputs and outputs (explain what the
input and output variables should look like and how they interact with the system). You should collect
some preliminary data that you can use in your description of the input and output behavior. Specifically
reference what data you are using.
- An evaluation metric. How will you measure the success of your system? Why does this metric work
best for this problem? For this, you need to obtain a reasonably sized dataset of example
input-output pairs, either from existing sources, or collecting one from scratch. A natural evaluation
metric is accuracy, but depending on your task it could also be memory usage or running time.
- Related work. Search the Internet for similar projects and mention the related research and
projects.
- Baseline and Oracle. Before developing your primary approach, you should implement
baselines and oracles (See more about these in the FAQ at the bottom of the page). These are
really important as they give you intuition for how easy or hard the problem you're solving is.
Intuitively, baselines give lower bounds on the performance you will obtain and oracles give upper bounds.
If this gap is too small, then you probably don't have a good task. Importantly, baselines and oracles
should be relatively easy to implement and can be done before you invest a lot of time in implementing a
fancier approach. In this part of the proposal, you should describe what your baseline and oracle are.
While you do not need to implement them for the proposal, it is recommended that you have started
implementing them by this point.
- Methodology: How will you approach the problem? Why does this method fit with your problem?
Identify the challenges of building the system and the phenomena in the data that you're trying to
capture. How should you model the task (e.g., using search, machine learning, logic, etc.)? There will be
many ways to do this, but you should pick one or two and determine how the methods address the challenges
as well as any pros and cons. What algorithms are appropriate for handling the models that you came up
with, and what are the tradeoffs between accuracy and efficiency? Are there any implementation choices
specific to your problem?
- Description of the challenges. What are the challenges? Which topics (e.g., search, MDPs, etc.)
might be able to address those challenges (at a high-level, since we haven't covered any techniques in
detail at this point)?
You should have the majority of the infrastructure (e.g., building a simulator,
cleaning data) completed by now so that you can do something interesting. For machine learning tasks, setting up the
infrastructure involves collecting
data (either by scraping, using crowdsourcing, or hand labeling). For
game-based tasks, this involves building the game engine/simulator.
While infrastructure is necessary, try not to spend too much time on it.
You can sometimes take existing datasets or modify existing simulators to save time,
but if you want to solve a task you care about, this is not always an option.
Note that if you download existing datasets which are already preprocessed (e.g., Kaggle),
then you will be expected to do more with the project.
Note that you can still adjust your project topic after submitting the proposal,
but your progress report (described next) should be on the same topic as your final report.
- Progress report (4 pages max). The progress report is due on Nov 14. It should include the
following items.
- Introduction - Brief overview of your problem.
- Literature Review - Description of other work/papers you've found that are related to your task.
Just mentioning a paper is not sufficient; you should at least go into brief detail about what kind of
approach they are using and how it relates to your work if it's not immediately clear. When looking at
relevant literature, examine whether there have been other attempts to build a similar system. Compare and
contrast your approach with existing work, citing the relevant papers. The comparison should be more than
just high-level descriptions. You should try to fit your work and other work into the same framework.
Are the two approaches complementary, orthogonal, or contradictory?
- Dataset - Description of data you are using - size of dataset, distribution of classes, any
preprocessing you needed to do.
- Baseline - Description and implementation of your baseline. Please provide a detailed description
of your implemented baseline along with an evaluation of the baseline using the metrics you define.
- Main approach - Propose a model and an algorithm for tackling your task. You should describe the
model and algorithm in detail and use a concrete example to demonstrate how they work.
Don't describe methods in general; describe precisely how they apply to your problem: what are the
inputs/outputs, variables, factors, states, etc.?
- Evaluation Metric - Please include what metrics, both qualitative and quantitative, you are using
to evaluate the success of your system. If relevant, please include equations to describe your metrics.
- Results & Analysis - At this point, you should have fully implemented your baseline and also have
a basic working implementation of your main approach. Please include the performance of your baseline as
well as the performance of your main approach so far. Include an analysis of your results, and how this
might inform your next steps in fine-tuning your main approach.
- Future Work - This is not mandatory, but it might be helpful for your mentor to get an idea of
what next steps you plan to take after the milestone.
- References - Please include a reference section with properly formatted citations. (Not included
in the 4-page limit.)
- Progress check-in with CA mentor (15-minute meeting).
The check-in should be completed by Nov 14. To provide more mentorship for the final project, we introduce a progress check-in between project groups and their assigned mentor.
In the week before or after the progress report deadline, you will have a 15-minute meeting with your mentor to discuss the progress
of your project, clear any confusion, and get advice. Your assigned CA mentor will reach out to give information on how to schedule
the meeting for check-in.
- Final Video (5 minutes): The final video is due on Dec 3.
It is a 5-minute video presentation of your project.
In the video, you should briefly describe the motivation, problem definition,
challenges, approaches, results, and analysis. You should include diagrams, figures and charts to illustrate
the highlights of your work. If possible, try to come up with creative visualizations of your project. These
could include system diagrams, more detailed examples of data that do not fit in the space of your report,
or live demonstrations for end-to-end systems.
The goal of the video is to convey the important high-level ideas and give intuition
rather than be a super-detailed specification of everything you did.
Use lots of diagrams and concrete examples, and avoid slides that are too wordy or have extremely complex
equations.
- Final Report (5-10 pages max): The final report is due on Dec 3.
Your final report should be a comprehensive account of your project. This final report structure is very
similar to the progress report, except we would like to see some new results, experimentation and/or
analysis, and a Future Work section.
Below is a full description of what you should include in your project final report.
- Introduction - Brief overview of your problem. Why might this problem be important?
- Literature Review - Description of other work/papers you've found that are related to your task.
Just mentioning a paper is not sufficient; you should at least go into brief detail about what kind of
approach they are using and how it relates to your work if it's not immediately clear. Please also mention
how your work relates to or differs from these related works.
- Dataset - Description of data you are using - size of dataset, distribution of classes, any
preprocessing you needed to do.
- Baseline - Description and implementation of your baseline. For this report, you don't need to
go into too much detail, but please still include some details.
- Main Approach - Propose a model and an algorithm for tackling your task. You should describe the
model and algorithm in detail and use a concrete example to demonstrate how they work.
Don't describe methods in general; describe precisely how they apply to your problem: what are the
inputs/outputs, variables, factors, states, etc.?
- Evaluation Metric - Please include what metrics, both qualitative and quantitative, you are using
to evaluate the success of your system. If relevant, please include equations to describe your metrics.
- Results & Analysis - At this point, you should have expanded on your approach from the progress
report. Please include the performance of your baseline as well as the performance of your main approach
so far and any experiments that you have run. Also, include an analysis of your results, and how this
might inform your next steps in fine-tuning your main approach. The analysis is very important, and it
requires you to think about what your results might mean.
- Error Analysis - Describe a few experiments that you ran that show the properties (both pros and
cons) of your system. Analyze the data and show either graphs or tables to illustrate your point. What's
the take-away message? Were there any surprises? Use these experiments in the error analysis to describe
potential errors in the method and why they may have occurred.
- Future Work - We are requiring this section this time. This section can be short, but please
include some ideas about how you could improve your model if you had more time. This can also include
any challenges you're running into and how you might fix them.
- Ethical Considerations - Provide a 1-2 paragraph statement outlining at least one ethical issue or
societal risk specific to your project, with an explanation of what in particular connects your project to
the ethical issue(s) or societal risk(s) raised. Subsequently, you also need to explain at least 1 possible
mitigation strategy for each of those issues (e.g. technical modifications, policy changes, or specific model
deployment measures). Note that you are not required to implement these mitigation strategies in your final project.
- Code - Please include a link to your GitHub/Bitbucket/etc. For private repos, make sure to communicate this with your mentor and get their GitHub ID to
add to your repo. If you choose to upload a zip, you can choose a subset of the data to upload so that
its size won't be too large.
- References - Please include a reference section with properly formatted citations (any format
of your choice).
Note: you can have an appendix for each of the assignments beyond the maximum number of allowed
pages with any figures, plots, or examples that you need. References do not count toward the page limit.
Submit the milestones on Gradescope and make sure all group members
are added to the submission.
All milestones are due at
11:59pm.
For each milestone, you should submit proposal.pdf, progress.pdf, or final.pdf containing a PDF of
your writeup.
For the Final Project Report, be sure to include a link to your code (uploaded to Github/Bitbucket/Google Drive/etc.) and data in the writeup. Any language is fine;
it does not have to run out-of-the-box. You should also include a README.md file within the repo/zip documenting
what everything is and what commands you ran.
We will give feedback on the following dimensions:
- Task definition: is the task precisely defined and is the motivation for the task clear?
In other words, does the world somehow become a better place if your project were successful?
We will reward projects that are extra thoughtful about how to use AI for social impact.
- Approach: was a baseline, an oracle, and an advanced method described clearly, well justified, and
tested?
- Data and experiments: have you explained the data clearly, performed systematic
experiments, and reported concrete results?
- Analysis: did you interpret the results and try to explain why things
worked (or didn't work) the way they did? Do you show concrete examples?
Of course, the experiments may not always be successful.
Getting negative results is normal,
and as long as you make a reasonably well-motivated attempt and explain why the
results came out negative, you will get credit.
Here is a suggested way to approach the final project, illustrated with a running example.
- Pick a topic that you're passionate about (e.g., food, language, energy,
politics, sports, card games, robotics).
As a running example, say we're interested in how people read the news to get their information.
- Brainstorm to find some tasks on that topic: ask "wouldn't it be nice to
have a system that does X?" or "wouldn't it be nice to understand X?"
A good task should not be too easy (sorting a list of numbers) and not too hard
(building a system that can automatically solve CS221 homeworks).
Please come to office hours for feedback on finding the right balance.
Let's focus on recommending news to people.
- Define the task you're trying to solve clearly and convince yourself (and a few friends) that it's
important/interesting.
Also state your evaluation metric – how will you know if you have succeeded or not?
Concentrate on a small set of popular news sites: nytimes.com, slashdot.org, sfgate.com, onion.com, etc.
For each user and each day, assume we have acquired a set of articles that the user is interested in
reading (training data).
Our task is to predict for a new day, given the full set of articles, the best subset to show the user;
evaluation metric would be prediction accuracy.
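For the running example, the accuracy metric could be made concrete as the fraction of include/exclude decisions the system gets right. The sketch below is our own illustration, not part of the assignment; the function name and toy data are made up:

```python
def subset_accuracy(predicted, actual, all_articles):
    """Fraction of articles whose include/exclude decision matches the user's set."""
    correct = sum(
        (article in predicted) == (article in actual)
        for article in all_articles
    )
    return correct / len(all_articles)

# Toy example: out of 4 candidate articles, the system recommends {a1, a2}
# while the user actually read {a1, a3}: a1 and a4 are decided correctly.
articles = ["a1", "a2", "a3", "a4"]
print(subset_accuracy({"a1", "a2"}, {"a1", "a3"}, articles))  # 0.5
```

Other set-based metrics (precision/recall over the recommended subset) would work just as well; the point is to commit to one measurable definition of success up front.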
- Gather and clean the necessary data (this might involve scraping websites, filtering outliers, etc.).
This step can often take an annoyingly large amount of time if you're not careful, so try not to get
bogged down here.
Simplify the task or focus on a subset of the data if necessary.
You might find yourself adjusting the task you're trying to solve based on new empirical insights you get by
looking at the data.
Notice that even if you're not doing machine learning, it's necessary to have data for evaluation purposes.
Write some scripts that download the RSS feeds from the news sites, run
some basic NLP processing (e.g., tokenization), say, using NLTK.
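The feed-processing step above can be sketched with the standard library alone. Everything here is an illustrative assumption: the helper names, the inline sample feed, and the crude regex tokenizer (NLTK's `word_tokenize` would be a more robust drop-in replacement); fetching the live feed, e.g. via `urllib.request`, is omitted:

```python
import re
import xml.etree.ElementTree as ET

def parse_rss_titles(rss_xml):
    """Extract article titles from a standard RSS 2.0 <channel><item><title> layout."""
    root = ET.fromstring(rss_xml)
    return [item.findtext("title", default="") for item in root.iter("item")]

def tokenize(text):
    """Crude lowercase tokenizer: words and punctuation as separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

# A tiny hand-written feed standing in for a downloaded one.
sample_feed = """<rss version="2.0"><channel>
  <item><title>Markets rally on jobs report</title></item>
  <item><title>New telescope spots distant galaxy</title></item>
</channel></rss>"""

titles = parse_rss_titles(sample_feed)
tokens = [tokenize(t) for t in titles]
print(tokens[0])  # ['markets', 'rally', 'on', 'jobs', 'report']
```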
- Implement a baseline algorithm. For a classification task, this would be
always predicting the most common label. If your baseline's performance is already high, then
your task is probably too easy.
One baseline is to always produce the first document from each news site.
Also implement an oracle, for example, recommending the document based on
the number of comments. This is an oracle because you wouldn't have the number
of comments at the time you actually wanted to recommend the article!
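Both ideas above fit in a few lines. This is a minimal sketch under our own assumptions (function names, label scheme, and toy data are hypothetical, not from the course):

```python
from collections import Counter

def majority_baseline(train_labels, test_labels):
    """Baseline: always predict the most common training label; return test accuracy."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return sum(label == majority for label in test_labels) / len(test_labels)

def comment_count_oracle(articles, liked_ids, k=1):
    """Oracle: rank by comment counts (unavailable at prediction time) and
    return the fraction of the top-k articles the user actually liked."""
    top = sorted(articles, key=lambda a: a["comments"], reverse=True)[:k]
    return sum(a["id"] in liked_ids for a in top) / k

# Toy data: did the user read ("read") or skip ("skip") each article?
train = ["skip", "skip", "read", "skip"]
test = ["skip", "read", "skip"]
print(majority_baseline(train, test))  # majority is "skip" -> 2/3 accuracy
```

If the baseline and oracle numbers come out close together, that gap is telling you the task as defined leaves little room for a more sophisticated method.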
- Formulate a model and implement the algorithm for that model. You should
try several variants and compare them.
Remember to try as much as possible to separate model (what you
want to compute) from algorithms (how you do it).
You might train a classifier to predict, for each news article, whether to include it or not.
You might try to include these predictions as factors in a weighted CSP and
try to find a set of articles that balance diversity and relevance.
- Perhaps the most important part of the project is the final step, which
is to analyze the results. It's more important that you do a thorough
analysis and interpret your results rather than implement a huge number of
complicated heuristics in trying to eke out the maximum performance.
The analysis should begin with basic facts, e.g., how much time/memory did
the algorithm take, how does the accuracy vary with the amount of training
data? What are the instances that your system does the worst on? Give concrete examples
and try to understand why. Is there a bottleneck? Is it due to lack of training data?
You are free to use existing datasets, but these might not necessarily be the
best match for your problem, in which case you are probably better off making
your own dataset.
- Kaggle is a website that runs
machine learning prediction competitions with monetary rewards.
- Past CS229 projects:
examples of machine learning projects that you can look at for inspiration.
Of course your project doesn't have to use machine learning – it can draw from other areas of AI.
- SAT competition: satisfiability problems are an
important special class of CSPs.
- Natural language processing datasets:
links to many NLP datasets for different languages.
- OpenAI Gym: environments for reinforcement-learning-related projects.
You are free to use existing tools for parts of your project as long as you're
clear what you used. When you use existing tools, the expectation is that you will do
more on other dimensions.
- Predict airline ticket prices given day, time, location, etc.
- Predict the amount of electricity consumed over the course of a day.
- Predict whether the phone should be switched off / silenced based on sensor readings from your
smartphone.
- Auto-complete code when you're programming.
- Answer natural language questions for a restricted domain (e.g., movies, sports).
- Search for a mathematical theorem based on an expression which normalizes over variable names.
- Find the optimal way to get from one place on Stanford campus to another
place, taking into account uncertain travel times due to traffic.
- Solve Sudoku puzzles or crossword puzzles.
- Build an engine to play Go, chess, 2048, poker, etc.
- Break substitution codes based on knowledge of English.
- Automatically generate the harmonization of a melody.
- Generate poetry on a given topic.
You can also get inspiration from previous years' CS221 projects (student access only).
Can I use the same project for CS221 and another class (CS229, etc.)?
The short answer is that you cannot turn in the identical project for both
classes, but you can share common infrastructure across the two classes.
First, you should make sure that you follow the guidelines for the CS221
project, which are likely different from those of other classes.
Second, if any part of the project is done for a purpose outside CS221 (for the
final project in CS229 or other classes, or even for your own research), then
in the progress and final reports, you must clearly indicate which part of the
project was done for CS221 and which part was not.
For example, if you're taking CS229, then you cannot turn in the same pure
machine learning project for CS221. But you can work on the same broad problem
(e.g., news recommendation) for both classes and share the same dataset /
generic wrapper code. You should then explore the machine learning aspect of
the problem for CS229 (e.g., classifying news relevance) and another topic for
CS221 (e.g., optimizing diversity across news articles using search or CSPs).
Are there restrictions on who I can partner up with for the final project?
The only hard requirement is that each member of your group must be enrolled
in CS221. Thus, if you choose to use the same project for CS221 and another class,
all of your partners must be in CS221. If you feel like you have a compelling case for an exception,
please submit a request on Ed detailing the parts of the
project used for each class and the reasons for deviating from the project policies.
How do you choose a good baseline and oracle?
Baselines are simple algorithms, which might include using a small set of hand-crafted rules,
training a simple classifier, etc. (Note that baselines are extremely simple, but you might be
surprised at how effective they are.) While a predictor that guesses randomly provides a lower bound
(and can be reported in the paper), it is too simple and doesn't give you much information.
Predicting the majority label is a slightly less trivial baseline, and whether it's acceptable
depends on how insightful it is. For classification, if the different labels have very different
proportions, then it could be useful; otherwise it won't be. You are encouraged to have multiple
baselines. Please note that we expect an implementation of the baseline for the project progress
report (not project proposal).
Oracles are algorithms that "cheat" and look at the correct answer or involve humans. For
human-like classification problems (e.g., sentiment classification), you can have each member of
your project team annotate ~50 examples and measure the agreement rate. Note that some
tasks are subjective, so even though humans are providing ground truth labels, human accuracy will
not be 100%. When the classification problem is not human-like, you can try to use the training
error of an expressive classifier (e.g., nearest neighbors) as a proxy for oracle error. The idea is
that if you can't even fit the training data using a very expressive classifier, there is probably a
lot of noise in your dataset, and you have a slim chance of building any classifier that does well
on test. While returning 100% is an upper bound, it is not a valid oracle since it is vacuously an
upper bound. Sometimes, oracles might be difficult to come by. If you think that no good oracles
exist, explain why. Please note that we do not expect an implementation of the oracle at any point
during your final project. However, if the implementation is very easy, you are free to implement the
oracle.
Both baselines and oracles should be simple and not take much time. The point is not to do something
fancy, but to work with the data / problem that you have in a substantive way and learn something
from it. Here are some examples of baselines:
- Manually hand-code a few simple rules
- Train a classifier with some simple features
Guessing completely at random is technically a baseline, but is a really bad one because it doesn't
really tell you much about how easy the problem is.
Here are some examples of oracles:
- Manually have each member of your team hand label a handful of examples. The oracle is the
agreement rate between people.
- Measure the training accuracy using reasonable features. The idea is that if you can't even
fit your training data very well, then your test error is probably going to be even worse.
- Predict the label given information that you wouldn't normally have at test time (for example,
predicting the future given other information from the future).
Always guessing the correct label is technically an oracle, but it's a really bad one, because you'd
always get 100% and you don't learn from it.
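The agreement-rate oracle described above can be computed as mean pairwise agreement between annotators. This sketch uses made-up annotator names and labels purely for illustration:

```python
from itertools import combinations

def agreement_rate(annotations):
    """Mean pairwise agreement. `annotations` maps each annotator to their
    list of labels, with examples in the same order for everyone."""
    rates = [
        sum(x == y for x, y in zip(a, b)) / len(a)
        for a, b in combinations(annotations.values(), 2)
    ]
    return sum(rates) / len(rates)

# Three hypothetical team members each label the same four examples.
labels = {
    "alice": ["pos", "neg", "pos", "pos"],
    "bob":   ["pos", "neg", "neg", "pos"],
    "carol": ["pos", "pos", "pos", "pos"],
}
print(agreement_rate(labels))  # pairwise rates 0.75, 0.75, 0.5 -> mean 2/3
```

An agreement rate well below 100% on a subjective task is itself informative: it caps how well any system can be expected to score against those labels.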
Overall, the main point of baselines and oracles is to get you to look at the problem
carefully and think about what's possible.
The accuracy of a state-of-the-art system on your dataset
could serve as either a baseline or an oracle.
Sometimes, there are data points that are neither baselines nor oracles:
for example, in a two-component system, you might use an oracle for one component and a baseline for the other.