Ensure that you're using Python version 3.12. Check your Python version by running:
python --version
or
python3 --version
To install Miniconda:
On Windows, run the downloaded .exe file to start the installation.
On Linux, run the following commands on the downloaded installer script:
chmod +x Miniconda3-latest-Linux-x86_64.sh
./Miniconda3-latest-Linux-x86_64.sh
On macOS, open the downloaded .pkg file to start the installation.
After installing Miniconda, set up your environment with the following commands:
conda create --name hw4 python=3.12
conda activate hw4
Markov decision processes (MDPs) can be used to model situations with uncertainty (which is most of the time in the real world). In this assignment, you will implement algorithms to find the optimal policy, both when you know the transitions and rewards (value iteration) and when you don't (reinforcement learning). You will use these algorithms on Mountain Car, a classic control environment where the goal is to control a car to go up a mountain.
In this problem, you will perform value iteration updates manually on a basic game to build your intuitions about solving MDPs. The set of possible states in this game is $\text{States} = \{-2, -1, 0, +1, +2\}$ and the set of possible actions is $\text{Actions}(s) = \{a_1, a_2\}$ for all states that are not end states. The starting state is $0$ and there are two end states, $-2$ and $+2$. Recall that the transition function $T: \text{States} \times \text{Actions} \rightarrow \Delta(\text{States})$ encodes the probability of transitioning to a next state $s'$ after being in state $s$ and taking action $a$ as $T(s'|s,a)$. In this MDP, the transition dynamics are given as follows:
$\forall i \in \{-1, 0, 1\} \subseteq \text{States}$,
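For reference, the value iteration update from lecture, which you will apply manually in this problem, is: $$V^{(t)}(s) = \max_{a \in \text{Actions}(s)} \sum_{s'} T(s' \mid s, a)\left[\text{Reward}(s, a, s') + \gamma V^{(t-1)}(s')\right],$$ with $V^{(0)}(s) = 0$ for all states and $V^{(t)}(s) = 0$ for the end states.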
In computer science, the idea of a reduction is very powerful: say you can solve problems of type A, and you want to solve problems of type B. If you can convert (reduce) any problem of type B to type A, then you automatically have a way of solving problems of type B potentially without doing much work! We saw an example of this for search problems when we reduce A* to UCS. Now let's do it for solving MDPs.
Let us define a new MDP with states $\text{States}' = \text{States} \cup \{ o \}$, where $o$ is a new state. Let's use the same actions ($\text{Actions}'(s) = \text{Actions}(s)$), but fix the new discount to $\gamma' = 1$. Your job is to define transition probabilities $T'(s' | s, a)$ and rewards $\text{Reward}'(s, a, s')$ for the new MDP in terms of the original MDP such that the optimal values $V_\text{opt}(s)$ for all $s \in \text{States}$ are equal under the original MDP and the new MDP.
Hint: If you're not sure how to approach this problem, go back to the slide notes from the MDP lecture and closely read the slides on convergence toward the end of the deck.

Now that we have gotten a bit of practice with general-purpose MDP algorithms, let's use them for some control problems. Mountain Car is a classic example in robot control [7] where you try to get a car to the goal located on top of a steep hill by accelerating left or right. We will use the implementation provided by The Farama Foundation's Gymnasium, formerly OpenAI Gym.
The state of the environment is provided as a pair (position, velocity). The starting position is randomized within a small range at the bottom of the hill. At each step, the actions are either to accelerate to the left, to the right, or do nothing, and the transitions are determined directly from the physics of a frictionless car on a hill. Every step produces a small negative reward, and reaching the goal provides a large positive reward.
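To make the interface concrete, here is a minimal sketch of interacting with the environment directly through Gymnasium's standard API (illustrative only; the assignment code wraps the environment in the MDP classes described below):

```python
import gymnasium as gym

# Illustrative only: the assignment interacts with Mountain Car through its own
# MDP wrappers, but the underlying environment is standard Gymnasium.
env = gym.make("MountainCar-v0")

observation, info = env.reset(seed=0)    # observation is [position, velocity]
total_reward = 0.0
for _ in range(200):
    action = env.action_space.sample()   # 0: accelerate left, 1: do nothing, 2: accelerate right
    observation, reward, terminated, truncated, info = env.step(action)
    total_reward += reward               # each step yields a small negative reward
    if terminated or truncated:
        break
env.close()
print(observation, total_reward)
```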
First, install all the dependencies needed to get Mountain Car running in your environment. Note that we are using Python 3.12 for this assignment. If you are getting "module not found" errors after downloading all the requirements, try using `pip` instead of `pip3` and `python` instead of `python3`.
pip3 install -r requirements.txt
Then, to get a feel for the environment, test with an untrained agent that takes a random action at each step and see how it performs.
python3 mountaincar.py --agent naive
You will see the agent struggle, unable to complete the task within the time limit.
In this assignment, you will train this agent with different reinforcement learning algorithms so that it can learn to climb the hill.
As the first step, we have designed two MDPs for this task. The first uses the car's continuous (position, velocity) state as is, and the second discretizes the position and velocity into bins and uses indicator vectors. Carefully examine the ContinuousGymMDP and DiscreteGymMDP classes in util.py and make sure you understand them.
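For intuition, mapping a continuous state to bin indices might look roughly like the following hypothetical helper (not the actual util.py implementation; the bin count here is made up):

```python
import numpy as np

def discretize(state, low, high, bins):
    """Map a continuous state to a tuple of bin indices (hypothetical helper)."""
    # Fraction of the way through each dimension's range, clipped to [0, 1).
    frac = np.clip((np.asarray(state) - low) / (high - low), 0, 1 - 1e-9)
    return tuple(int(v * bins) for v in frac)

# Example with Mountain Car's observation bounds (position in [-1.2, 0.6],
# velocity in [-0.07, 0.07]) and 10 bins per dimension.
low, high = np.array([-1.2, -0.07]), np.array([0.6, 0.07])
print(discretize([-0.5, 0.02], low, high, bins=10))  # -> (3, 6)
```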
If we want to apply value iteration to the DiscreteGymMDP (think about why we can't apply it to ContinuousGymMDP), we require the transition probabilities $T(s, a, s')$ and rewards $R(s, a, s')$ to be known. But oftentimes in the real world, $T$ and $R$ are unknown, and the gym environments are set up in this way as well, only interfacing through the .step() function. One method to still determine the optimal policy is model-based value iteration, which runs Monte Carlo simulations to estimate $\hat{T}$ and $\hat{R}$, and then runs value iteration. This is an example of model-based RL.

Examine RLAlgorithm in util.py to understand the getAction and incorporateFeedback interface, and peek into the simulate function to see how they are called repeatedly when training over episodes.
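The estimation step behind $\hat{T}$ and $\hat{R}$ is essentially counting over observed transitions. A minimal sketch of the idea (with hypothetical names, not the required ModelBasedMonteCarlo code):

```python
from collections import defaultdict

# Hypothetical accumulators, keyed by (state, action).
transition_counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': count}
reward_sums = defaultdict(float)                           # (s, a, s') -> summed reward

def record(s, a, r, s_next):
    """Record one observed transition (s, a, r, s')."""
    transition_counts[(s, a)][s_next] += 1
    reward_sums[(s, a, s_next)] += r

def estimates(s, a):
    """Estimate T_hat(s' | s, a) and R_hat(s, a, s') from the recorded counts."""
    counts = transition_counts[(s, a)]
    total = sum(counts.values())
    T_hat = {s_next: c / total for s_next, c in counts.items()}
    R_hat = {s_next: reward_sums[(s, a, s_next)] / c for s_next, c in counts.items()}
    return T_hat, R_hat
```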
First, implement the valueIteration function in submission.py.
Then, in submission.py, implement ModelBasedMonteCarlo, which runs valueIteration every calcValIterEvery steps that the RL agent operates. The getAction method controls how we use the latest policy as determined by value iteration, and the incorporateFeedback method updates the transition and reward estimates, calling value iteration when needed. Implement the getAction and incorporateFeedback methods for ModelBasedMonteCarlo.
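For context, a training loop in the spirit of util.simulate alternates these two calls; roughly (a simplified sketch with illustrative names and signatures, not the actual util.simulate code):

```python
# Simplified sketch of one training episode (names and signatures are illustrative).
def run_episode(env_mdp, rl, max_steps=1000):
    state = env_mdp.startState()                 # hypothetical MDP interface
    for _ in range(max_steps):
        action = rl.getAction(state)             # act, e.g. epsilon-greedily
        next_state, reward, done = env_mdp.transition(state, action)  # hypothetical
        rl.incorporateFeedback(state, action, reward, next_state, done)
        if done:
            break
        state = next_state
```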
Run python3 train.py --agent value-iteration to train the agent using the model-based value iteration you implemented above and see the plot of reward per episode.
Comment on the plot, and discuss when model-based value iteration might perform poorly. If your plot is not converging, double check your implementation of incorporateFeedback.
You can also run python3 mountaincar.py --agent value-iteration to visually observe how the trained agent performs the task now.
In the previous question, we've seen how value iteration can take an MDP which describes the full dynamics of the game and return an optimal policy, and we've also seen how model-based value iteration with Monte Carlo simulation can estimate MDP dynamics if unknown at first and then learn the respective optimal policy. But suppose you are trying to control a complex system in the real world where trying to explicitly model all possible transitions and rewards is intractable. We will see how model-free reinforcement learning can nevertheless find the optimal policy.
Tabular Q-learning maintains a table with an entry for every possible (state, action) pair. We learn the Q-value for each of these pairs using the Q-learning update learned in class. In the TabularQLearning class, implement the getAction method, which selects an action based on explorationProb, and the incorporateFeedback method, which updates the Q-value given a (state, action) pair.
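For reference, on observing a transition $(s, a, r, s')$, the standard Q-learning update with step size $\eta$ and discount $\gamma$ is: $$\hat{Q}_{\text{opt}}(s, a) \leftarrow (1 - \eta)\,\hat{Q}_{\text{opt}}(s, a) + \eta\left[r + \gamma \max_{a' \in \text{Actions}(s')} \hat{Q}_{\text{opt}}(s', a')\right]$$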
The table below shows the linear combinations of the scaled state components $d_1x$ and $d_2y$ (with integer coefficients 0, 1, 2) used by the fourierFeatureExtractor that you will implement next; the Fourier features are the cosines of these combinations.

| | $0$ | $d_1x$ | $2d_1x$ |
|---|---|---|---|
| $0$ | $0$ | $d_1x$ | $2d_1x$ |
| $d_2y$ | $d_2y$ | $d_1x + d_2y$ | $2d_1x + d_2y$ |
| $2d_2y$ | $2d_2y$ | $d_1x + 2d_2y$ | $2d_1x + 2d_2y$ |
Implement fourierFeatureExtractor in submission.py. Looking at util.polynomialFeatureExtractor may be useful for familiarizing yourself with numpy.
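As a rough illustration (not the required implementation), the grid of linear combinations in the table above can be generated with numpy; the scaling constants and maximum coefficient below are made-up values:

```python
import numpy as np

# Hypothetical inputs: a 2-D state (x, y) and per-dimension scales d1, d2.
state = np.array([0.3, -0.01])        # e.g. (position, velocity)
scales = np.array([2.0, 30.0])        # d1, d2 (illustrative values)
max_coeff = 2                          # coefficients 0, 1, 2 per dimension

# All integer coefficient vectors (i, j) with 0 <= i, j <= max_coeff.
grids = np.meshgrid(*[np.arange(max_coeff + 1)] * len(state))
coeffs = np.array(grids).T.reshape(-1, len(state))

# Linear combinations i*d1*x + j*d2*y, matching the entries of the table above.
combos = coeffs @ (scales * state)

# Fourier features are cosines of these combinations.
features = np.cos(np.pi * combos)
print(features.shape)  # -> (9,)
```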
FunctionApproxQLearning approximates the Q-value of a (state, action) pair by multiplying the features extracted from this pair with weights. Unlike tabular Q-learning, in which Q-values are updated directly, with function approximation we update the weights associated with each feature. Using fourierFeatureExtractor from the previous part, complete the implementation of FunctionApproxQLearning.
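Concretely, with feature vector $\phi(s, a)$ and weight vector $w$, the approximation and the standard gradient-style update are: $$\hat{Q}_{\text{opt}}(s, a; w) = w \cdot \phi(s, a)$$ $$w \leftarrow w - \eta\left[\hat{Q}_{\text{opt}}(s, a; w) - \left(r + \gamma \max_{a'} \hat{Q}_{\text{opt}}(s', a'; w)\right)\right]\phi(s, a)$$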
Run python3 train.py --agent tabular and python3 train.py --agent function-approximation to see plots of how TabularQLearning and FunctionApproxQLearning perform, respectively, and comment on the plots. If your plot is not converging, double check your implementation of incorporateFeedback. You should expect to see that tabular Q-learning performs better than function approximation Q-learning on this task. What might be some of the reasons?
You can also run python3 mountaincar.py --agent tabular and python3 mountaincar.py --agent function-approximation to visualize the agents trained with the discrete and continuous MDPs, respectively.
Compare the performance of TabularQLearning and FunctionApproxQLearning, discussing possible reasons for the difference as well. Note that FunctionApproxQLearning can be better in various scenarios.
We learned about different state exploration policies for RL in order to get information about $(s, a)$ pairs. The method implemented in our MDP code is epsilon-greedy exploration, which balances both exploitation (choosing the action $a$ that maximizes $\hat{Q}_{\text{opt}}(s,a)$) and exploration (choosing an action $a$ at random). $$\pi_{\text{act}}(s) = \begin{cases} \arg\max_{a \in \text{Actions}(s)}\hat{Q}_{\text{opt}}(s,a) & \text{with probability } 1 - \epsilon \\ \text{random action from Actions}(s) & \text{with probability } \epsilon \end{cases}$$
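A minimal sketch of this rule (illustrative only, with a hypothetical q_value helper rather than the assignment's getAction):

```python
import random

def epsilon_greedy(state, actions, q_value, exploration_prob):
    """Pick an action epsilon-greedily; q_value(state, action) is a hypothetical helper."""
    if random.random() < exploration_prob:
        return random.choice(actions)                       # explore: uniformly random action
    return max(actions, key=lambda a: q_value(state, a))    # exploit: greedy action
```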
In real-life scenarios when safety is a concern, there might be constraints that need to be set in the state exploration phase. For example, robotic systems that interact with humans should not cause harm to humans during state exploration. Safe exploration in RL is thus a critical research question in the field of AI safety and human-AI interaction [2]. You can learn more about safe exploration in this week's ethics contents.
Safe exploration can be achieved via constrained RL. Constrained RL is a subfield of RL that focuses on optimizing agents' policies while adhering to predefined constraints, in order to ensure that certain unsafe behaviors are avoided during training and execution.
Assume there are harmful consequences for the driver of the Mountain Car if the car exceeds a certain velocity. One very simple approach to constrained RL is to restrict the set of potential actions that the agent can take at each step. We want to apply this approach to restrict the states that the agent can explore in order to prevent it from reaching unsafe speeds.
The Mountain Car environment has a self.max_speed parameter. Read through OpenAI Gym's Mountain Car implementation and explain how the self.max_speed parameter is used in mountain_car.py.
Next, effectively remove the max_speed constraint by running python3 train.py --agent function-approximation --max_speed=100000. We are changing max_speed from its original value of 0.07 to a very large float to approximate removing the constraint entirely. Notice that running the MDP unconstrained doesn't really change the output behavior. Explain in 1-2 sentences why this might be the case.
In ConstrainedQLearning, implement constraints on the set of actions the agent can take at each step such that $\text{velocity}_{t+1} < \text{velocity\_threshold}$.
Remember to handle the case where the set of valid actions is empty. Below is the equation that Mountain Car uses to calculate the velocity at time step $t+1$ (the same equation appears in the Mountain Car source code).
$$\text{velocity}_{t+1} = \text{velocity}_t + (\text{action} - 1) \cdot \text{force} - \cos(3 \cdot \text{position}_t) \cdot \text{gravity}$$
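For example, using the force and gravity constants from the Gymnasium implementation (force = 0.001, gravity = 0.0025; in practice read them from the environment rather than hard-coding), you can check each candidate action against a velocity threshold:

```python
import math

FORCE, GRAVITY = 0.001, 0.0025   # constants from Gymnasium's Mountain Car
VELOCITY_THRESHOLD = 0.065

def next_velocity(position, velocity, action):
    """Apply the Mountain Car velocity update above for one step."""
    return velocity + (action - 1) * FORCE - math.cos(3 * position) * GRAVITY

# Keep only actions whose resulting velocity stays below the threshold.
position, velocity = -0.5, 0.0645
safe_actions = [a for a in (0, 1, 2)
                if next_velocity(position, velocity, a) < VELOCITY_THRESHOLD]
print(safe_actions)  # -> [0, 1]: accelerating right would exceed the threshold here
```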
We've determined that in the Mountain Car environment, a velocity of 0.065 is considered unsafe. After implementing the velocity constraint, run grader test 5c-helper to examine the optimal policy for two continuous MDPs running Q-learning (one with a max_speed of 0.065 and the other with a very large max_speed constraint). Provide 1-2 sentences explaining why the output policies are now different.
Implement ConstrainedQLearning in submission.py, then run python3 grader.py 5c-helper. Include 1-2 sentences explaining the difference in the two scenarios.
The Belmont Report, a set of guidelines for conducting human-subjects research that arose after the Tuskegee Syphilis Study [3], lists three principles: Respect for Persons, Beneficence, and Justice [4]. In 2012, the US Department of Homeland Security applied these principles to describe how to conduct ethical information and communication technology research in the Menlo Report [6]. The principles are as follows:
[2] OpenAI. Benchmarking Safe Exploration in Deep Reinforcement Learning. 2019.
[3] The Centers for Disease Control and Prevention. The Untreated Syphilis Study at Tuskegee Timeline.
[4] The US Department of Health and Human Services. The Belmont Report. 1978.
[5] The World Medical Association. The Declaration of Helsinki. 1964.
[6] The US Department of Homeland Security. The Menlo Report. 2012.
[7] Moore. Efficient Memory-based Learning for Robot Control. 1990.