
The search algorithms explored in the previous assignment work great when you know exactly the results of your actions. Unfortunately, the real world is not so predictable. One of the key aspects of an effective AI is the ability to reason in the face of uncertainty.

Markov decision processes (MDPs) can be used to formalize uncertain situations. In this homework, you will implement algorithms to find the optimal policy in these situations. You will then formalize a modified version of Blackjack as an MDP, and apply your algorithm to find the optimal policy.

Problem 1: Value Iteration

In this problem, you will perform the value iteration updates manually on a very basic game just to solidify your intuitions about solving MDPs. The set of possible states in this game is $\mathcal{S} = \{-2, -1, 0, +1, +2\}$ and the set of possible actions is $\mathcal{A} = \{a_1, a_2\}$. The initial state is $0$ and there are two terminal states, $-2$ and $+2$. Recall that the transition function $\mathcal{T}: \mathcal{S} \times \mathcal{A} \rightarrow \Delta(\mathcal{S})$ encodes the probability of transitioning to a next state $s'$ after being in state $s$ and taking action $a$ as $\mathcal{T}(s'|s,a)$. In this MDP, the transition dynamics are given as follows:

$\forall i \in \{-1, 0, 1\} \subset \mathcal{S}$: $\quad \mathcal{T}(i-1 | i, a_1) = 0.8$, $\;\mathcal{T}(i+1 | i, a_1) = 0.2$, $\;\mathcal{T}(i-1 | i, a_2) = 0.7$, $\;\mathcal{T}(i+1 | i, a_2) = 0.3$.

Think of this MDP as a chain formed by states $\{-2, -1, 0, +1, +2\}$. In words, action $a_1$ has an 80% chance of moving the agent backwards in the chain and a 20% chance of moving the agent forward. Similarly, action $a_2$ has a 70% chance of sending the agent backwards and a 30% chance of moving the agent forward. We will use a discount factor $\gamma = 1$.
The reward function for this MDP is $\mathcal{R}(s,a,s') = \begin{cases} 20 & s' = -2 \\ 100 & s' = +2 \\ -5 & \text{otherwise} \end{cases}$

  1. Give the value of $V^\star_i(s)$ for each state in $\mathcal{S}$ after iterations $i \in \{0, 1, 2\}$ of Value Iteration. Recall that $\forall s \in \mathcal{S}$, $V^\star_0(s) = 0$ and, for any terminal state $s_\text{terminal}$, $V^\star(s_\text{terminal}) = 0$. In words, all values are initialized to $0$ at iteration 0, and terminal states (for which the optimal policy is not defined) always have a value of $0$.
  2. Using $V^\star_2(\cdot)$, what is the corresponding optimal policy $\pi^\star$ for all non-terminal states?
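The chain MDP above is small enough to encode directly if you want to sanity-check your hand computations. Below is a minimal sketch in Python; it is illustrative only, not part of any required submission, and the function and variable names are our own.

```python
# Minimal value-iteration sketch for the 5-state chain MDP above.
# States -2 and +2 are terminal (their values stay 0); gamma = 1; the reward
# depends only on the successor state s'.
REWARD = {-2: 20, 2: 100}            # R(s, a, s') = 20 if s' = -2, 100 if s' = +2, -5 otherwise
TRANSITIONS = {'a1': (0.8, 0.2),     # (P(move backwards), P(move forwards))
               'a2': (0.7, 0.3)}

def value_iteration(num_iters, gamma=1.0):
    V = {s: 0.0 for s in [-2, -1, 0, 1, 2]}      # V*_0(s) = 0 for all s
    for _ in range(num_iters):
        new_V = dict(V)
        for s in [-1, 0, 1]:                     # only non-terminal states get updated
            q_values = []
            for p_back, p_fwd in TRANSITIONS.values():
                q = (p_back * (REWARD.get(s - 1, -5) + gamma * V[s - 1]) +
                     p_fwd  * (REWARD.get(s + 1, -5) + gamma * V[s + 1]))
                q_values.append(q)
            new_V[s] = max(q_values)
        V = new_V
    return V

print(value_iteration(2))   # values after two iterations of Value Iteration
```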
Problem 2: General MDP Results

Equipped with an understanding of a basic algorithm for computing optimal value functions in MDPs, let's deepen our understanding of MDPs and prove a few general results.

In the parts that follow, the word "prove" means that we are expecting a formal, mathematical proof.

  1. We begin with an MDP $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{T}, \gamma \rangle$. To simplify things a bit, we will take the reward function to be a function on state-action pairs, rather than transitions ($\mathcal{R}:\mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$). Consequently, we can write the Bellman equation as $V^\pi(s) = \mathcal{R}(s, \pi(s)) + \gamma \sum\limits_{s' \in \mathcal{S}} \mathcal{T}(s' | s,\pi(s)) V^\pi(s')$, for any policy $\pi:\mathcal{S} \rightarrow \mathcal{A}$. Let's also assume that $\mathcal{M}$ has no terminal states (the agent keeps selecting actions forever!).

    Suppose we have an upper bound on the reward received at any timestep, $\max\limits_{s,a} \mathcal{R}(s,a) = R_\text{MAX}$. Prove that for any state $s \in \mathcal{S}$, $V^\pi(s) \leq \frac{R_\text{MAX}}{(1-\gamma)}$.

    Hint: Recall that $V^\pi(s)$ is the expected utility obtained by following policy $\pi$ starting from state $s$, where the utility is the discounted sum of rewards: $u = r_1 + \gamma r_2 + \gamma^2 r_3 + \ldots$ and so on. Remember that $\mathcal{M}$ never terminates, so this will be an infinite sum. Once you apply the upper bound on rewards, what pattern do you see? What do you know about this kind of mathematical expression that allows you to eliminate the sum over all timesteps? Remember that $\gamma \in [0,1)$; a standard identity that may help is recalled at the end of this problem.

  2. Keeping the setup from the previous part, now assume that you also have a lower bound of 0 on rewards: $0 \leq \mathcal{R}(s,a) \leq R_\text{MAX}, \forall s \in \mathcal{S},a \in \mathcal{A}$. Provide a simple modified reward function $\hat{\mathcal{R}}(s,a) = f(\mathcal{R}(s,a))$ and prove that the corresponding MDP is guaranteed to have $0 \leq V^\pi(s) \leq 1$, for any policy $\pi$. Finally, briefly explain (you should need at most two sentences) the relationship between the optimal policy of this new MDP and that of $\mathcal{M}$. Are they the same? Why or why not?

    Hint: For the second question, start by thinking about an MDP that always terminates after the agent takes exactly one step. Then, convince yourself that this answer holds for an agent acting for arbitrarily many steps.

  3. In the previous part, you made a modification to the reward function and assessed its effect on the optimal policy of the original MDP. Let's do this again for a different kind of reward function manipulation and actually prove that it preserves the optimal policy. Just as in the first problem, it will be easier for us to work with a reward function operating on transitions, $\mathcal{R}:\mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow \mathbb{R}$.

    We will start with an initial MDP $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{T}, \gamma \rangle$ but would like to actually solve an MDP with an augmented reward function, $\mathcal{M'} = \langle \mathcal{S}, \mathcal{A}, \mathcal{R'}, \mathcal{T}, \gamma \rangle$, where $\mathcal{R'}(s,a,s') = \mathcal{R}(s,a,s') + \mathcal{F}(s,a,s')$. Think of a scenario where $\mathcal{R}$ produces values of 0 for most transitions; a bonus reward function $\mathcal{F}:\mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow \mathbb{R}$ that produces non-zero values could provide more immediate feedback and help our agent learn faster. In this problem, we will focus on a particular type of reward bonus, $\mathcal{F}(s,a,s') = \gamma\phi(s') - \phi(s)$, for some arbitrary function $\phi:\mathcal{S} \rightarrow \mathbb{R}$.

    First prove that $Q^\star_\mathcal{M}(s,a) - \phi(s) = Q^\star_\mathcal{M'}(s,a)$ and then use this fact to conclude that $\pi^\star_\mathcal{M'}(s) = \pi^\star_\mathcal{M}(s), \forall s \in \mathcal{S}$.

    Hint: Start by using $Q^\star_\mathcal{M}(s,a)$, the Bellman optimality equation for $\mathcal{M}$, to expand the LHS (left hand side) of the first claim. Notice that $a - b = a + c - c - b$ for arbitrary values $a, b, c$. Make an $a-b$ expression in what you have. What value $c$ could you insert that allows you to incorporate $\mathcal{F}(s,a,s')$? Relate the resulting Bellman equation back to $\mathcal{M'}$. Finish by recalling how to express the optimal policy $\pi^\star_\mathcal{M'}(s)$ in terms of the optimal action-value function $Q^\star_\mathcal{M'}(s,a)$.
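For reference, the hint in part 1 alludes to a standard fact about geometric series, recalled here without proof (this is background, not part of what you must prove):

$\sum\limits_{t=0}^{\infty} \gamma^t = \frac{1}{1-\gamma}, \quad \text{for any } \gamma \in [0, 1).$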

Problem 3: Peeking Blackjack

Now that we have gotten a bit of practice with general-purpose MDP algorithms, let's use them to play (a modified version of) Blackjack. For this problem, you will be creating an MDP to describe states, actions, and rewards in this game. More specifically, after reading through the description of the state representation and actions of our Blackjack game below, you will implement the transition and reward function of the Blackjack MDP inside succAndProbReward().

For our version of Blackjack, the deck can contain an arbitrary collection of cards with different face values. At the start of the game, the deck contains the same number of cards of each face value; we call this number the 'multiplicity'. For example, a standard deck of 52 cards would have face values $[1, 2, \ldots, 13]$ and multiplicity 4. You could also have a deck with face values $[1,5,20]$; if we used multiplicity 10 in this case, there would be 30 cards in total (10 each of 1s, 5s, and 20s). The deck is shuffled, meaning that each permutation of the cards is equally likely.

The game occurs in a sequence of rounds. In each round, the player has three actions available to her: she can take the top card of the deck (adding its face value to her hand), peek at the top card (paying a peek cost to see it without taking it), or quit the game.

In this problem, your state $s$ will be represented as a 3-element tuple:

(totalCardValueInHand, nextCardIndexIfPeeked, deckCardCounts)
As an example, assume the deck has card values $[1, 2, 3]$ with multiplicity 1, and the threshold is 4. Initially, the player has no cards, so her total is 0; this corresponds to state (0, None, (1, 1, 1)).
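To make the state representation concrete, here is a small illustrative sketch of the successor states and their probabilities when the player takes a card from the initial state above. This is the kind of transition information succAndProbReward() will need to encode, but it is written under our own assumptions and is not the starter code's interface; rewards, peeking, busting, and termination are ignored here.

```python
# Illustrative only (not the required succAndProbReward() implementation):
# successors if the player takes a card from (0, None, (1, 1, 1)) with
# deck values [1, 2, 3] and multiplicity 1. Each remaining card is equally
# likely to be on top of the shuffled deck.
card_values = [1, 2, 3]
state = (0, None, (1, 1, 1))   # (totalCardValueInHand, nextCardIndexIfPeeked, deckCardCounts)

total, peeked_index, counts = state
num_cards = sum(counts)
successors = []
for i, value in enumerate(card_values):
    if counts[i] == 0:
        continue
    prob = counts[i] / num_cards          # 1/3 for each remaining card value here
    new_counts = list(counts)
    new_counts[i] -= 1
    successors.append(((total + value, None, tuple(new_counts)), prob))

# successors == [((1, None, (0, 1, 1)), 1/3), ((2, None, (1, 0, 1)), 1/3),
#                ((3, None, (1, 1, 0)), 1/3)]
```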

The game continues until one of the following termination conditions becomes true: the player quits, the player goes bust (her total exceeds the threshold), or the deck runs out of cards.

As another example with our deck of $[1,2,3]$ and multiplicity 1, let's say the player's current state is (3, None, (1, 1, 0)), and the threshold remains 4.
  1. Implement the game of Blackjack as an MDP by filling out the succAndProbReward() function of class BlackjackMDP.
  2. Let's say you're running a casino, and you're trying to design a deck to make people peek a lot. Assuming a fixed threshold of 20 and a peek cost of 1, design a deck where, for at least 10% of states, the optimal policy is to peek. Fill out the function peekingMDP() to return an instance of BlackjackMDP where the optimal action is to peek in at least 10% of states. Hint: Before randomly assigning values, think about the case in which you would really want to peek instead of blindly taking a card.
Problem 4: Learning to Play Blackjack

So far, we've seen how MDP algorithms can take an MDP which describes the full dynamics of the game and return an optimal policy. But suppose you go into a casino, and no one tells you the rewards or the transitions. We will see how reinforcement learning can allow you to play the game and learn its rules & strategy at the same time!

  1. You will first implement a generic Q-learning algorithm QLearningAlgorithm, which is an instance of an RLAlgorithm. As discussed in class, reinforcement learning algorithms are capable of executing a policy while simultaneously improving that policy. Look at simulate() in util.py to see how the RLAlgorithm will be used. In short, your QLearningAlgorithm will be run in a simulation of the MDP and will alternately be asked for an action to perform in a given state (QLearningAlgorithm.getAction), then be informed of the result of that action (QLearningAlgorithm.incorporateFeedback), so that it may learn better actions to perform in the future.

    We are using Q-learning with function approximation, which means $\hat{Q}^\star(s, a) = \mathbf{w} \cdot \phi(s, a)$, where, in code, $\mathbf{w}$ is self.weights, $\phi$ is the featureExtractor function, and $\hat{Q}^\star$ is self.getQ.

    We have implemented QLearningAlgorithm.getAction as a simple $\epsilon$-greedy policy. Your job is to implement QLearningAlgorithm.incorporateFeedback(), which should take an $(s, a, r, s')$ tuple and update self.weights according to the standard Q-learning update (a generic sketch of this update appears at the end of this section).

  2. Now let's apply Q-learning to an MDP and see how well it performs in comparison with value iteration. First, call simulate using your Q-learning code and the identityFeatureExtractor() on the MDP smallMDP (defined for you in submission.py), with 30000 trials. How does the Q-learning policy compare with a policy learned by value iteration (i.e., for how many states do they produce a different action)? (Don't forget to set the explorationProb of your Q-learning algorithm to 0 after learning the policy.) Now run simulate() on largeMDP, again with 30000 trials. How does the policy learned in this case compare to the policy learned by value iteration? What went wrong?
  3. To address the problems explored in the previous exercise, let's incorporate some domain knowledge to improve generalization. This way, the algorithm can use what it has learned about some states to improve its prediction performance on other states. Implement blackjackFeatureExtractor as described in the code comments. Using this feature extractor, you should be able to get pretty close to the optimum on the largeMDP.
  4. Sometimes, we might reasonably wonder how an optimal policy learned for one MDP might perform if applied to another MDP with similar structure but slightly different characteristics. For example, imagine that you created an MDP to choose an optimal strategy for playing "traditional" blackjack, with a standard card deck and a threshold of 21. You're living it up in Vegas every weekend, but the casinos get wise to your approach and decide to make a change to the game to disrupt your strategy: going forward, the threshold for the blackjack tables is 17 instead of 21. If you continued playing the modified game with your original policy, how well would you do? (This is just a hypothetical example; we won't look specifically at the blackjack game in this problem.)

    To explore this scenario, let's take a brief look at how a policy learned using value iteration responds to a change in the rules of the MDP. For all subsequent parts, make sure to use 30,000 trials.
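For reference, here is a generic sketch of the Q-learning update with linear function approximation. It is written independently of the starter code; the function names, the sparse-feature representation, and the step size eta are illustrative assumptions, not the assignment's exact interface.

```python
# Generic Q-learning update with linear function approximation (illustrative).
# w: dict mapping feature keys to weights.
# phi(s, a): returns a sparse feature vector as a list of (featureKey, value) pairs.
def q_hat(w, phi, s, a):
    return sum(w.get(f, 0.0) * v for f, v in phi(s, a))

def q_learning_update(w, phi, s, a, r, s_prime, actions, gamma, eta):
    # Estimate of the successor state's value: best Q over available actions,
    # or 0 if s_prime is terminal (no available actions).
    v_prime = max((q_hat(w, phi, s_prime, a2) for a2 in actions(s_prime)), default=0.0)
    target = r + gamma * v_prime
    prediction = q_hat(w, phi, s, a)
    # Move each active weight a small step toward reducing (prediction - target)^2.
    for f, v in phi(s, a):
        w[f] = w.get(f, 0.0) - eta * (prediction - target) * v
```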