
Reward Design

Rewards and their design affect what the agent learns. A well-designed reward system lets the agent learn what we actually want, instead of settling into unexpected behaviours.

Rewards

Goals and purposes can be thought of as the maximization of the expected value of the cumulative sum of a received scalar signal.

- Reinforcement Learning: An Introduction [Sutton and Barto]

Let the sequence of rewards at and after time step $t$ be $R_t, R_{t+1}, R_{t+2}, \dots, R_T$, where $T$ is the final time step.

The cumulative reward is called the return.

$$\large G_t = R_t + R_{t+1} + \dots + R_T$$
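
As a concrete illustration (not part of the original notes), a minimal Python sketch that computes this return for a hypothetical finite reward sequence:

```python
# Minimal sketch: undiscounted return of a finite episode.
rewards = [1.0, 0.0, 2.0, -1.0, 3.0]  # hypothetical R_t, R_{t+1}, ..., R_T

def undiscounted_return(rewards):
    """G_t = R_t + R_{t+1} + ... + R_T"""
    return sum(rewards)

print(undiscounted_return(rewards))  # 5.0
```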

Faulty Reward Functions

Data Centre Cooling - Infinite Rewards

  • States: Temperature measurements
  • Actions: Fan Speeds
  • Rewards:
    • R = 0 for each second the temperature exceeds the threshold
    • R = +1 for each second the system is cool

$$\large G_t = 1 + 1 + 0 + 0 + 1 + 1 + 1 + 0 + 1 + \dots = \sum\limits_{t=1}^{\infty} R_t = \infty$$

Even if the agent behaves non-optimally much of the time, the return is still infinite. This is a flawed reward design: an agent maximising this return has no pressure to keep the system cool consistently.
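
A small hypothetical simulation makes the divergence visible; the 30% overheating probability below is an arbitrary assumption, not part of the example:

```python
import random

random.seed(0)
G = 0.0  # running (undiscounted) return
for t in range(1, 1_000_001):
    # Assumed 30% chance per second that the temperature exceeds the threshold.
    reward = 0.0 if random.random() < 0.3 else 1.0
    G += reward
    if t in (10, 1_000, 1_000_000):
        print(f"seconds={t:>9}  return so far={G:.0f}")
# The return grows roughly linearly with time, so over an unbounded
# horizon it diverges no matter how often the system overheats.
```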


Cleaning Robot

  • States: Dust Sensors
  • Actions: Cleaning
  • Rewards:
    • R = 10 for cleaning a small room that takes 5 minutes
    • R = 100 for cleaning a large hall that takes 2 hours
  • Episode ends each day.

Here, rewards are finite, but the system is still flawed. If the robot just cleans the small room over and over for the whole two hours, it obtains a return of 24 × 10 = 240, which is much greater than the 100 it gets for cleaning the large hall.

So the robot learns to stick to cleaning only the small room.
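
A quick back-of-the-envelope check, using only the numbers from the example above, shows why:

```python
# Rewards and durations taken from the example above.
SMALL_ROOM_REWARD, SMALL_ROOM_MINUTES = 10, 5
LARGE_HALL_REWARD, LARGE_HALL_MINUTES = 100, 120

# Return from repeatedly cleaning the small room for the same two hours.
cleanings = LARGE_HALL_MINUTES // SMALL_ROOM_MINUTES  # 24 cleanings
print(cleanings * SMALL_ROOM_REWARD)                  # 240
print(LARGE_HALL_REWARD)                              # 100
# A return-maximising robot prefers the small room and never cleans the hall.
```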


CoastRunners - OpenAI

Source: Faulty Reward Functions in the Wild - OpenAI Blog

This is a real situation encountered by researchers at OpenAI, highlighting how a flawed reward design can lead to unexpected behaviour.

The game is CoastRunners, where the goal is to finish the boat race quickly and (preferably) ahead of other players. The player can also earn points by hitting targets laid out along the route.

It turned out that the targets were laid out in such a way that the reinforcement learning agent could gain a high score without having to finish the course.

The RL agent finds an isolated lagoon where it can turn in a large circle and repeatedly knock over three targets, timing its movement so as to always knock over the targets just as they repopulate.

The agent manages to achieve a higher score using this strategy than is possible by completing the course in the normal way.

This kind of behavior points to a more general issue with reinforcement learning: it is often difficult or infeasible to capture exactly what we want an agent to do, and as a result we frequently end up using imperfect but easily measured proxies. Often this works well, but sometimes it leads to undesired or even dangerous actions.


Discounting Rewards

We can get rid of the infinite returns by discounting.

The reward received $i$ steps in the future is discounted by a factor of $\gamma^i$.

Take $0 \leq \gamma \leq 1$ (with $\gamma < 1$ for infinite-horizon tasks), where $\gamma$ is the discount factor.

$$G_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \dots = \sum\limits_{i=0}^{\infty} \gamma^i R_{t+i}$$

This discounting reduces the significance of rewards farther in the future.
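
As a sketch, the same return computation with discounting; the reward sequences and $\gamma = 0.9$ below are arbitrary choices for illustration:

```python
def discounted_return(rewards, gamma=0.9):
    """G_t = sum_i gamma**i * R_{t+i} over a (possibly truncated) reward sequence."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

# The later a reward arrives, the less it contributes to the return.
print(discounted_return([1.0, 1.0, 1.0]))    # 1 + 0.9 + 0.81 = 2.71
print(discounted_return([0.0, 0.0, 10.0]))   # 10 * 0.9**2 = 8.1
```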

Maximal Return for R = +1

Discounting makes the return finite even over an infinite horizon. With $R = +1$ at every time step:

$$\large G_0 = \sum\limits_{k = 0}^{\infty} \gamma^k = \frac{1}{1-\gamma}$$
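
A quick numerical check (illustrative only, with $\gamma = 0.9$) that the truncated sum approaches the closed-form limit $\frac{1}{1-\gamma}$:

```python
gamma = 0.9
partial = sum(gamma ** k for k in range(1_000))  # truncate the infinite sum at 1000 terms
print(partial)           # ~10.0, already indistinguishable from the limit
print(1 / (1 - gamma))   # ~10.0, the closed form 1 / (1 - gamma)
```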

Mathematical Convenience
$$\Large G_t = R_t + \gamma (R_{t+1} + \gamma R_{t+2} + \dots) \implies \boxed{G_t = R_t + \gamma\, G_{t+1}}$$
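
This recursion is what lets returns be computed step by step. A tiny sanity check with made-up rewards, confirming that the direct and recursive forms agree:

```python
gamma = 0.9
rewards = [2.0, 0.0, 1.0, 3.0]  # made-up R_t, R_{t+1}, R_{t+2}, R_{t+3}

def G(i):
    """Discounted return from step i, straight from the definition."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards[i:]))

# G_t = R_t + gamma * G_{t+1}: both lines print the same value (~4.997).
print(G(0))
print(rewards[0] + gamma * G(1))
```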